Yingjie4Science / SDGdetector

A novel R package that can identify and visualize 17 Sustainable Development Goals and associated 169 Targets in text
GNU General Public License v3.0

JOSS Review - csaybar #4

Closed csaybar closed 1 year ago

csaybar commented 1 year ago

Hi @Yingjie4Science,

I'm having a lot of trouble reviewing your package and paper. Sorry, it may be my fault (it's my first time as a JOSS reviewer 😬). The JOSS website lists Version: v0.1.0 (https://github.com/openjournals/joss-reviews/issues/5124), but that version does not have the functionality you describe in the paper. It looks like a very early version of your package.

The latest version doesn't install on my computer.

remotes::install_github("Yingjie4Science/SDGdetector")
Warning in file(filename, "r", encoding = encoding) :
  cannot open file './Code/function_lookaround_nearby_n.R': No such file or directory
Error in file(filename, "r", encoding = encoding) : 
  cannot open the connection
ERROR: lazydata failed for package ‘SDGdetector’

You should check this R file: https://github.com/Yingjie4Science/SDGdetector/blob/ed5358ba7456f5504cc106aa973cbfedcad2ab5e/data/helper_SDG_search_terms.R#L92
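
The `file(filename, "r", encoding = encoding)` call in the log appears to come from `source()`, which resolves a relative path against the current working directory. The following is a minimal sketch of that behaviour in base R, not the package's actual code: the directory layout, `helper.R`, and `can_source_from()` are hypothetical names used only for illustration.

```r
# Hypothetical layout for illustration: <demo_root>/Code/helper.R
demo_root <- file.path(tempdir(), "pkg_demo")
dir.create(file.path(demo_root, "Code"), recursive = TRUE, showWarnings = FALSE)
writeLines("helper <- function() 'ok'", file.path(demo_root, "Code", "helper.R"))

# Try to source the same relative path from a given working directory.
can_source_from <- function(wd) {
  old_wd <- setwd(wd)
  on.exit(setwd(old_wd), add = TRUE)
  tryCatch(
    {
      source("./Code/helper.R", local = TRUE)
      TRUE
    },
    warning = function(w) FALSE,
    error = function(e) FALSE
  )
}

print(can_source_from(demo_root))  # working directory contains Code/helper.R
print(can_source_from(tempdir()))  # same relative path, different directory
```

If this is indeed the cause, paths that do not depend on the working directory (for example, code placed under `R/`, or files located with `system.file()`) are one common way to avoid this class of install failure.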

I would really appreciate a new release of SDGdetector with the code to review.

Best!

csaybar commented 1 year ago

Hi, @Yingjie4Science can you create a new release of SDGdetector to start the review? Thanks!

Yingjie4Science commented 1 year ago

Yes! A new release has been created. Thank you!

csaybar commented 1 year ago

Hello @Yingjie4Science, I hope this message finds you well. Please find my comments on SDGdetector below:

General comments:

SDGdetector is a tool that helps to classify whether a paragraph is related to the United Nations' Sustainable Development Goals (SDGs). This is essentially a text classification problem with 18 classes (17 SDGs and one additional class for texts that do not match any of them).

The authors have created a hashmap with 557 regular expressions, each of them associated with a specific SDG. For example, the first regular expression is associated with the SDG "No Poverty".

(sdg|goal)[^0-9]{0,2}(?=1\\b)|No Poverty
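
A quick way to see what this pattern accepts is to run it through base R's `grepl()` with `perl = TRUE` (lookaheads need the PCRE engine). This is only a sketch: the pattern is copied from the line above, the test strings are made up, and how SDGdetector applies the pattern internally (for example, its case handling) is not shown in this thread.

```r
# Pattern for SDG 1 as quoted above; "\\b" in an R string is the regex \b.
sdg1_pattern <- "(sdg|goal)[^0-9]{0,2}(?=1\\b)|No Poverty"

# Hypothetical inputs, chosen for illustration only.
candidates <- c("sdg1", "goal1", "sdg:1", "sdg-1", "goal:1",
                "No Poverty", "sdg10", "goal 12")

# perl = TRUE is required because the pattern uses a lookahead.
hits <- grepl(sdg1_pattern, candidates, perl = TRUE)
names(hits) <- candidates
print(hits)
```

The lookahead `(?=1\\b)` requires the digit `1` to be followed by a word boundary, so `sdg10` and `goal 12` are rejected while the first six strings are accepted.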

With this regex, the authors can directly detect strings such as sdg1, goal1, sdg:1, sdg-1, goal:1, and No Poverty. Lookahead assertions serve as the foundation for almost all of the regexes that the authors propose. However, one limitation of this tool lies in how it detects "indirect" (referred to this way by the authors in the code) relationships. For instance, in regular expression 213:

^(?=.*(?:north-south divide\\S*|financial flow.?|resource flow.?|foreign direct investment.?|\\bFDI\\b|\\bODA\\b))(?=.*(?:((Aruba|Afghanistan|Angola|Anguilla|Albania|United Arab Emirates|Argentina|Armenia|American Samoa|French Southern Territories|Antigua and Barbuda|Azerbaijan|Burundi|Benin|Burkina Faso|Bangladesh|Bahrain|Bahamas|Bosnia and Herzegovina|Belarus|Belize|Bolivia|Brazil|Barbados|Brunei Darussalam|Bhutan|Botswana|Central African Republic|Chile|China|C.?te d.?Ivoire|Cameroon|Congo\\b|Cook Islands|Colombia|Comoros|Cabo Verde|Costa Rica|\\bCuba\\b|Curaçao|Cayman Islands|Cyprus|Cyprus|Djibouti|Dominica|Dominican Republic|Algeria|Ecuador|Egypt|Eritrea|Western Sahara|Ethiopia|Fiji|Micronesia|Gabon|Georgia|Ghana|Guinea|Gambia|Guinea.?Bissau|Equatorial Guinea|Grenada|Guatemala|Guam|Guyana|Heard Island and McDonald Islands|Honduras|Haiti|Indonesia|\\bIndia\\b|\\bIran\\b|\\bIraq\\b|Jamaica|Jordan|Kazakhstan|Kenya|Kyrgyzstan|Cambodia|Kiribati|Saint Kitts and Nevis|Kuwait|\\bLao.?\\b|Lebanon|Liberia|Libya|Saint Lucia|Sri Lanka|Lesotho|Macao|Saint Martin|Morocco|Moldova|Madagascar|Maldives|Mexico|Marshall Islands|North Macedonia|\\bMali\\b|Myanmar|Montenegro|Mongolia|Northern Mariana Islands|Mozambique|Mauritania|Montserrat|Mauritius|Malawi|Malaysia|Namibia|New Caledonia|Niger|Norfolk Island|Nigeria|Nicaragua|Niue|Nepal|Nauru|Oman|Pakistan|Panama|Pitcairn|Peru|Philippines|Palau|Papua New Guinea|Puerto Rico|Paraguay|Palestine|French Polynesia|Qatar|Russia|Rwanda|Sudan|Senegal|Singapore|South Georgia and the South Sandwich Islands|Saint Helena|Solomon Islands|Sierra Leone|El Salvador|Somalia|Serbia|South Sudan|Sao Tome and Principe|Suriname|Eswatini|Sint Maarten|Seychelles|Syrian|Turks and Caicos Islands|Chad|Togo|Thailand|Tajikistan|Turkmenistan|Timor.?Leste|Tonga|Trinidad and Tobago|Tunisia|Turkey|Tanzania|Uganda|Ukraine|Uruguay|Uzbekistan|Saint Vincent and the Grenadines|Venezuela|Virgin Islands|Viet Nam|Vanuatu|Wallis and Futuna|Samoa|Yemen|South Africa|Zambia|Zimbabwe|Somaliland|Kosovo|Ashmore|Cartier|Siachen Glacier|North Korea|\\bABW\\b|\\bAFG\\b|\\bAGO\\b|\\bAIA\\b|\\bALB\\b|\\bARG\\b|\\bARM\\b|\\bASM\\b|\\bATF\\b|\\bATG\\b|\\bAZE\\b|\\bBDI\\b|\\bBEN\\b|\\bBFA\\b|\\bBGD\\b|\\bBHR\\b|\\bBHS\\b|\\bBIH\\b|\\bBLR\\b|\\bBLZ\\b|\\bBOL\\b|\\bBRA\\b|\\bBRB\\b|\\bBRN\\b|\\bBTN\\b|\\bBWA\\b|\\bCAF\\b|\\bCHL\\b|\\bCHN\\b|\\bCIV\\b|\\bCMR\\b|\\bCOD\\b|\\bCOG\\b|\\bCOK\\b|\\bCOL\\b|\\bCOM\\b|\\bCPV\\b|\\bCRI\\b|\\bCUB\\b|\\bCUW\\b|\\bCYM\\b|\\bCYP\\b|\\bCYP\\b|\\bDJI\\b|\\bDMA\\b|\\bDOM\\b|\\bDZA\\b|\\bECU\\b|\\bEGY\\b|\\bERI\\b|\\bESH\\b|\\bETH\\b|\\bFJI\\b|\\bFSM\\b|\\bGAB\\b|\\bGEO\\b|\\bGHA\\b|\\bGIN\\b|\\bGMB\\b|\\bGNB\\b|\\bGNQ\\b|\\bGRD\\b|\\bGTM\\b|\\bGUM\\b|\\bGUY\\b|\\bHMD\\b|\\bHND\\b|\\bHTI\\b|\\bIDN\\b|\\bIND\\b|\\bIRN\\b|\\bIRQ\\b|\\bJAM\\b|\\bJOR\\b|\\bKAZ\\b|\\bKEN\\b|\\bKGZ\\b|\\bKHM\\b|\\bKIR\\b|\\bKNA\\b|\\bKOR\\b|\\bKWT\\b|\\bLAO\\b|\\bLBN\\b|\\bLBR\\b|\\bLBY\\b|\\bLCA\\b|\\bLKA\\b|\\bLSO\\b|\\bMAF\\b|\\bMDA\\b|\\bMDG\\b|\\bMDV\\b|\\bMEX\\b|\\bMHL\\b|\\bMKD\\b|\\bMLI\\b|\\bMMR\\b|\\bMNE\\b|\\bMNG\\b|\\bMNP\\b|\\bMOZ\\b|\\bMRT\\b|\\bMSR\\b|\\bMUS\\b|\\bMWI\\b|\\bMYS\\b|\\bNAM\\b|\\bNCL\\b|\\bNER\\b|\\bNFK\\b|\\bNGA\\b|\\bNIC\\b|\\bNIU\\b|\\bNPL\\b|\\bNRU\\b|\\bOMN\\b|\\bPAK\\b|\\bPAN\\b|\\bPCN\\b|\\bPHL\\b|\\bPLW\\b|\\bPNG\\b|\\bPRK\\b|\\bPRY\\b|\\bPSE\\b|\\bPYF\\b|\\bQAT\\b|\\bRUS\\b|\\bRWA\\b|\\bSDN\\b|\\bSEN\\b|\\bSGP\\b|\\bSGS\\b|\\bSHN\\b|\\bSLB\\b|\\bSLE\\b|\\bSLV\\b|\\bSOM\\b|\\bSRB\\b|\\bSSD\\b|\\bSTP\\b|\\bSUR\\b|\\bSWZ\\b|\\bSXM\\b|\\bSYC\\b|\\bSYR\\b|\\bTCA\\b|\\bTCD\\b|\\bTGO\\b|\\bTHA\\b|\\bTJK\\b|\\bTKM\\b|\\bTLS\\b|\\bTON\\b|\\bTTO\\b|\\bTUN\\b|\\bTUR\\b|\\bTZA\\b|\\bUGA\\b|\\bUKR\\b|\\bURY\\b|\\bUZB\\b|\\bVCT\\b|\\bVEN\\b|\\bVIR\\b|\\bVNM\\b|\\bVUT\\b|\\bWLF\\b|\\bWSM\\b|\\bYEM\\b|\\bZAF\\b|\\bZMB\\b|\\bZWE\\b|global south|Third World|Pacific Alliance))|((?=.*(?:developing|least.?develop\\S*|less.?develop\\S*|underdevel\\S*|\\bpoor|low.?income|lower.?income|small island\\S*|africa\\S*|\\bimpover\\S*|\\bpover\\S*|\\bemergent\\S*))(?=.*(?:countr\\S*|\\bnation.?|\\bstate.?\\b))))).+

As this regular expression of 3700 characters shows, positive lookahead assertions are overused, which may be slow when dealing with long sentences. Additionally, the current implementation is easily fooled into yielding false positives. For example:

library(SDGdetector)
text <- 'The food from the fdi institute (India) is not good'
SDGdetector(text)
# SDG10_b, SDG17_3
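
The mechanism behind this false positive can be reproduced without the package. The sketch below uses a cut-down stand-in for pattern 213 (the country list is shortened to three entries) and assumes case-insensitive matching, since lower-case "fdi" triggered a hit above. `toy_pattern` and the extra test sentences are illustrative, not taken from SDGdetector.

```r
# Cut-down stand-in for pattern 213: one lookahead for a finance term and one
# for a country name. Each lookahead scans the whole string independently.
toy_pattern <- paste0(
  "^(?=.*(?:foreign direct investment.?|\\bFDI\\b|\\bODA\\b))",
  "(?=.*(?:\\bIndia\\b|Brazil|Kenya))",
  ".+"
)

texts <- c(
  "The food from the fdi institute (India) is not good",
  "FDI inflows to India rose sharply",
  "The food in India is good"
)

# Assumption: matching is case-insensitive, as the report above suggests.
matched <- grepl(toy_pattern, texts, perl = TRUE, ignore.case = TRUE)
print(data.frame(text = texts, matched = matched))
```

Because each `(?=.*...)` only checks that a term occurs somewhere in the string, the first sentence matches just as the second does, even though its two terms are unrelated.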

The authors should provide some kind of metric that shows the efficiency of this tool compared to other text classification models. I am quite sure that fine-tuning pre-trained lightweight models such as nbroad/ESG-BERT could provide better results in finding "indirect" relationships with less computing time.
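
The timing concern can be checked empirically. The following is a rough sketch using base R's `system.time()` on a shortened stand-in for a lookahead-based pattern; `toy_pattern`, the filler text, and `time_one()` are hypothetical, and the numbers will vary by machine and say nothing definitive about the full 557-pattern database.

```r
# Shortened, hypothetical stand-in for a lookahead-based pattern.
toy_pattern <- paste0(
  "^(?=.*(?:foreign direct investment.?|\\bFDI\\b|\\bODA\\b))",
  "(?=.*(?:\\bIndia\\b|Brazil|Kenya))",
  ".+"
)

# Inputs of different lengths; the filler contains none of the search terms.
filler <- paste(rep("lorem ipsum dolor sit amet", 500), collapse = " ")
inputs <- c(
  short     = "FDI inflows to India rose sharply",
  long_hit  = paste(filler, "FDI inflows to India rose sharply"),
  long_miss = filler
)

# Average elapsed seconds per grepl() call over a few repetitions.
time_one <- function(text, reps = 50) {
  elapsed <- system.time(
    for (i in seq_len(reps)) grepl(toy_pattern, text, perl = TRUE)
  )[["elapsed"]]
  elapsed / reps
}

seconds_per_call <- vapply(inputs, time_one, numeric(1))
print(seconds_per_call)
```

Running the same loop over the package's real patterns and over a model-based classifier would give the kind of comparison requested above.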

Coding comments:

Yingjie4Science commented 1 year ago

Hi @csaybar,

Thank you so much for your very comprehensive and constructive comments!

You are right that one of the key features of this package is to address a text classification problem. In addition to classification at the Goal level (i.e., 17 SDGs + 1 not-SDG class), I want to highlight that our package also classifies at the (sustainable development) Target level (i.e., 169 specific targets + not-targets). This is a major advance over existing tools that can only classify at the goal level. It matters because the 17 goals are rather broad and vague, and scientific communities and policymakers are often more interested in the more specific targets under the UN SDG framework.

Thanks for sharing the BERT model nbroad/ESG-BERT. This is a promising direction that we would like to pursue and hope to include in the near future. We actually tried BERT, but the accuracy at the target level was rather low, partly because training a classifier on > 169 categories requires a huge number of training samples. Another reason we decided to proceed with the current regex-based approach is that it is more flexible, traceable, and adaptable (e.g., our team is developing a similar package in Python based on this database). As for the drawbacks, as you have noted, this can be slow. We therefore suggest in the paper that users split long sentences into short clauses, and we also added a warning message in the function when the input text exceeds 750 characters.
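
As an illustration of the splitting advice, here is a minimal sketch in base R. `split_into_clauses()` is a hypothetical helper, not a function from SDGdetector; the split rule (whitespace after `.`, `;`, `!`, or `?`) and the warning text are assumptions, and only the 750-character threshold comes from the reply above.

```r
# Hypothetical helper, not part of SDGdetector: split text into clauses at
# whitespace that follows sentence-ending punctuation or a semicolon.
split_into_clauses <- function(text, max_chars = 750) {
  if (nchar(text) > max_chars) {
    warning("Input exceeds ", max_chars, " characters; splitting into clauses.")
  }
  clauses <- unlist(strsplit(text, "(?<=[.;!?])\\s+", perl = TRUE))
  clauses[nzchar(clauses)]
}

example_text <- "Poverty fell in 2020. Hunger rose; water access improved!"
clauses <- split_into_clauses(example_text)
print(clauses)

# Each clause could then be passed to the detector one at a time, e.g.
# lapply(clauses, SDGdetector)   # requires the package to be installed
```

Shorter inputs keep each `.*` scan small, at the cost of missing matches whose terms fall in different clauses.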

We included a section on the efficiency of this tool in the paper and also on the GitHub page (https://github.com/Yingjie4Science/SDGdetector#accuracy-evaluation). Briefly, this package's accuracy is above 75%, measured by the alignment between the R package results and four experts' manually coded results. See this supplementary document for more information.
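
For readers unfamiliar with this kind of evaluation, the sketch below shows one simple way to compute agreement between package output and manual coding in base R. The labels are made up for illustration and are NOT the study's data; the authors' actual procedure with four experts is described in their supplementary document and may differ.

```r
# Made-up labels for illustration only; these are not the study's data.
expert_labels  <- c("SDG1", "SDG2", "none",  "SDG10", "SDG17", "SDG3")
package_labels <- c("SDG1", "SDG2", "SDG10", "SDG10", "SDG17", "none")

# Share of texts where the package label equals the expert label.
agreement <- mean(expert_labels == package_labels)
print(agreement)

# Cross-tabulation shows which classes are confused with which.
all_levels <- sort(union(expert_labels, package_labels))
confusion <- table(
  expert  = factor(expert_labels, levels = all_levels),
  package = factor(package_labels, levels = all_levels)
)
print(confusion)
```

Off-diagonal cells of the table separate false positives (expert says "none") from missed detections (package says "none"), which a single accuracy figure hides.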

We also appreciate your comments on the code, and we have addressed all of these issues. More specifically, we have

Thank you again for your time and very helpful comments & suggestions!

csaybar commented 1 year ago

Hi @Yingjie4Science,

Thank you for your reply! You can check the current code coverage by running:

```r
library(covr)
covr::package_coverage()

# SDGdetector Coverage: 18.78%
# R/add_sdg_pattern.R: 0.00%
# R/detect_region.R: 0.00%
# R/helper_SDG_search_terms.R: 0.00%
# R/plot_sdg_bar.R: 0.00%
# R/plot_sdg_map.R: 0.00%
# R/sdg_color.R: 0.00%
# R/sdg_icon.R: 0.00%
# R/SDGdetector.R: 84.31%
```

https://covr.r-lib.org/

Yingjie4Science commented 1 year ago

Thanks, @csaybar! This is super helpful. We have added more tests, and the current coverage is 93.45%. See the details at https://github.com/Yingjie4Science/SDGdetector/commit/bfed791e877e82ceafb7bbe25b0ebfad95b11b2a:

covr::package_coverage()

# SDGdetector Coverage: 93.45%
# R/SDGdetector.R: 82.05%
# R/add_sdg_pattern.R: 95.45%
# R/helper_SDG_search_terms.R: 95.45%
# R/plot_sdg_bar.R: 97.30%
# R/detect_region.R: 100.00%
# R/plot_sdg_map.R: 100.00%
# R/sdg_color.R: 100.00%
# R/sdg_icon.R: 100.00%

csaybar commented 1 year ago

Hi, @Yingjie4Science, I have finished the checklist at openjournals/joss-reviews#5124 and, as far as my review is concerned, you are all set. Your edits look very good in both the paper and the code.

Yingjie4Science commented 1 year ago

Hi @csaybar, thank you again for your time and very constructive comments! I learned many new things from you and the process!