large number of variables on upload

TNRiley commented 1 year ago

Here is the info on the 5 larger files I have from our recent systematic map:

Dimensions ~23k citations - 53.7 MB (20 variables)
LENS ~23k citations - 71 MB (105 variables)
ProQuest ~16k citations - 94.9 MB (256 variables)
Web of Science ~22k citations - 128.4 MB (477 variables)
Scopus ~29k citations - 175.9 MB (433 variables)

I've temporarily saved the RData so that folks can take a look for themselves. It's in the vignettes/troubleshooting folder.

I'm extremely surprised by the number of variables in Scopus and WoS. These were all read in using synthesisr::read_ref().

Originally posted by @TNRiley in https://github.com/ESHackathon/CiteSource/issues/136#issuecomment-1544022466

Closed the max size upload issue, but wanted to continue to investigate this.

LukasWallrich commented 1 year ago

We should certainly continue - so here is my comment again:

@TNRiley in WoS, I looked for a couple of records with 'odd' fields - could you extract the RIS entries for them? I started with WoS, only to realise that they are exactly the ones where you did not upload the ris.

10.3390/w12041097 (only TT)
10.1016/j.jenvrad.2004.05.002 and SSOLVED ORGANIC-MATTER (only WN, doi not read correctly)
10.1061/(ASCE)IR.1943-4774.0000319 (only CQ)

One other approach would be to have a look at invalid fields - those are probably the entries where things start to go wrong. In refsWoS, there are 114 DOI fields not containing DOIs - e.g., rows 68 and 83. Could you maybe try to import them separately, and if that works, then with as many records in front of them as it takes until things go wrong? (To check whether there are DOIs, I used refsWoS %>% mutate(rownum = row_number()) %>% filter(!str_detect(doi, fixed("10."))) %>% pull(rownum).)
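For anyone who wants to repeat the DOI check without the tidyverse, here is a minimal base-R sketch. The `refs` data frame is a toy stand-in for refsWoS, and the pattern is an assumption about what a DOI roughly looks like ("10.", a numeric registrant code, a slash), so it is stricter than the `fixed("10.")` check quoted above. Unlike the `filter()` version, it also flags rows where the DOI is missing.

```r
# Toy stand-in for refsWoS; the real data frame comes from the import.
refs <- data.frame(
  title = c("Record A", "Record B", "Record C", "Record D"),
  doi = c(
    "10.3390/w12041097",
    "SSOLVED ORGANIC-MATTER",
    NA,
    "https://doi.org/10.1016/j.jenvrad.2004.05.002"
  ),
  stringsAsFactors = FALSE
)

# Assumed DOI shape: "10.", a 4-9 digit registrant code, a slash, then a suffix.
doi_pattern <- "10\\.[0-9]{4,9}/[^[:space:]]+"

# grepl() returns FALSE for NA input, so rows with a missing DOI are flagged too.
suspect_rows <- which(!grepl(doi_pattern, refs$doi))
suspect_rows
```

The resulting row numbers can then be used to pull the matching records out of the .ris file for a separate import.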

TNRiley commented 1 year ago

I've brought in all of the .ris files to the vignettes Test_RIS folder. I'm also going to add the troubleshooting vignette back for now and add the troubleshooting.RData to the vignettes folder too as this is a big dataset and takes some time to run.

I also wanted to experiment with the import and brought the .ris in via CiteSource::read_citations() and synthesisr::read_ref() - I had used synthesisr previously and wanted to just double check. Looks like CiteSource brings in the WOS.ris with 121 variables, while synthesisr is bringing it in with 477. The number of observations also differs by a little less than 1k, which I'm wondering about, but overall the number of variables is much better with CS.

You will also see that I did a count across the variables to see how many records were using each variable. This should help us in reviewing potentially unnecessary data that could either be ignored when reading in or removed afterward.
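A count like the one described above could be sketched roughly as follows. The `refs` data frame and its column values are made up for illustration; the only assumption is that unused fields are stored as NA after import.

```r
# Toy citation table; in practice this would be the imported .ris data.
refs <- data.frame(
  title = c("A", "B", "C"),
  doi = c("10.1000/a", NA, "10.1000/c"),
  ZZ = c("x", NA, NA),
  Q5 = c(NA_character_, NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Number of records with a non-missing value in each variable.
field_counts <- colSums(!is.na(refs))

# One row per variable, rarest first, with the share of records using it.
field_usage <- data.frame(
  field = names(field_counts),
  n_used = as.integer(field_counts),
  share = as.numeric(field_counts) / nrow(refs),
  stringsAsFactors = FALSE
)
field_usage <- field_usage[order(field_usage$n_used), ]
field_usage
```

Variables near the top of `field_usage` are the candidates for being ignored on import or removed afterward.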

I wasn't clear if you had wanted .ris files for these specific 3 records, obviously, they will be in the Test_RIS data, but did you want to have the .ris for them as individual records?

TNRiley commented 1 year ago

Wasn't able to upload the RData due to size, so if you're taking a look at it, know that it will take a bit of time to run.

I plan on running the count across each of the .ris so that we can discuss this at our next meeting.

TNRiley commented 1 year ago

Some initial observations:

PQ, Scopus, and WoS are the worst offenders (80, 108, and 121 variables respectively). All other sources have 30 or fewer variables.

ZZ - this column has one record in each .ris, and it seems to always be the first record (TY - JOUR). The next column is source type, which contains JOUR, BOOK, or CONF. It seems like something is happening on import where the first record is not being parsed like the rest.

supertaxa: seems to just be a repetition of the title.
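Both observations can be checked with a couple of lines. This is only a sketch on a made-up `refs` table; the column names `supertaxa`, `title`, and `ZZ` come from the import, everything else is hypothetical.

```r
# Toy data mimicking the pattern described above.
refs <- data.frame(
  title = c("Title one", "Title two", "Title three"),
  supertaxa = c("Title one", "Title two", NA),
  ZZ = c("JOUR", NA, NA),
  stringsAsFactors = FALSE
)

# Share of non-missing supertaxa values that exactly repeat the title.
has_supertaxa <- !is.na(refs$supertaxa)
share_repeated <- mean(
  refs$supertaxa[has_supertaxa] == refs$title[has_supertaxa]
)

# Row numbers where ZZ is populated; the suspicion is that this is only row 1.
zz_rows <- which(!is.na(refs$ZZ))

share_repeated
zz_rows
```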

TNRiley commented 1 year ago

@LukasWallrich @kaitlynhair what are your thoughts on moving all of our troubleshooting stuff to a new branch? I want to try and get things launched. This stuff is worthy of further investigation and tinkering, but I don't want to have it hold things up.

LukasWallrich commented 1 year ago

RIS import is hard - and I am not sure that CiteSource is the right place to fix it. revtools and synthesisr are trying to do it with the limitations we have here (and seem to have stopped development), litsearchr is experiencing bugs and has not been touched since 2011 ... so this might be something for future ESHackathons to address?

I tried to track down the issue, but am now very confused - in the WoS import, I identified one broken example, but cannot find the information contained in the last 6 fields anywhere in the ris file. Not sure if my text editor cannot handle the large file - but to effectively troubleshoot, we would need a smaller file where similar issues appear.

citations %>%
  dplyr::filter(doi == "https://doi.org/10.1016/j.apgeochem.2005.04.003") %>%
  dplyr::select(where(~ !all(is.na(.x)))) %>%  # keep only columns with a non-missing value
  dplyr::glimpse()
#> Rows: 1
#> Columns: 31
#> $ database        "PQ"
#> $ author          "Piani, Raffaella and Covelli, Stefa…
#> $ year            "2005"
#> $ title           "Mercury contamination in Marano Lag…
#> $ source          "Applied Geochemistry"
#> $ volume          "20"
#> $ issue           "8"
#> $ start_page      "1546-1559"
#> $ abstract        "Total Hg concentrations and Hg spec…
#> $ doi             "https://doi.org/10.1016/j.apgeochem…
#> $ accession_zr    "19422580; 6644904"
#> $ url             "https://login.proxy.lib.duke.edu/lo…
#> $ supertaxa       "Mercury contamination in Marano Lag…
#> $ ID              "195067"
#> $ source_type     "JOUR"
#> $ notes           "Date revised - 2007-02-01 and Subje…
#> $ cite_source     "PQ"
#> $ cite_label      "search"
#> $ chemicals       "Date revised - 2007-02-01 and Subje…
#> $ address         "Department of Geological, Marine an…
#> $ issn            "0883-2927, 0883-2927"
#> $ date_generated  "August 2005 and 2017-05-24"
#> $ keywords        "cinnabar and Aqualine Abstracts and…
#> $ language        "English"
#> $ Q5              "08503:Characteristics, behavior and…
#> $ SW              "3030:Effects of pollution"
#> $ AQ              "00008:Effects of Pollution"
#> $ Q2              "09264:Sediments and sedimentation"
#> $ Q3              "08587:Diseases of Cultured Organism…
#> $ Q4              "27750:Environmental"
#> $ NAs             154
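To get the smaller file mentioned above, one option is to split the export into records and write out only the ones of interest. The sketch below assumes that every record ends with an "ER  -" line, which is the usual RIS convention; the toy lines, the DOI values, and the helper name `extract_records` are all made up for illustration.

```r
# Toy RIS content; in practice: ris_lines <- readLines("path/to/export.ris")
ris_lines <- c(
  "TY  - JOUR", "TI  - First toy record", "DO  - 10.1000/one", "ER  - ",
  "",
  "TY  - JOUR", "TI  - Second toy record", "DO  - 10.1000/two", "ER  - ",
  "",
  "TY  - BOOK", "TI  - Third toy record", "ER  - "
)

# A record ends at its "ER  -" line, so each line after an ER line starts a new record.
is_end <- grepl("^ER[[:space:]]+-", ris_lines)
record_id <- cumsum(c(FALSE, head(is_end, -1))) + 1L
records <- split(ris_lines, record_id)

# Hypothetical helper: keep the records that contain a given literal string.
extract_records <- function(records, pattern) {
  keep <- vapply(
    records,
    function(rec) any(grepl(pattern, rec, fixed = TRUE)),
    logical(1)
  )
  unlist(records[keep], use.names = FALSE)
}

subset_lines <- extract_records(records, "10.1000/two")
# writeLines(subset_lines, "subset.ris")  # then import the small file on its own
```

Taking `records[seq_len(n)]` for growing `n` gives the "as many in front of them until things go wrong" variant.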

For now, I would suggest adding an option to read_citations that drops any RIS fields that are not used by CiteSource - but I am not sure whether that should default to TRUE or FALSE. For now, I defaulted it to TRUE so that it does not require any extra handling in Shiny, but that means that CiteSource users will not get their full data back by default. Also, I kept the list to the variables we actually use (see key_fields in CiteSource.R) - happy to add some common metadata that you think users might often want to retain.
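To make the trade-off concrete, here is a rough sketch of what such an option does. It is not the CiteSource implementation: `drop_extra_fields` and `key_fields_example` are hypothetical names, and the field list is a placeholder for the real key_fields in CiteSource.R.

```r
# Hypothetical stand-in for key_fields; the real list lives in CiteSource.R.
key_fields_example <- c("author", "year", "title", "source", "doi", "abstract")

# Hypothetical helper illustrating the option, not the actual read_citations code.
drop_extra_fields <- function(citations,
                              only_key_fields = TRUE,
                              key_fields = key_fields_example) {
  if (!only_key_fields) {
    return(citations)
  }
  keep <- intersect(names(citations), key_fields)
  dropped <- setdiff(names(citations), keep)
  if (length(dropped) > 0) {
    # Tell the user what was removed so the default is not silent.
    message(
      "Dropping ", length(dropped), " non-key field(s): ",
      paste(dropped, collapse = ", ")
    )
  }
  citations[, keep, drop = FALSE]
}

citations <- data.frame(
  title = "Toy record",
  doi = "10.1000/toy",
  Q5 = "08503",
  supertaxa = "Toy record",
  stringsAsFactors = FALSE
)

trimmed <- drop_extra_fields(citations)                       # default: key fields only
full <- drop_extra_fields(citations, only_key_fields = FALSE) # opt out: everything kept
```

With the default, users lose the extra columns unless they opt out, which is why a message listing the dropped fields seems worth having.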

FYI: I now added a progress bar to read_citations as it can be rather slow ... and fixed a clumsy implementation in the as.data.frame.bibliography to reduce the reading time of the WoS dataset from 65s to 13s on my computer.

TNRiley commented 1 year ago

@kaitlynhair and I talked about .ris last meeting and the difficulties due to the lack of any real standardization. I've always considered .ris as the norm, but I'm tempted to look at how .bib usage stacks up. I'd also be interested in looking at what the library lit says about format standards and interoperability. ESHackathon could be a good place; @rootsandberries is involved with library carpentry and I could see folks in that group having interest in this topic as well.

Defaulting to TRUE sounds like a good way to go about things. We can add a warning and potentially mention which fields it's keeping. We can also add information to the vignettes.

I've added language that recommends that users bring metadata into a citation manager before importing to CS. Beyond combining multiple .ris files from a single source, it does seem to provide some benefits.

TNRiley commented 1 year ago

Closed - @LukasWallrich added the only_key_fields argument to the read_citations function. The issue of .ris quality and variation may be something that ESHackathon works to address in the future.

ESHackathon / CiteSource

large number of variables on upload #146