ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

Deduplication & Table errors when uploading files without URL field (3 Web of Science .ris example) #179

Closed TNRiley closed 3 months ago

TNRiley commented 5 months ago

This appears to be limited to a specific case study I'm working to document. These errors do not come up when I use other .ris files. My thought is that something is happening with the WOS identifier or something related.

I ran three searches in Web of Science (simple variations on one search string) and exported the full record for each:

  1. (whale OR cetacean) AND ((passive AND acoustic) AND (monitor OR record OR detect*)) n=642
  2. (whale OR cetacean) AND (“passive acoustic” AND (monitor OR record OR detect*)) n=557
  3. (whale OR cetacean) AND (“passive acoustic monitoring” OR “passive acoustic recording” OR “passive acoustic detection”) n=367

The first error occurs if you do any manual deduplication: all tables and visuals then show an error. The second error is constant, regardless of whether you do manual deduplication: the individual record table throws an error.

Note: when deduplicating, I do get the pop-up saying that, of the 1566 records, 642 were unique. This makes no sense, as that is the number of records from v1 and there is complete overlap between v2 and v3. Furthermore, there is 1 set of potential duplicates that for some reason was not automatically identified despite all data coming from WoS (this could be a metadata issue, so it would need to be reviewed later).

DrMattG commented 5 months ago

I get an error because it is looking for the cite_string column, which does not exist in the dataframes. Is this because it is not declared in the function the way cite_source and cite_label are? https://github.com/ESHackathon/CiteSource/blob/415f910ee99cebfaaf0570d5e2b767721bebb299/R/dedup.R#L41

TNRiley commented 5 months ago

I've uploaded the .ris and a script in the test file folder.

TNRiley commented 5 months ago

Running these files in R rather than in the Shiny app throws an error for the record-level table. The URL column appears to be the issue: looking at the data, the URL is missing from the start in all of these WoS files, and in others. I think the second error with the table occurs because this is the first time I've run a test with ONLY WoS .ris files. When files from other databases are included, the URL column is added and the WoS citations end up with NA in it. Most likely the way to fix this is to make the record-level table not reliant on the URL column to run.

Error in `dplyr::mutate()`:
ℹ In argument: `reference = generate_apa_reference(...)`.
Caused by error in `.data$url`:
! Column `url` not found in `.data`.
Run `rlang::last_trace()` to see where the error occurred.
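One possible way to make the table independent of the URL column, as a minimal base R sketch: add any missing optional column as NA before the reference string is built. The helper name `ensure_columns()`, the toy data, and the choice of `url` and `doi` as optional columns are assumptions for illustration and are not part of CiteSource.

```r
# Hypothetical helper, not part of CiteSource: add any missing optional
# columns as NA_character_ so later code can refer to them safely.
ensure_columns <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  for (col in missing_cols) {
    df[[col]] <- rep(NA_character_, nrow(df))
  }
  df
}

# Toy stand-in for citations imported only from Web of Science (no `url`).
wos_only <- data.frame(
  author = c("Smith, J.", "Lee, K."),
  year = c("2019", "2021"),
  title = c("Whale calls", "Acoustic monitoring"),
  stringsAsFactors = FALSE
)

# After this call, `patched$url` exists and is all NA, so code that reads
# the column no longer fails with "Column `url` not found".
patched <- ensure_columns(wos_only, c("url", "doi"))
```

Columns that already exist are left untouched, so the same call should be safe when files from other databases have already supplied a URL.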

TNRiley commented 5 months ago

@LukasWallrich can you take a look at how the record_level_table and the generate_apa_reference functions can be changed to work if there is no URL column due to it not being included in any of the citation files/metadata? I've tried but have not been successful.

I'm also removing the testing script from the test folder because of the CMD check failure. I'll keep the .ris files in "shinytest" in the test folder.

LukasWallrich commented 3 months ago

@TNRiley I changed the record_level_table(), can you check if the issue persists?

TNRiley commented 3 months ago

@LukasWallrich I'm still running into an error in both the Shiny app and R. I've been unsuccessful in troubleshooting: it's hung up on the weblink column needing to be a character type, and despite converting and checking, I still get the error.

TNRiley commented 3 months ago

Also, the record_level_table function is written so that it uses the "citations" tibble, but all the tests I have been running and all the examples in the vignettes use "unique_citations", which seems correct. I believe the record_level_table function needs to be corrected to ensure it's pointing to the deduplicated unique_citations data, but I need to review further.

LukasWallrich commented 3 months ago

Sorry about that - this bug fix was too ad-hoc.

I have now moved the error and type checking into the generate_apa_reference() function, and also fixed the reference generation for single-name authors. I also added tests that should prevent URL and type errors (or other missing-column issues) from recurring. @TNRiley can you test it again, and also keep an eye out for references that are misformatted?
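To illustrate the general idea of checking types inside the reference builder rather than in each caller, here is a minimal base R sketch. It is not the actual generate_apa_reference() code: the function name `format_reference()`, its arguments, and the simplified output format are assumptions made up for this example.

```r
# Hypothetical sketch, not the CiteSource implementation: validate and coerce
# inputs inside the reference builder instead of in each caller.
format_reference <- function(author, year, title, url = NA_character_) {
  # Coerce everything to character so factors or numbers do not cause
  # type errors further down.
  author <- as.character(author)
  year <- as.character(year)
  title <- as.character(title)
  url <- as.character(url)

  n <- length(title)
  # Recycle a single (possibly missing) URL to full length; reject other
  # length mismatches.
  if (length(url) == 1L) url <- rep(url, n)
  stopifnot(length(author) == n, length(year) == n, length(url) == n)

  ref <- paste0(author, " (", year, "). ", title, ".")
  # Append the link only where one is present.
  has_url <- !is.na(url) & nzchar(url)
  ref[has_url] <- paste(ref[has_url], url[has_url])
  ref
}

refs <- format_reference(
  author = c("Smith, J.", "Lee, K."),
  year = c(2019, 2021),
  title = c("Whale calls", "Acoustic monitoring"),
  url = c(NA, "https://example.org/lee")
)
```

With the checks in one place, a caller can pass a numeric year or omit the URL entirely, and the function still returns one string per record.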

Re citations: the function uses whatever argument you pass to it. So if you call record_level_table(unique_citations), R translates that to record_level_table(citations = unique_citations), because unnamed arguments are matched in order, and any reference to citations within the function refers to the value of that argument. We can rename the argument (and all references to it within the function) to offer a consistent user interface, but given that all functions after dedup require unique citations, I am wondering whether we should rather rename it all to citations. While this does not change functionality, consistency will make for a better user experience (and this should not change after the first release, as it would be a breaking change). [I will create a new issue for this, so that we can close this one if the URL problem is solved.]
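The positional matching described above can be checked with a toy function. This is a minimal sketch: `count_records()` and the two data frames are made up for illustration and merely stand in for record_level_table() and the real tibbles.

```r
# Toy function (not the CiteSource one) whose argument is named `citations`.
count_records <- function(citations) {
  # Inside the function, `citations` is whatever the caller passed in,
  # not a global object that happens to share the name.
  nrow(citations)
}

unique_citations <- data.frame(id = 1:3)
citations <- data.frame(id = 1:5)

# Positional and named calls are equivalent: both use the 3-row data frame,
# even though a 5-row object called `citations` exists in the workspace.
by_position <- count_records(unique_citations)
by_name <- count_records(citations = unique_citations)
```

Both calls count the rows of `unique_citations`, so the argument name inside the function does not decide which data set is used.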

TNRiley commented 3 months ago

I tested it in the local Shiny app and it worked. The small sample of citations I viewed also looked good and was formatted correctly.