ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

Difference in deduplication between RAW and PROCESSED data #151

Closed: TNRiley closed this issue 1 year ago

TNRiley commented 1 year ago

@kaitlynhair @LukasWallrich This issue continues to come up, though I have not had an example where the RAW data performed better than the PROCESSED data. By RAW data I mean the raw .ris files exported directly from sources. By PROCESSED data I mean raw .ris files that have been imported into citation management software (potentially combining multiple raw files) and then exported again as a .ris file.

This example uses the data available in the benchmarking vignette. I combined the multiple raw files into single .ris exports and ran the exact same code. As the screenshots below show, the processed .ris files report that 17 benchmarking articles were not caught, while the raw version reports only 6 (which is correct). Furthermore, the citations in the table appear to carry DOI metadata from multiple records.

PROCESSED: [two screenshots]
RAW: [two screenshots]

TNRiley commented 1 year ago

Had a chance to review things further and found that the benchmarking .ris had a number of records with duplicate lines in the DOI field. I removed these and everything looks good. @kaitlynhair just something to be aware of for ASySD. Maybe it would be possible to take only the first line of the DOI and strip anything after a line break?
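
The suggestion above (keep only the first DOI line and drop anything after a line break) could be sketched roughly as follows. This is a minimal Python illustration, not how ASySD or CiteSource handle DOIs. The function names `first_doi_line` and `records_with_repeated_doi` are hypothetical, and the sketch assumes the duplication shows up either as a multi-line DOI value or as repeated `DO` tags within one RIS record.

```python
import re

# Assumed RIS line shape: two-character tag, two spaces, a hyphen, then the value.
RIS_TAG = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def first_doi_line(doi):
    """Return the first non-empty line of a DOI value, trimmed of whitespace."""
    if doi is None:
        return None
    for line in str(doi).splitlines():
        line = line.strip()
        if line:
            return line
    return ""


def records_with_repeated_doi(ris_text):
    """Return (record_index, doi_values) for each record with more than one DO line."""
    flagged = []
    record_index = 0
    dois = []
    for raw_line in ris_text.splitlines():
        match = RIS_TAG.match(raw_line)
        if not match:
            continue  # skip blank lines and untagged continuation lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "DO":
            dois.append(value)
        elif tag == "ER":  # end of record: report, then reset for the next one
            if len(dois) > 1:
                flagged.append((record_index, dois))
            record_index += 1
            dois = []
    return flagged
```

Running `records_with_repeated_doi` over a processed .ris export before deduplication would list the records worth inspecting, and `first_doi_line` could be applied to each DOI value afterwards. Whether trimming to the first line is safe depends on the export: if the extra lines are not exact duplicates, dropping them discards information.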