Difference in deduplication between RAW and PROCESSED data

@kaitlynhair @LukasWallrich This issue continues to come up, though I have not had an example where the RAW data performed better than the PROCESSED data. By RAW data I mean the raw .ris files exported directly from the sources. By PROCESSED data I mean raw .ris files that have been imported into citation management software (potentially combined with multiple raw files) and then exported again as a .ris file.

This example uses the data available in the benchmarking vignette. I combined the multiple raw files into single .ris exports and ran the exact same code on both versions. As the screenshots below show, with the processed .ris files 17 benchmarking articles were not caught, while with the raw files only 6 were missed (which is correct). Furthermore, the citations in the table appear to carry DOI metadata from multiple records.
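One way to narrow this down might be to check which RIS fields differ between the two exports before any deduplication runs. The following is a minimal Python sketch, not part of CiteSource: the function names (`parse_ris`, `field_coverage`, `compare_coverage`) and the two sample records are made up for illustration, and the parser ignores continuation lines. It counts how many records carry each RIS tag and reports the tags whose counts differ.

```python
import re
from collections import Counter

# RIS tag lines look like "TY  - JOUR": a two-character tag, two spaces, a hyphen, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Split RIS text into records, each a dict mapping tag -> list of values."""
    records, current = [], None
    for line in text.splitlines():
        match = TAG_LINE.match(line.strip("\ufeff\r\n"))
        if not match:
            continue  # blank and continuation lines are skipped in this sketch
        tag, value = match.group(1), match.group(2).strip()
        if tag == "TY":
            current = {}  # TY opens a new record
        if current is None:
            continue  # ignore tags that appear outside a record
        if tag == "ER":
            records.append(current)  # ER closes the record
            current = None
        else:
            current.setdefault(tag, []).append(value)
    return records


def field_coverage(records):
    """Count how many records carry each tag at least once."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


def compare_coverage(raw_records, processed_records):
    """Return {tag: (raw_count, processed_count)} for tags whose counts differ."""
    raw = field_coverage(raw_records)
    processed = field_coverage(processed_records)
    return {
        tag: (raw[tag], processed[tag])
        for tag in sorted(set(raw) | set(processed))
        if raw[tag] != processed[tag]
    }


# Made-up sample records, only to show the shape of the comparison.
raw_text = """TY  - JOUR
TI  - Example title
DO  - 10.1000/example
AB  - Example abstract
ER  - 
"""
processed_text = """TY  - JOUR
T1  - Example title
M3  - 10.1000/example
ER  - 
"""

differences = compare_coverage(parse_ris(raw_text), parse_ris(processed_text))
for tag, (raw_count, processed_count) in differences.items():
    print(f"{tag}: raw={raw_count} processed={processed_count}")
```

To use it on real exports, the two sample strings would be replaced with the contents of the raw and processed .ris files. Tags that appear in only one version (for example a DOI stored under a different tag) would be candidates for explaining the difference in matching.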

[Screenshots: PROCESSED, RAW]

ESHackathon / CiteSource

Difference in deduplication between RAW and PROCESSED data #151