ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

deduplication issue - potentially with Dimensions metadata #99

Closed TNRiley closed 1 year ago

TNRiley commented 1 year ago

There are two-letter metadata columns that are not aligning with the fully named fields (do/DOI, JO/Journal).

Emailed a quick video on chasing a deduplication problem down to a potential difference in metadata.

To check the unique DIM items from the spreadsheet, I reviewed the unique DIM records and verified that they were not actually unique by searching an EndNote library that contained all the citations from each database. Then I looked at the dedup results data:

dedup_results_unique <- dedup_results$unique

The example I provided was 10006. You can see how the variations of the field names are potentially causing the issue.

TNRiley commented 1 year ago

Took the Dimensions .ris, imported it into EndNote, exported it, and reloaded it into CiteSource: same issue. I thought this might adjust the field names, but no luck. I can't see why these abbreviated fields are being created or why the full-name fields are empty.

TNRiley commented 1 year ago

Worked on this a bit more and I think that it's related to the manual deduplication. When running the dedup_citations function with manual_dedup=FALSE, the numbers almost swap. (two screenshots attached)

TNRiley commented 1 year ago

Wiped my environment and ran both again. Both the manual TRUE and FALSE runs produced accurate results, with the majority of the DIM citations showing as overlap. I believe I must have had something in my environment that was creating the error. My best guess is that this is related to the original Dimensions RIS that I had uploaded. This will need to be reviewed, so I'm keeping this open for now.

TNRiley commented 1 year ago

Verified that this issue is related to #96: both the raw Dimensions .ris and the raw PsycINFO .ris (one per issue) were the problem. A new issue should be created so that a check on .ris files is performed. This issue was solved by importing the problematic .ris files into EndNote and then exporting them as a new .ris. EndNote presumably accounts for this issue and exports in a standardized format.
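
As a starting point for that .ris check, here is a rough sketch of what a standalone pre-check could look like. It is not part of CiteSource, and the thread never pinned down what exactly was wrong with the raw files, so the rules below are assumptions: it only flags lines that look like tag lines but deviate from the usual RIS layout (two-character uppercase tag, two spaces, hyphen, space), plus records with a missing TY or ER. The function name `check_ris_text` is made up for illustration.

```python
import re
from collections import Counter

# Standard RIS tag line: a two-character tag (uppercase letter, then an
# uppercase letter or digit), two spaces, a hyphen, then an optional value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -(?: (.*))?$")

# Assumption: anything that resembles "xx - value" but fails the strict
# pattern (lowercase tag, wrong spacing, leading whitespace) is suspicious.
LOOSE_TAG = re.compile(r"^\s*([A-Za-z][A-Za-z0-9])\s*-\s")


def check_ris_text(text):
    """Return (tag_counts, problems) for the contents of a .ris file.

    problems is a list of (line_number, message) tuples.
    """
    tag_counts = Counter()
    problems = []

    if text.startswith("\ufeff"):
        problems.append((1, "file starts with a byte-order mark"))
        text = text.lstrip("\ufeff")

    open_record = False
    lineno = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines between records are fine
        match = TAG_LINE.match(line)
        if match:
            tag = match.group(1)
            tag_counts[tag] += 1
            if tag == "TY":
                if open_record:
                    problems.append((lineno, "TY found before previous ER"))
                open_record = True
            elif tag == "ER":
                if not open_record:
                    problems.append((lineno, "ER found without a TY"))
                open_record = False
        elif LOOSE_TAG.match(line):
            problems.append((lineno, "nonstandard tag line: " + line[:30]))
        # Any other line is treated as a continuation of the previous value.

    if open_record:
        problems.append((lineno, "last record has no ER"))
    return tag_counts, problems
```

To try it on a file, read the file as UTF-8 text and pass the contents to `check_ris_text`; an empty `problems` list only means none of these particular checks fired, not that the file will deduplicate cleanly.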