ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

Treat identical citations always as duplicates? #160

LukasWallrich closed this issue 1 week ago

LukasWallrich commented 1 year ago

Currently, CiteSource does not always treat identical citations as duplicates: if the records are not complete enough, ASySD does not reach sufficient confidence to match them. For instance, if we import the working example final.ris (242 results) twice, ASySD finds 272 unique citations before manual deduplication.

I would be minded to add a default in CiteSource that treats identical entries as duplicates, if this appears too risky to build into ASySD. As it stands, summaries across stages are predictably misleading until one completes the manual deduplication, which makes CiteSource less useful for quick exploration than it could be.
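To make the proposed default concrete, here is a minimal sketch of a pre-pass that collapses records whose fields are all identical and keeps track of which sources they came from. It is written in Python purely for illustration. The record layout (a dict of string fields plus a `source` key) and the function name `collapse_identical` are assumptions, not CiteSource's or ASySD's actual data structures or API.

```python
def collapse_identical(records):
    """Collapse records whose fields are all identical, ignoring the source label.

    Hypothetical helper: each record is assumed to be a dict of string fields
    with a "source" key. The first occurrence is kept and sources are merged.
    """
    merged = {}
    for rec in records:
        # Build a hashable key from every field except the source label.
        key = tuple(sorted((k, v) for k, v in rec.items() if k != "source"))
        if key not in merged:
            merged[key] = {**rec, "source": [rec["source"]]}
        elif rec["source"] not in merged[key]["source"]:
            merged[key]["source"].append(rec["source"])
    return list(merged.values())


# Importing the same sparse record from two sources yields one record with both labels.
records = [
    {"title": "A sparse record", "year": "2020", "source": "import_1"},
    {"title": "A sparse record", "year": "2020", "source": "import_2"},
]
unique = collapse_identical(records)
```

Because the key requires every field to match, a pass like this cannot merge records that differ in any way, so it would only catch the "same file imported twice" case and leave fuzzier matching to ASySD.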

@kaitlynhair @TNRiley what are your thoughts?

TNRiley commented 1 year ago

I'd like to take a look at that example of importing final.ris twice. I'm interested in what "enough" means exactly and which metadata those records are missing. We should give users instructions to make sure specific fields are complete, but I could also see this becoming an argument to the dedup function.

TNRiley commented 1 year ago

I ran the same file of 242 final articles twice. There were 34 pairs that were not identified as duplicates; however, they did come up in the manual deduplication.

If you proceed without the manual deduplication, you get the following pop-up, which is confusing. I'm not sure where the 272 is coming from. The 484 makes sense, as that is 242 x 2, and the 34 also makes sense, as it's the number of pairs I mentioned above.

[Screenshot: captures_chrome-capture-2023-5-21]

The upset plot and the individual record table are both off too: each shows 30 unique citations in each source. I'm not sure why these show 30 unique rather than the 34 that were identified as potential duplicates.

[Screenshot: captures_chrome-capture-2023-5-21 (1)]
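A quick arithmetic check on the numbers in this thread, offered as one possible reading rather than a confirmed explanation: the 272 in the pop-up is the total you would get if 30 records per import stayed unmatched (the figure shown in the plot and table), whereas 34 unmatched pairs would give 276. The variable names are illustrative only, and the check does not explain why 30 and 34 differ.

```python
records_per_import = 242
total_records = 2 * records_per_import  # both imports combined

unmatched_pairs = 34   # pairs flagged for manual deduplication
unique_in_plot = 30    # unique citations per source in the upset plot

# Unique citations = matched records (counted once) + unmatched records from both imports.
unique_if_34_unmatched = (records_per_import - unmatched_pairs) + 2 * unmatched_pairs
unique_if_30_unmatched = (records_per_import - unique_in_plot) + 2 * unique_in_plot

print(total_records, unique_if_34_unmatched, unique_if_30_unmatched)
```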

This is more of a metadata quality issue; however, I do agree that we should reach some consensus on exact matches. For example, records could be identified as duplicates if at least x fields match exactly. There may also be specific combinations we want to identify (e.g. title and DOI both matching exactly).
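The rule suggested above could be sketched roughly as follows, in Python for illustration. The field list, the default threshold of 3, and the function name `is_exact_duplicate` are placeholders chosen for the example. None of them is an agreed rule or part of CiteSource or ASySD.

```python
# Hypothetical set of fields to compare; the real list would need to be agreed.
FIELDS = ("title", "doi", "author", "year", "journal")


def is_exact_duplicate(a, b, min_matches=3):
    """Return True if two records count as duplicates under the sketched rule."""

    def same(field):
        # A missing or empty value never counts as a match.
        va, vb = a.get(field), b.get(field)
        return bool(va) and va == vb

    # Specific combination: title and DOI both match exactly.
    if same("title") and same("doi"):
        return True
    # General rule: at least min_matches fields match exactly.
    return sum(same(f) for f in FIELDS) >= min_matches


a = {"title": "Some study", "author": "Doe, J.", "year": "2020"}
b = {"title": "Some study", "author": "Doe, J.", "year": "2020"}
c = {"title": "Another study", "year": "2020"}
```

Here `a` and `b` have no DOI but match on three fields, so they would count as duplicates, while `a` and `c` share only the year and would not. Treating empty fields as non-matches is a deliberate choice in this sketch, so that two records are never merged on the basis of what they both lack.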

TNRiley commented 1 year ago

I'm going to add a discussion thread about building a test .ris file. This file should include known duplicates and known false positives (records that look alike but are not duplicates). We can easily label these in CiteSource to test various deduplication changes.
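As a rough illustration of how such a labelled file could be used, the sketch below compares the pairs a deduplication run reports against the pairs known to be duplicates. It is in Python, and the record IDs, the pair representation, and the `evaluate` function are all hypothetical; nothing here reflects an existing CiteSource or ASySD interface.

```python
def evaluate(predicted_pairs, known_duplicate_pairs):
    """Count correct, spurious, and missed duplicate pairs.

    Pairs are treated as unordered, so ("r1", "r2") equals ("r2", "r1").
    """
    predicted = {frozenset(p) for p in predicted_pairs}
    known = {frozenset(p) for p in known_duplicate_pairs}
    return {
        "true_positives": len(predicted & known),
        "false_positives": len(predicted - known),
        "false_negatives": len(known - predicted),
    }


# Made-up record IDs, for illustration only.
result = evaluate(
    predicted_pairs=[("r1", "r2"), ("r3", "r4")],
    known_duplicate_pairs=[("r2", "r1"), ("r5", "r6")],
)
```

Running the same comparison before and after a change to the matching rules would show whether the change catches more true duplicates without adding false positives.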

TNRiley commented 1 week ago

Deduplication improvements should be handled in ASySD. I think any examples we can provide will help improve the deduplication/matching algorithms in the future. Closing this, as I believe it's ASySD-focused.