MaayanLab / rummagene

https://rummagene.com

Some identical sets show up twice because they come from both the preprint and the published version of the same paper #47

Closed AviMaayan closed 9 months ago

AviMaayan commented 1 year ago
[Screenshot: Screen Shot 2023-10-21 at 9 18 36 PM]

Not sure this is a major issue... or an easy thing to fix. It came from the first user that reported any issue.

u8sand commented 1 year ago

Seems a tad tricky given that the two versions have completely separate PMC IDs and tables. If there is some metadata somewhere linking them, we could use it; I'll look into it.

Update:

I dug a bit and found a reference in Crossref pointing from the preprint's DOI to the published article's DOI: https://api.crossref.org/works/10.1101/2021.02.20.431155

{"relation":{"is-preprint-of":[{"id-type":"doi","id":"10.1016\/j.cell.2021.07.023","asserted-by":"subject"}]}}
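For illustration, here is a minimal sketch of pulling the published DOI out of a record shaped like the one above. The function name `published_dois` is hypothetical, and the input is the abbreviated record shown here; in the full API response I believe this object sits under a top-level `message` key, so treat that as an assumption to verify.

```python
import json


def published_dois(work):
    """Return the DOIs this work is a preprint of, per its `relation` field.

    `work` is a dict shaped like the abbreviated Crossref record above
    (hypothetical helper, not part of the rummagene codebase).
    """
    relations = work.get("relation") or {}
    return [
        rel["id"].lower()  # DOIs are case-insensitive, so normalise for matching
        for rel in relations.get("is-preprint-of", [])
        if rel.get("id-type") == "doi" and rel.get("id")
    ]


# The abbreviated record quoted in this comment ("\/" is a JSON-escaped "/").
record = json.loads(
    r'{"relation":{"is-preprint-of":[{"id-type":"doi",'
    r'"id":"10.1016\/j.cell.2021.07.023","asserted-by":"subject"}]}}'
)
print(published_dois(record))
```

Works without an `is-preprint-of` relation simply yield an empty list, so the same helper could be run over every DOI we have.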

I think the way to address this is to capture these relationships in the database and omit results from preprints whose published articles are also in the database. We at least have DOIs for the PMCs, but we need to find a good endpoint for bulk download of DOI metadata (perhaps it's just the Crossref API I referenced above).
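A rough sketch of the filtering idea, assuming the preprint-to-published relationships have already been loaded into a plain mapping. All names here (`drop_superseded_preprints`, `preprint_of`, `dois_in_db`, the `doi` and `table` fields) are hypothetical and do not reflect the actual database schema.

```python
def drop_superseded_preprints(results, preprint_of, dois_in_db):
    """Omit results from preprints whose published article is also in the database.

    results     -- list of dicts, each with a "doi" key (hypothetical shape)
    preprint_of -- dict mapping preprint DOI -> published article DOI
    dois_in_db  -- set of all DOIs that have gene sets in the database
    """
    kept = []
    for result in results:
        published = preprint_of.get(result["doi"])
        if published is not None and published in dois_in_db:
            continue  # the published version will show the same set
        kept.append(result)
    return kept


# Example using the DOI pair from this thread; table names are made up.
results = [
    {"doi": "10.1101/2021.02.20.431155", "table": "preprint-table"},
    {"doi": "10.1016/j.cell.2021.07.023", "table": "published-table"},
]
preprint_of = {"10.1101/2021.02.20.431155": "10.1016/j.cell.2021.07.023"}
dois_in_db = {r["doi"] for r in results}

print(drop_superseded_preprints(results, preprint_of, dois_in_db))
```

Note that a preprint is kept when its published article is not in the database, so no sets are lost.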

Update:

The following API query seems likely to get us exactly what we want: https://api.crossref.org/works?select=DOI%2Crelation&filter=doi%3A10.1101%2F2021.02.20.431155%2Cdoi%3A10.1016%2Fj.cell.2021.07.023

i.e. we can grab potential relation.is-preprint-of records for DOIs in batches.
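A small sketch of how the batched query URLs could be built, using only the standard library. The helper name `batch_urls` and the batch size of 20 are assumptions (I believe Crossref returns 20 rows per page by default, so larger batches would probably need a `rows` parameter); this only constructs URLs and does not make any requests.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def batch_urls(dois, batch_size=20):
    """Yield one Crossref query URL per batch of DOIs (hypothetical helper).

    batch_size=20 is an assumption, chosen to stay within what is believed
    to be the default page size of the API.
    """
    dois = list(dois)
    for start in range(0, len(dois), batch_size):
        batch = dois[start:start + batch_size]
        params = {
            "select": "DOI,relation",
            # Comma-separated doi: filters, same form as the query above.
            "filter": ",".join(f"doi:{doi}" for doi in batch),
        }
        yield f"{CROSSREF_WORKS}?{urlencode(params)}"


for url in batch_urls(["10.1101/2021.02.20.431155", "10.1016/j.cell.2021.07.023"]):
    print(url)
```

For the two DOIs in this thread, this reproduces the query URL given above.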