bio-tools / biotoolsRegistry

biotoolsregistry : discovery portal for bioinformatics
GNU General Public License v3.0

Replace bioRxiv DOIs with published article DOIs #331

Open jaanisoe opened 6 years ago

jaanisoe commented 6 years ago

Following the bioRxiv DOI will of course also enable reading of the article, but for those bioRxiv entries that have a corresponding published DOI, it would be better to use this published DOI.

For example, bio.tools entry with ID "whatshap" has a publication with DOI 10.1101/037101. Following the DOI we can see on the bioRxiv page the text "Now published in Bioinformatics doi: 10.1093/bioinformatics/btw276".

Querying Europe PMC with the bioRxiv DOI will return 0 results: https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&query=doi:10.1101/037101

Querying Europe PMC with the published article DOI will return the corresponding article: https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&query=doi:10.1093/bioinformatics/btw276

The same happens when Scopus is queried: the bioRxiv DOI returns "No documents were found." and the published article DOI returns the corresponding article.

Not sure if this is related, but on the tool page at https://bio.tools/whatshap we can see Publication details for the first publication (10.1089/cmb.2014.0157), but not for the other attached publication (10.1101/037101) from bioRxiv.

Other such bioRxiv DOIs:

10.1101/018788 -> 10.1038/ng.3467
10.1101/109728 -> 10.1186/s12859-017-1708-7
10.1101/110387 -> 10.1101/gr.222109.117
10.1101/012682 -> 10.1038/ng.3244
10.1101/051631 -> 10.1371/journal.pgen.1006599

So, replacing bioRxiv DOIs with published article DOIs will enable better article metadata retrieval. However, finding such bioRxiv DOIs, as articles get published over time, is extra work, and this is maybe a rather low priority problem.

joncison commented 6 years ago

This is important and should be fixed, but ...

@jaanisoe : is there an automated way we can know whether a DOI is a bioRxiv one? In that case, we could automatically identify (and fix) them.

@hansioan suspects that if it starts with 10.1101 and ends with 6 numbers then it's a bioRxiv DOI, but we're not sure.

jaanisoe commented 6 years ago

I'm not sure about the bioRxiv DOIs either, but I'd guess it is as Hans suspects. Maybe they'll increase the number of digits from the current 6 once too many articles have been published. Note that the DOI registrant code 1101 is not enough to identify a bioRxiv DOI, as for example the DOI 10.1101/gr.1239303 points to a different journal (from CSHL that also hosts bioRxiv).
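The heuristic discussed above could be sketched roughly as follows. This is only an illustration of the "10.1101 plus six digits" guess from this thread, not a confirmed rule for bioRxiv DOIs; the function name `looks_like_biorxiv_doi` is made up for the example.

```python
import re

# Heuristic from this thread (an assumption, not a documented rule):
# registrant prefix 10.1101 followed by exactly six digits.
_BIORXIV_DOI = re.compile(r"10\.1101/\d{6}")


def looks_like_biorxiv_doi(doi: str) -> bool:
    """Return True if the DOI matches the six-digit bioRxiv pattern."""
    return _BIORXIV_DOI.fullmatch(doi.strip()) is not None
```

With this pattern, 10.1101/037101 is flagged as a candidate, while 10.1101/gr.1239303 and 10.1101/gr.222109.117 (same registrant, different journal) are not. If the number of digits ever changes, the pattern would need to be adjusted.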

Some time after I first reported the bioRxiv issue, Europe PMC started to support preprints (blog post). This means that the API query reported above no longer returns 0 results, but 1 result from the PPR <source>: https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&query=doi:10.1101/037101

But I guess it would still be better to use the DOI of the published article, as we could then get, among other things, a PMID, a PMCID and MeSH keywords for the article, and it is maybe more authoritative: https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&query=doi:10.1093/bioinformatics/btw276

Also, Scopus still doesn't seem to like the bioRxiv DOIs (so there is no publication metadata in bio.tools for bioRxiv DOIs). But Altmetric works with both DOIs: https://api.altmetric.com/v1/doi/10.1101/037101 https://api.altmetric.com/v1/doi/10.1093/bioinformatics/btw276 In this case, the score of the bioRxiv preprint seems even higher than that of the corresponding published article (I guess this can be expected as it takes into account social media).

Also, as explained in their blog post, Europe PMC is crosslinking the preprint and its peer-reviewed article (using Crossref). So if the Europe PMC API is called with the bioRxiv DOI, we should be able to get the ID of the published article: https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&query=doi:10.1101/037101 The ID seems to be in the <commentCorrection> of <type> "Preprint of" from <source> "MED" in the element <id>. So 27307622 should be the PMID of the published article corresponding to the preprint with DOI 10.1101/037101. From this PMID we can easily get a PMCID and a DOI, maybe like this: https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&query=ext_id:27307622%20src:med
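The lookup described above might look something like the sketch below. It assumes the XML response uses un-namespaced elements and that each <commentCorrection> has <id>, <source> and <type> as direct children. The `SAMPLE` document is a hand-written, heavily simplified stand-in for a real response, and the helper names `search_url` and `published_pmid` are made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# Base URL taken from the queries quoted in this thread.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def search_url(doi: str) -> str:
    """Build the core-result search URL for a DOI (percent-encoded)."""
    return EPMC_SEARCH + "?" + urlencode({"resulttype": "core", "query": f"doi:{doi}"})


def published_pmid(xml_text: str) -> Optional[str]:
    """Return the PMID from a 'Preprint of' commentCorrection with source MED."""
    root = ET.fromstring(xml_text)
    for cc in root.iter("commentCorrection"):
        kind = (cc.findtext("type") or "").strip()
        source = (cc.findtext("source") or "").strip()
        if kind == "Preprint of" and source == "MED":
            pmid = (cc.findtext("id") or "").strip()
            if pmid:
                return pmid
    return None


# Simplified, hand-written stand-in for a response (assumed structure).
SAMPLE = """
<responseWrapper>
  <resultList>
    <result>
      <source>PPR</source>
      <doi>10.1101/037101</doi>
      <commentCorrectionList>
        <commentCorrection>
          <id>27307622</id>
          <source>MED</source>
          <type>Preprint of</type>
        </commentCorrection>
      </commentCorrectionList>
    </result>
  </resultList>
</responseWrapper>
"""

pmid = published_pmid(SAMPLE)
```

The response body fetched from `search_url("10.1101/037101")` would be passed to `published_pmid`; the resulting PMID can then be used in the `ext_id:... src:med` query to retrieve the DOI and PMCID of the published article.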

jaanisoe commented 6 years ago

So the issue could more generally be about preprints. The most common platform for them (as used in bio.tools) besides bioRxiv is probably F1000Research. An article in F1000 can have subsequent versions, each with its own DOI (the difference between the DOIs is the last number marking the version). If a version gets enough approvals, it will be indexed in different databases. Among others, the version with enough approvals will be indexed in PubMed and receive a PMID and a PMCID, so the DOI of that version is the one that should probably be used in bio.tools.
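Since F1000Research DOIs differ only in the trailing version number, a small helper could separate the article part from the version. This is a sketch that assumes all relevant DOIs follow the `10.12688/f1000research.<article>.<version>` shape seen in this thread; `split_f1000_doi` and `same_f1000_article` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Assumed shape: 10.12688/f1000research.<article number>.<version number>
_F1000_DOI = re.compile(r"(10\.12688/f1000research\.\d+)\.(\d+)", re.IGNORECASE)


def split_f1000_doi(doi: str) -> Optional[Tuple[str, int]]:
    """Split an F1000Research DOI into (versionless base, version number)."""
    match = _F1000_DOI.fullmatch(doi.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def same_f1000_article(doi_a: str, doi_b: str) -> bool:
    """True if both DOIs are versions of the same F1000Research article."""
    a, b = split_f1000_doi(doi_a), split_f1000_doi(doi_b)
    return a is not None and b is not None and a[0].lower() == b[0].lower()
```

This only helps to spot DOIs that are versions of the same article; which version is actually the indexed one still has to come from a service such as Europe PMC or PubMed.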

Most F1000 DOIs in bio.tools are correct in that sense, but there are a few that could be changed. For example 10.12688/f1000research.9259.1, which is specified for https://bio.tools/genebreak.

Querying the Europe PMC API with this DOI returns two results: https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&query=doi:10.12688/f1000research.9259.1 One is from the PPR <source>, where we can see in <commentCorrection> that the DOI is a preprint of PMID 28713543. The corresponding DOI for this PMID is 10.12688/f1000research.9259.2, so the second version of this F1000 article is the one that has passed peer review. The other result is from the MED <source>, where we can directly get all metadata for the second version of the article. Unfortunately, the API does not always return the indexed version of an F1000 article from the MED <source> (for example, for 10.12688/f1000research.11022.1 only one result from the PPR <source> is returned, and 10.12688/f1000research.11022.3 needs to be queried explicitly to get the MED <source>), so it is better to go through the PMID found in <commentCorrection> of the PPR <source>.

The Altmetric widget is missing for https://bio.tools/genebreak, because the Altmetric API returns "Not Found" for version 1 of the article. But it returns the results for version 2: https://api.altmetric.com/v1/doi/10.12688/f1000research.9259.1 https://api.altmetric.com/v1/doi/10.12688/f1000research.9259.2

Publication metadata is missing for the same reason: version 1 doesn't return results in Scopus and version 2 does return the corresponding article.

However, neither the DOI 10.12688/f1000research.11022.1 nor the peer-reviewed version 10.12688/f1000research.11022.3 is found by Altmetric, and interestingly Scopus finds version 1 but not version 3, which should be the one actually indexed. I did not delve further. I guess this preprints business and versioning is relatively new and still a moving target.

Also, the latest available version might not always be the one to use. For example, the DOI 10.12688/f1000research.4952.3 is used for https://bio.tools/Pagal, but the version that received approvals in F1000 and is indexed in MEDLINE is actually the previous version 10.12688/f1000research.4952.2, as can also be seen on the PubMed page: http://www.ncbi.nlm.nih.gov/pubmed/25352981. The Scopus publication metadata and Altmetric widget are also missing, because these services also work with version 2 and not version 3.