NCATS-Gamma / omnicorp


Identify articles with multiple versions and only use the most recent version #62

Open gaurav opened 4 years ago

gaurav commented 4 years ago

Some PubMed articles have multiple versions: for example, PMID 31431825 has four versions in PubMed. Given the stream-based parallel processing system Omnicorp currently uses, I don't think there's any way to identify groups of articles with the same PMID, and there don't appear to be any attributes in the XML to indicate which one is the "current" version (see documentation, example).

Currently, we process each version as a separate article, and so produce multiple copies of the triples for each article. To get a sense of the scale of this problem, this appears to affect 474 PMIDs, each of which has two or more versions.

I propose adding a script that runs before we start parallel processing of the entire corpus. This script will scan the whole corpus and write a list of PMIDs and their versions to a text file. The parallel processors can then skip every version of a PMID except the most recent one, ensuring that we don't include information from earlier versions in our output.
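A minimal sketch of such a pre-pass, written in Python purely for illustration (it is not Omnicorp code). It assumes that each record's version is carried in the `Version` attribute of the `PMID` element directly under `MedlineCitation`, and that the highest version number is the most recent one. The function names and the tab-separated output format are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_versions(xml_sources):
    """Map each PMID to the set of version numbers seen across all sources.

    Each source is a filename or a binary file object containing PubMed XML.
    """
    versions = defaultdict(set)
    for source in xml_sources:
        # Stream the file so that a large corpus file is never fully in memory.
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag != "MedlineCitation":
                continue
            # Only the direct child PMID identifies this record; PMIDs nested
            # deeper (e.g. in cross-references) are deliberately ignored.
            pmid = elem.find("PMID")
            if pmid is not None and pmid.text:
                # Assumption: a missing Version attribute means version 1.
                versions[pmid.text.strip()].add(int(pmid.get("Version", "1")))
            elem.clear()  # release the children of the processed record
    return versions


def write_latest_versions(versions, out):
    """Write 'PMID<TAB>latest version' for every PMID with multiple versions."""
    for pmid in sorted(versions):
        if len(versions[pmid]) > 1:
            out.write(f"{pmid}\t{max(versions[pmid])}\n")
```

Writing only the multi-version PMIDs keeps the text file small (hundreds or thousands of lines rather than one per article), so each parallel worker can load it cheaply.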

@cbizon Do you think this is the right approach for ROBOKOP? @balhoff Is there a cleverer way of figuring out which articles are the latest version that I'm missing?

balhoff commented 4 years ago

@gaurav I think that seems reasonable if the script is relatively quick.

gaurav commented 4 years ago

It does seem pretty quick: it took around 2.5 hours on the cluster without parallelization, and gave me a list of 1,358 PMIDs with multiple versions. As per our conversation in https://github.com/NCATS-Gamma/omnicorp/pull/63#issuecomment-610632620, I'll try using akka-stream/Monix/ZIO to parallelize it before I turn it into a pull request, which should speed it up by up to 32x (assuming 32 cores). For now, I'll focus on modifying Main so that it ignores all but the last version of each of these PMIDs and so avoids producing duplicates.
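The skip logic could look roughly like the following. This is a Python sketch for illustration only, not the actual change to Main. It assumes a hypothetical text file with one `PMID<TAB>latest version` line per multi-version PMID, and that articles reach the filter as `(pmid, version, payload)` tuples; all names here are made up.

```python
def load_latest_versions(lines):
    """Parse 'PMID<TAB>version' lines into a dict of PMID -> latest version."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        pmid, version = line.split("\t")
        latest[pmid] = int(version)
    return latest


def is_latest_version(pmid, version, latest):
    """True if this record should be processed.

    PMIDs absent from the list have only one version, so they always pass.
    """
    return latest.get(pmid, version) == version


def filter_latest(articles, latest):
    """Yield only the (pmid, version, payload) records that are current."""
    for pmid, version, payload in articles:
        if is_latest_version(pmid, version, latest):
            yield pmid, version, payload
```

Because the check needs only the PMID and version of the record in hand, it works inside a stream-based pipeline without grouping articles by PMID.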