cancervariants / metakb

Central repository for the VICC metakb web application
MIT License

How to handle PubMed documents from various sources #209

Open korikuzma opened 1 year ago

korikuzma commented 1 year ago

For our sources, we store PMIDs as Documents. Some sources provide more information than others: OncoKB gives only the PMID, whereas CIViC also gives authors and a description. An example is pmid:22663011. Currently, we just keep the record from whichever source loads that document first, since there is an ID constraint in the db. We should think about how we want to handle this as we add more sources. Should we combine source data? Should we prefix the ID with the source it came from?
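To make the "combine source data" option concrete, here is a minimal sketch, not the metakb implementation. The function name `merge_documents`, the record shape (a dict with an `id` key), and the first-source-wins rule per field are all assumptions made for illustration; the description string is a placeholder, not real CIViC data.

```python
def merge_documents(records):
    """Merge per-source document records that share an ID.

    `records` is an iterable of (source_name, doc) pairs, where `doc` is a
    dict with at least an "id" key. This shape is hypothetical.
    """
    merged = {}
    for source, doc in records:
        entry = merged.setdefault(
            doc["id"], {"id": doc["id"], "fields": {}, "sources": []}
        )
        entry["sources"].append(source)
        for key, value in doc.items():
            if key == "id" or value is None:
                continue
            # The first source to supply a field wins, and we record which
            # source it came from so provenance is not lost.
            entry["fields"].setdefault(key, {"value": value, "source": source})
    return merged


records = [
    ("oncokb", {"id": "pmid:22663011"}),
    ("civic", {"id": "pmid:22663011", "description": "placeholder description"}),
]
merged = merge_documents(records)
```

With this approach the ID constraint stays intact (one Document per PMID), and the source prefix moves from the ID onto each field.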

jsstevenson commented 5 months ago

Some of this metadata, like article title and authors, should be objectively determinable -- we could use NCBI esearch/efetch to grab a minimum set of attributes for every article, regardless of what a source supplies.
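A rough sketch of what that lookup could look like, using only the Python standard library. Note that it swaps in the E-utilities `esummary` endpoint rather than esearch/efetch, because esummary can return JSON directly. The helper names are hypothetical, and the JSON field names (`title`, `authors`, `fulljournalname`, `pubdate`) are assumptions that should be checked against a live response before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmid):
    """Build an esummary request URL for a single PubMed ID."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "json"})
    return f"{ESUMMARY_URL}?{query}"


def parse_summary(payload, pmid):
    """Pull a minimal attribute set out of a decoded esummary payload.

    Field names here are assumed, not confirmed by this thread.
    """
    record = payload["result"][str(pmid)]
    return {
        "pmid": str(pmid),
        "title": record.get("title"),
        "authors": [a["name"] for a in record.get("authors", [])],
        "journal": record.get("fulljournalname"),
        "pubdate": record.get("pubdate"),
    }


def fetch_summary(pmid):
    """Fetch and parse one record (performs a network request)."""
    with urlopen(esummary_url(pmid)) as resp:
        return parse_summary(json.load(resp), pmid)
```

Calling `fetch_summary("22663011")` would hit NCBI directly; a real loader would probably want batching, an API key, and rate limiting, none of which are shown here.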

That said, curated properties like a CIViC description definitely go above and beyond that.

jsstevenson commented 3 months ago

My proposal:

1) It's a job for eutils!
2) Given a DOI or PMID, fetch any basic metadata we might want (e.g. for display purposes: author list, title, date, journal, issue/vol/no, etc.). It's relatively safe to go between DOI <-> PMID, so we can pick one as an identifier when it's available.
3) If neither a DOI nor a PMID is available, just fill in what we can and figure out some way to identify it.
4) If sources DO provide PMID/DOI and additional metadata, we could check that the stuff they provide matches what comes out of our eutils lookup, and log/raise warnings for any discrepancies.
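The discrepancy check in the last point could be sketched roughly as follows. Everything here is illustrative: `find_discrepancies` and `normalize` are hypothetical names, both records are assumed to be plain dicts, and the normalization (whitespace, case, trailing period) is just one guess at what counts as a harmless difference.

```python
import logging

logger = logging.getLogger(__name__)


def normalize(value):
    """Reduce cosmetic differences before comparing metadata values."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold().rstrip(".")
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value


def find_discrepancies(source_name, supplied, reference):
    """Return the fields where source-supplied metadata disagrees with the
    reference lookup, logging a warning for each one.

    Fields the reference lookup does not cover (e.g. a curated description)
    are skipped rather than flagged.
    """
    mismatches = []
    for field, value in supplied.items():
        if value is None or field not in reference:
            continue
        if normalize(value) != normalize(reference[field]):
            mismatches.append(field)
            logger.warning(
                "%s: field %r differs from reference lookup: %r != %r",
                source_name, field, value, reference[field],
            )
    return mismatches


supplied = {
    "title": "An Example  Title.",
    "authors": ["Doe J"],
    "description": "curated text",
}
reference = {"title": "an example title", "authors": ["Doe J", "Roe R"]}
mismatched = find_discrepancies("example_source", supplied, reference)
```

Curated properties pass through untouched, so this only polices the attributes that should be objectively determinable.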