Closed amykglen closed 4 years ago
Can we use equivalent_to/close_match relationships in KG2 to point curies to the same MESH term as their equivalent_tos/close_matches? (explored some here - seems it could add some mappings for KG1 nodes, but not a lot)
yes
Can we use SRI team's node normalizer to point curies to the same MESH term as their equivalent_curies? (explored some here - seems it could increase coverage of KG1 nodes to 18%)
yes; assume we would use CacheControl caching
Can we use KGNodeIndex.get_equivalent_curies to point curies to the same MESH term as their equivalent curies? (not yet explored, but should have more success for us than the node normalizer(?))
yes, and we might consider doing this before calling out to the SRI normalizer since one is a local DB lookup and the other involves calling a REST service
Can we use SemMedDB to get CUI -> PMID mappings, and then map all our curies to CUIs? (so MESH terms would not be in the picture here) (discussed here) (note: KGNodeIndex.get_equivalent_curies may help here too for finding curie -> CUI mappings)
I like the idea of having this fourth on the list, since it is not really curated by humans
Can we ETL/utilize Text Mining Provider (PI: Bill Baumgartner) as a larger(?)/better(?) SemMedDB/NCBIeUtils alternative?
Seems worth reaching out to Text Mining Provider to see if they have an example code fragment or documentation for accessing their KP.
Idea from Eric: use NodeSynonymizer to help map arbitrary curies to MeSH terms, thus increasing the hit rate to fast NGD
@amykglen volunteered to check the coverage to see if this idea would lead to a large/medium/small improvement to fastNGD
So one part that leaves me puzzled is when I look at PubMed XML records, I see this:
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">Melatonin</Keyword>
<Keyword MajorTopicYN="N">OPA1 and AMPK signaling pathway</Keyword>
<Keyword MajorTopicYN="N">mitochondrial fusion</Keyword>
<Keyword MajorTopicYN="N">mitophagy</Keyword>
<Keyword MajorTopicYN="N">myocardial ischemia reperfusion injury</Keyword>
</KeywordList>
</MedlineCitation>
I don't see MeSH IDs anywhere, which is why I'm puzzled. It seems like a straightforward process to scrape out these keywords, search the NodeSynonymizer for them, and associate PubMed IDs with KG2 nodes and their curies.
So what am I missing?
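A minimal sketch of that scraping idea, using only the stdlib XML parser (the sample record is abbreviated from the snippet above; the helper name is just illustrative):

```python
import xml.etree.ElementTree as ET

# Abbreviated PubMed citation, based on the KeywordList snippet above.
sample = """
<MedlineCitation>
  <PMID>12345</PMID>
  <KeywordList Owner="NOTNLM">
    <Keyword MajorTopicYN="N">Melatonin</Keyword>
    <Keyword MajorTopicYN="N">mitophagy</Keyword>
  </KeywordList>
</MedlineCitation>
"""

def extract_keywords(citation_xml):
    """Return (pmid, [keyword texts]) for one citation."""
    root = ET.fromstring(citation_xml)
    pmid = root.findtext("PMID")
    keywords = [kw.text for kw in root.iter("Keyword") if kw.text]
    return pmid, keywords

pmid, keywords = extract_keywords(sample)
print(pmid, keywords)  # 12345 ['Melatonin', 'mitophagy']
```

Each keyword string could then be fed to the NodeSynonymizer to look for a matching KG2 node.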
this is an example of some MESH IDs in a pubmed record (e.g., "D000070" for the first DescriptorName):
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000070" MajorTopicYN="N">Acebutolol</DescriptorName>
<QualifierName UI="Q000627" MajorTopicYN="N">therapeutic use</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000319" MajorTopicYN="N">Adrenergic beta-Antagonists</DescriptorName>
<QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D005260" MajorTopicYN="N">Female</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D006973" MajorTopicYN="N">Hypertension</DescriptorName>
<QualifierName UI="Q000188" MajorTopicYN="Y">drug therapy</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D010869" MajorTopicYN="N">Pindolol</DescriptorName>
<QualifierName UI="Q000627" MajorTopicYN="N">therapeutic use</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011247" MajorTopicYN="N">Pregnancy</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011249" MajorTopicYN="N">Pregnancy Complications, Cardiovascular</DescriptorName>
<QualifierName UI="Q000188" MajorTopicYN="Y">drug therapy</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D011433" MajorTopicYN="N">Propranolol</DescriptorName>
<QualifierName UI="Q000627" MajorTopicYN="N">therapeutic use</QualifierName>
</MeshHeading>
</MeshHeadingList>
but since the NodeSynonymizer creates its synonym 'groupings' based on names anyway, perhaps bypassing the actual MESH IDs and plugging the text name into the NodeSynonymizer is effectively the same thing?
though after #888, I suppose the NodeSynonymizer won't strictly be based on concept names. so perhaps it's still better to go through the MESH terms?
Either way!
python node_synonymizer.py --lookup=MESH:D011433
python node_synonymizer.py --lookup=Propranolol
yield the same extensive result. All you ever want to know about the concept.
You could try both. If you don't get an answer from either one, then it's not a node in KG2, so it's irrelevant for our purposes anyway! #888 will only increase our equivalent curies even further; it won't detract from name lookups or anything else.
For speed we'll want to do lookups in batches, so let's optimize the method you'd want to use before setting it loose on billions of PubMed Ids. It may even make sense to transfer the on-disk SQLite database to an in-memory one to speed things up even more.
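For the in-memory idea, the stdlib sqlite3 backup API (Python 3.7+) can copy the on-disk database into a `:memory:` connection; a minimal sketch (the demo database here is a throwaway stand-in, not the real NodeSynonymizer schema):

```python
import os
import sqlite3
import tempfile

def load_db_into_memory(path):
    """Copy an on-disk SQLite database into a new in-memory connection."""
    disk_conn = sqlite3.connect(path)
    mem_conn = sqlite3.connect(":memory:")
    disk_conn.backup(mem_conn)  # full copy of all tables and indexes
    disk_conn.close()
    return mem_conn

# Demo with a throwaway on-disk database:
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(tmp_path)
conn.execute("CREATE TABLE nodes (curie TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO nodes VALUES ('MESH:D011433', 'Propranolol')")
conn.commit()
conn.close()

mem = load_db_into_memory(tmp_path)
row = mem.execute("SELECT name FROM nodes WHERE curie = 'MESH:D011433'").fetchone()
print(row[0])  # Propranolol
```

After the copy, every lookup runs against RAM, which matters when the same connection serves millions of batched queries.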
For papers that don't have MeshHeadings, maybe we can also use the KeywordList that I mentioned above. The XML file I'm looking at doesn't have any MeshHeading elements - only a KeywordList. Strange. Or maybe automatically use both.
Anyway, some optimization will likely be needed, but I think we should totally be able to precompute everything we need and have an UltraFastNGD that never consults eutils and we can call it every time without slowing things down.
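For reference, once curie->PMID sets are precomputed locally, the distance itself is just set arithmetic. A sketch of the standard NGD formula applied to two local PMID sets (the corpus size N below is a placeholder, not the real PubMed article count):

```python
import math

# Placeholder corpus size; a real build would count distinct PMIDs seen.
N = 3.0e7

def ngd(pmids_x, pmids_y, n=N):
    """Normalized Google distance over two sets of PubMed IDs."""
    fx, fy = len(pmids_x), len(pmids_y)
    fxy = len(pmids_x & pmids_y)
    if fx == 0 or fy == 0 or fxy == 0:
        return None  # no co-occurrence data -> distance undefined
    return ((max(math.log(fx), math.log(fy)) - math.log(fxy))
            / (math.log(n) - min(math.log(fx), math.log(fy))))

a = {1, 2, 3, 4, 5}  # PMIDs mentioning concept A
b = {4, 5, 6, 7}     # PMIDs mentioning concept B
print(ngd(a, b))
```

Identical sets yield a distance of 0.0, and smaller overlaps push the value toward (and past) 1, so no eUtils call is ever needed at query time.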
ah, nice, that all makes sense. awesome! yes, I'll do a little preliminary exploration, report back, and then certainly will consult with you about optimizing the synonymizer method before I kick off any sort of full build process...
re:
then it's not a node in KG2, so is irrelevant for our purposes anyway
well, technically we also have BTE as a KP... :) but definitely think first priority is to get an UltraFastNGD that works with KG2.
yes, true about BTE, but I think KG2 already surpasses BTE or will soon do so, and while BTE may have some edges that KG2 doesn't have, do you think BTE has concepts in it that KG2 doesn't have at this point? It would be interesting to find examples. You found some mouse proteins, but I think we probably actively don't want those in KG2? It might be fun to keep a dump file of all Pubmed Mesh concepts that are NOT found in NodeSynonymizer, so that we can eyeball them and ponder why they were not found, and perhaps improve our system, either KG2 or NodeSynonymizer or both. For that matter, maybe Expander can keep a dumpfile of curies that it wanted to find in NodeSynonymizer but didn't and we can learn from that.
ok, started by estimating the number of KG2/KG1 nodes that are mappable to a MESH curie using the NodeSynonymizer - here are the results (these are averages of 20 batches of 4000 randomly selected nodes from each KG):
Estimated number of KG1 nodes mappable to a MESH curie using NodeSynonymizer: 20%
Estimated number of KG2 nodes mappable to a MESH curie using NodeSynonymizer: 12%
so that would still double our current coverage for KG1 nodes, but I'm guessing the text-based method would be even more successful, since it bypasses the whole need for a MESH curie. going to investigate that next.
status update: so after doing some thinking about how the MESH term name/keyword-based method will work, I can't think of a way to really programmatically estimate the coverage it would give us without basically starting to build it... (because it's kind of a backwards path to the prior approach: we need to scrape all the MESH term names/keywords out from all pubmed articles, feed those into the NodeSynonymizer, see if it returns any matching curies for each term/keyword, record those connections, and then in the end see what percentage of nodes we were able to link to terms/keywords.)
I don't think this will be a huge task though, fortunately... at least, the scraping-pubmed-articles portion of the problem should only take a few hours, I think... and if we have an optimized NodeSynonymizer method, that half of the problem shouldn't be too bad either. so @edeutsch - I think a method that takes in a list of names and then returns a dictionary of each name mapped to its corresponding curie synonyms (whether from KG2, KG1, or SRI) is what would be optimal. does that already exist or seem reasonable?
I suggest: no estimate needed. just go ahead and build it! It will be grand!
So the current method get_normalizer_results() can do what you envision, but it would definitely need to be optimized because it is too slow. Since we need this to be fast, a bespoke method might be warranted.
But I'm also wondering: might it not be better to use get_canonical_curies() to just fetch the canonical curie of each concept and associate papers with canonical curies rather than with all nodes in the KGs? This makes the coverage calculation a little harder, but not a big problem. And it requires a two-step process when computing NGD: instead of querying directly with the curie, one has to first determine the canonical curie and then query by that. One advantage is that the database can be much smaller. Maybe even the biggest table could be just two integers (pubmed_id and concept_id, an integer PK for each concept you can map), which could be super fast to index and query.
aha, maybe an even cooler idea: you could imagine a concept table where each concept has a PK id, then a PubMed (MeSH) name, and then a canonical curie. You could even create concept rows for everything in pubmed, independently of whether we can map it to curies today. Then we can have a separate process that can get progressively smarter about mapping MeSH names to canonical curies without having to rescan all of PubMed. So have two processes: one that associates pubmed_ids with keywords (each keyword gets a concept row and concept_id PK, and the linking table is just pubmed_id and concept_id), and a second, independent process that associates those keywords with our curies - either all node curies or canonical curies only, depending on what we want to do. We can then easily ask: which concepts can we not find curies for? Maybe some simple text munging can resolve a lot more (like writing code that can associate "diabetes, type II" with "type 2 diabetes", etc.) in a faster manner than starting from the beginning again (i.e., just improving the concept table).
One possible snag is that two concept_ids (with different names) might map to the same canonical curie. Awkward, but we could do something sensible with that.
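The two-process schema described above might look something like this in SQLite (table and column names are illustrative, not a final design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,        -- PubMed/MeSH keyword text
    canonical_curie TEXT              -- NULL until the mapping pass runs
);
CREATE TABLE concept_pmid (
    pubmed_id INTEGER NOT NULL,
    concept_id INTEGER NOT NULL REFERENCES concept(concept_id)
);
CREATE INDEX idx_cp_concept ON concept_pmid(concept_id);
""")

# Process 1: associate pubmed_ids with keywords (no curies involved yet).
conn.execute("INSERT INTO concept (name) VALUES ('Propranolol')")
cid = conn.execute(
    "SELECT concept_id FROM concept WHERE name = 'Propranolol'").fetchone()[0]
conn.executemany("INSERT INTO concept_pmid VALUES (?, ?)",
                 [(111, cid), (222, cid)])

# Process 2 (later, independent): map concept names to canonical curies.
conn.execute(
    "UPDATE concept SET canonical_curie = 'MESH:D011433' WHERE name = 'Propranolol'")

rows = conn.execute("""
    SELECT p.pubmed_id FROM concept_pmid p
    JOIN concept c ON c.concept_id = p.concept_id
    WHERE c.canonical_curie = 'MESH:D011433'
    ORDER BY p.pubmed_id
""").fetchall()
pmid_list = [r[0] for r in rows]
print(pmid_list)  # [111, 222]
```

Improving the name-to-curie mapping then means re-running only process 2 over the concept table, never rescanning PubMed.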
Another idea: writing the process to inherently work in batches may be a speed bonus. NodeSynonymizer works faster in batches, and SQL INSERTs work much faster in batches. I INSERT records in batches of 5000 while building the NodeSynonymizer database and that seems to work reasonably speedily; doing them individually is slower. Maybe process 10000 publications in a batch, keep concept ids in an in-memory hash during the whole step 1 process, and write out pubmed_id-concept_id mappings in batches of 5000. Do NodeSynonymizer lookups in batches of 10000, too.
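A sketch of that batched-INSERT pattern: accumulate (pubmed_id, concept_id) pairs and flush them with executemany() every 5000 rows, plus a final partial flush (the table and loop are stand-ins for the real build):

```python
import sqlite3

BATCH_SIZE = 5000

def flush(conn, buffer):
    """Write the buffered rows in one executemany() call and clear the buffer."""
    conn.executemany("INSERT INTO concept_pmid VALUES (?, ?)", buffer)
    conn.commit()
    buffer.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept_pmid (pubmed_id INTEGER, concept_id INTEGER)")

buffer = []
for pubmed_id in range(12000):      # stand-in for scanning publications
    buffer.append((pubmed_id, pubmed_id % 7))
    if len(buffer) >= BATCH_SIZE:
        flush(conn, buffer)
flush(conn, buffer)                 # flush the final partial batch

count = conn.execute("SELECT COUNT(*) FROM concept_pmid").fetchone()[0]
print(count)  # 12000
```

One commit per batch instead of per row is where most of the speedup comes from.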
Two other ideas that are still bubbling in my head are: 1) Can we do better than just MeSH keywords? a) Can we also use the KeywordList (above) to create and associate concepts? b) Can we build a (separate) simple text miner that looks for node names from the NodeSynonymizer in pubmed abstracts? and store that in the same way? especially protein names/accession numbers? 2) Can we follow on Noel's idea and potentially create edges between concepts that have none? Maybe something like:
I don't advocate these latter bubbly ideas for the first pass, but maybe we can build a database that could facilitate that later?
Apologies for the lengthy brain dump, but maybe some ideas will be useful!
nice! yes, great thoughts... some responses: the one snag I saw with storing only canonical curies was compute_ngd trying to look up values for non-canonicalized curies - but you're right that it would be pretty easy to just canonicalize the curie prior to lookup, so that's a non-issue. I think coverage calculation will be fine too, as I can just use the NodeSynonymizer for that as well.. so I like that plan! so get_canonical_curies() accepts names as input? and I like your 1b) and 2) ideas, though agree it makes sense to save them for another pass.
great!
so get_canonical_curies() accepts names as input?
not at the moment. But I can add right now.
The only design decision is: should I do this:
result = node_synonymizer.get_canonical_curies(curies=curie_list)
result = node_synonymizer.get_canonical_curies(names=name_list)
or this:
result = node_synonymizer.get_canonical_curies(potentially_mixed_names_and_or_curies_list)
?
The latter is potentially easier for the user, but probably slower (because I'd need to do two queries) and more complicated to code.
I'm fine with the former method (curies=curie_list/names=name_list). (And I think I'm the only one who uses get_canonical_curies at the moment, so it's easy for me to update my call to it if needed.)
okay, committed to master:
result = node_synonymizer.get_canonical_curies(curies=curie_list, names=name_list)
unlimited batch size. minimally tested. please report issues.
awesome! I'll start some testing of that method soon..
fyi, Steve is upping the memory on pubmed.rtx.ai for me, and then I'll get the scraping-all-of-pubmed portion of the build kicked off on there.. :)
for the record and for clarity's sake: I did a bit more digging into what different elements there are in pubmed xml files, and the ones I'm planning to use (i.e., extract their text content and use it as a key) are:
MESH DescriptorNames and QualifierNames, like seen here:
<MeshHeading>
<DescriptorName UI="D004798" MajorTopicYN="N">Enzymes</DescriptorName>
<QualifierName UI="Q000652" MajorTopicYN="Y">urine</QualifierName>
</MeshHeading>
NameOfSubstances from ChemicalList, like seen here:
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D008055">Lipids</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="C015518">sulfolipids</NameOfSubstance>
</Chemical>
</ChemicalList>
Keywords, like seen here:
<KeywordList Owner="NOTNLM">
<Keyword MajorTopicYN="N">GLV (green leaf volatile)</Keyword>
<Keyword MajorTopicYN="N">HIPV (herbivory-induced plant volatile)</Keyword>
<Keyword MajorTopicYN="N">Nicotiana attenuata</Keyword>
</KeywordList>
and GeneSymbols, like seen here:
<GeneSymbolList>
<GeneSymbol>bcl-2</GeneSymbol>
<GeneSymbol>myc</GeneSymbol>
</GeneSymbolList>
let me know if any of those elements seem like a bad idea... (some documentation on what they capture is here: https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html)
(the original pickleDB fastNGD system only extracts DescriptorName UIs, fyi..)
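For clarity, a sketch of pulling all four element types out of one citation with the stdlib parser (the sample is abbreviated from the snippets above; tag names come straight from the PubMed XML):

```python
import xml.etree.ElementTree as ET

# Abbreviated citation combining the four element types listed above.
sample = """
<MedlineCitation>
  <ChemicalList>
    <Chemical><RegistryNumber>0</RegistryNumber>
      <NameOfSubstance UI="D008055">Lipids</NameOfSubstance></Chemical>
  </ChemicalList>
  <MeshHeadingList>
    <MeshHeading>
      <DescriptorName UI="D004798" MajorTopicYN="N">Enzymes</DescriptorName>
      <QualifierName UI="Q000652" MajorTopicYN="Y">urine</QualifierName>
    </MeshHeading>
  </MeshHeadingList>
  <KeywordList Owner="NOTNLM">
    <Keyword MajorTopicYN="N">Nicotiana attenuata</Keyword>
  </KeywordList>
  <GeneSymbolList><GeneSymbol>bcl-2</GeneSymbol></GeneSymbolList>
</MedlineCitation>
"""

TAGS = ("DescriptorName", "QualifierName", "NameOfSubstance",
        "Keyword", "GeneSymbol")

def extract_terms(citation_xml):
    """Collect the text content of every element type we plan to key on."""
    root = ET.fromstring(citation_xml)
    return [el.text for tag in TAGS for el in root.iter(tag) if el.text]

print(extract_terms(sample))
# ['Enzymes', 'urine', 'Lipids', 'Nicotiana attenuata', 'bcl-2']
```

Each extracted string would then become a key for NodeSynonymizer lookup.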
Seems like a good idea to record all those different kinds of keywords, I think.
I wonder if it might be worth storing and using the identifiers, where provided, as well. NodeSynonymizer certainly has those terms above:
python node_synonymizer.py --lookup MESH:D004798 --kg_name KG2
But it also has the names, so the likelihood of it having the identifier but not the name is probably small, if not zero. If you did keep them, then we could ask in which cases lookup by identifier and lookup by name yield different answers, which might shed light on a problem.
One other random thought/concern that I didn't mention earlier: I wonder if we can/should try to determine which species each paper is about. If a paper can be identified to be about Arabidopsis, say, maybe it would be better to just discard the paper rather than potentially pollute the index with unrelated information. Unclear whether the danger of unjustified exclusion outweighs the benefit of justified exclusion, though. Probably too hard to tackle now, but I'll toss it out there.
hmm, I'm seeing this error when I send a large batch of names to get_canonical_curies(), @edeutsch:
Sending NodeSynonymizer.get_canonical_curies() a list of 26120 concept names..
Traceback (most recent call last):
File "build_ngd_database.py", line 169, in <module>
main()
File "build_ngd_database.py", line 165, in main
database_builder.build_curie_to_pmids_db()
File "build_ngd_database.py", line 90, in build_curie_to_pmids_db
canonical_curies_dict = synonymizer.get_canonical_curies(names=concept_names)
File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Overlay/ngd/../../../NodeSynonymizer/node_synonymizer.py", line 1222, in get_canonical_curies
cursor.execute( sql )
sqlite3.OperationalError: near "association": syntax error
(no error when I send a batch of only 79 names.)
the input list causing the error is in this file: input_names_causing_error.txt
@edeutsch - found a much smaller list for which that kind of error occurs:
['Campylobacter Infections', 'lomefloxacin', "Practice Patterns, Physicians'", 'Drug Resistance, Microbial']
Traceback (most recent call last):
File "build_ngd_database.py", line 173, in <module>
main()
File "build_ngd_database.py", line 169, in main
database_builder.build_curie_to_pmids_db()
File "build_ngd_database.py", line 94, in build_curie_to_pmids_db
canonical_curies_dict = synonymizer.get_canonical_curies(names=['Campylobacter Infections', 'lomefloxacin', "Practice Patterns, Physicians'", 'Drug Resistance, Microbial'])
File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Overlay/ngd/../../../NodeSynonymizer/node_synonymizer.py", line 1222, in get_canonical_curies
cursor.execute( sql )
sqlite3.OperationalError: near "drug": syntax error
looks like perhaps it's related to the quote within a quote in "Practice Patterns, Physicians'"?
Sorry, yes, a quote problem. I have just fixed in master
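For what it's worth, this class of bug disappears entirely if the SQL is built with bound parameters rather than string interpolation; a minimal illustration (toy table and dummy curie, not the real NodeSynonymizer schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (name TEXT, curie TEXT)")
# The embedded apostrophe would break interpolated SQL, but a bound
# parameter handles it safely. 'MESH:D000000' is a dummy curie.
conn.execute("INSERT INTO node VALUES (?, ?)",
             ("Practice Patterns, Physicians'", "MESH:D000000"))

names = ["Practice Patterns, Physicians'", "lomefloxacin"]
placeholders = ",".join("?" * len(names))
rows = conn.execute(
    f"SELECT name, curie FROM node WHERE name IN ({placeholders})", names
).fetchall()
print(rows)  # -> one matching row, apostrophe intact
```

Only the placeholder list is interpolated; the values themselves never touch the SQL string, so no escaping is needed.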
awesome - confirmed it's working for me now. thanks!
ok, so the build is all done - an update: the final database (curie_to_pmids.db) is 4 GB. I started trying to estimate the 'back-up' NGD method's coverage (though it's a little bit difficult to do due to slowness/API errors)... but it's seeming like it's somewhere around 40-50% for KG2 nodes (or, at least, that's the approximate portion that NormGoogleDistance.get_pmids_for_all() returns PMIDs for).
it seems like it'd be useful to know more about what types of nodes we're missing with the new system, so I think I will do a little analysis of that..
outstanding!
I agree that finding some examples where eutils NGD differs substantially from local NGD and trying to understand that would be instructive.
fyi - some estimates as to the degree to which different node types are covered in the new system...
Estimated coverage of KG1 nodes: 34%.
chemical_role: 100%
activity: 100%
continuant: 100%
organism_taxon: 100%
biological_molecular_complex: 100%
occupation: 100%
behavior: 100%
inheritance: 100%
concept: 86%
anatomical_entity: 68%
protein: 66%
chemical_substance: 61%
disease: 50%
product: 50%
unknown_category: 48%
RNA: 23%
phenotypic_feature: 20%
cellular_component: 18%
pathway: 12%
gene: 8%
finding: 7%
biological_process: 6%
disease_susceptibility: 6%
metabolite: 5%
named_thing: 4%
molecular_function: 2%
gene_set: 0%
deprecated_node: 0%
substance: 0%
Estimated coverage of KG2 nodes: 17%.
chemical_role: 100%
semantic_type: 100%
diagnostic_or_prognostic_factor: 100%
environmental_feature: 100%
occupation: 69%
disease_characteristic: 50%
application: 50%
unknown_category: 46%
behavior: 42%
protein: 34%
phenotypic_quality: 31%
disease: 30%
group_of_people: 29%
organism_taxon: 28%
phenotypic_feature: 27%
organization: 27%
language: 23%
information_entity_type: 21%
biological_molecular_complex: 20%
chemical_substance: 17%
named_thing: 16%
cellular_component: 16%
biological_substance: 14%
concept: 13%
product: 12%
substance: 12%
property: 12%
disease_susceptibility: 11%
biological_process: 10%
anatomical_entity: 9%
mechanism_of_action: 9%
biological_role: 9%
gene: 8%
conceptual_entity: 8%
continuant: 7%
pathway: 6%
device: 6%
metabolite: 6%
activity: 3%
molecular_function: 3%
finding: 2%
phenomenon_or_process: 2%
gene_set: 2%
attribute: 1%
drug: 1%
deprecated_node: 0%
cell_type: 0%
relationship_type: 0%
hazardous_substance: 0%
therapy: 0%
method: 0%
prevalence: 0%
RNA: 0%
data_source: 0%
protein_family: 0%
(note: the 'preferred_type' for each node was used to calculate these data.)
so I still need to do some comparisons to eUtils to start to see what it's finding results for that we're not, but because it was easy to do, I first tried extracting PMIDs from the publications property on nodes/edges in KG2 and adding those to our big curie->PMID database.
with those in the picture, estimated coverage for the local ngd system is 58% for KG1 nodes and 34% for KG2.
this does take us into the realm of using non-human-curated data (the majority of the edges with publications are from SemMedDB), so it's perhaps not as ideal, but the coverage boost seems pretty significant.
the final database is 5.4 GB with this additional data.
ok, per the plan discussed at today's mini-hackathon, I've now:
- put the database on the server (in /home/ubuntu/databases_for_download/)
- made ComputeNGD automatically download it from the server if it doesn't already exist (no versioning yet)
- pushed to master (all pytests passing)
so now if you run the pytest suite after pulling (and running pip install -r requirements.txt), the first test that uses overlay(action=compute_ngd) will take a few minutes extra as it downloads the database. but then after that, ngd should be much faster. (@dkoslicki)
(still improvements to be made, but it is at least much faster than the current system - hence the intermediary deployment.)
Awesome, thanks @amykglen! Trying it out now...
fyi - I just found a bug that was preventing the fast NGD database from being used (the download step was completing successfully, but the path used for the DB was incorrect) - just tested and pushed a fix to master.
(so if you tried it before, I think it would've only been using the slow back-up eUtils method. but in the queries I've tested I'm seeing fast NGD usage rates around 90-100%.)
ok, I think I'm going to finally close this issue - I'm seeing an ~95%+ utilization rate of the fastNGD system for ARAX queries on test/production. some examples of 'misses' are written up in #1046 and #1047 (as @edeutsch requested), and I think any discussion of whether we want to pursue even greater coverage can happen over there.
I think 95% is a sensational utilization rate! This has moved NGD values from a slow painful process that I was reluctant to do to something that we can do always. Many thanks for getting this to work!
Our pickleDB fastNGD system seems to work well, except for the fact that it only has coverage of about 10% of KG1 nodes, which means the slower back-up NGD method often has to be used. The coverage is low because we're lacking arbitrary curie -> MESH mappings, and such mappings aren't exactly easy to find (except for using NCBI eUtils, which is very slow and we haven't found a data dump for). (Description of the problem here.)
Our initial attempt used OxO to grab as many curie->MESH mappings as we could for KG1 nodes, but we've since thought of some other paths to explore/pursue:
- Can we use equivalent_to/close_match relationships in KG2 to point curies to the same MESH term as their equivalent_tos/close_matches? (explored some here - seems it could add some mappings for KG1 nodes, but not a lot)
- Can we use the SRI team's node normalizer to point curies to the same MESH term as their equivalent_curies? (explored some here - seems it could increase coverage of KG1 nodes to 18%)
- Can we use KGNodeIndex.get_equivalent_curies to point curies to the same MESH term as their equivalent curies? (not yet explored, but should have more success for us than the node normalizer(?))
- Can we use SemMedDB to get CUI -> PMID mappings, and then map all our curies to CUIs? (note: KGNodeIndex.get_equivalent_curies may help here too for finding curie -> CUI mappings)
Also note: The mapping work done so far has only been for KG1 nodes, but we will also need mappings for KG2 and BTE curies, since we now use them as KPs as well...
For reference: