Open andrewsu opened 2 years ago
@andrewsu This is two parts:
An explanation of the second point:
For the core biothings APIs, the data is organized by entity so MyDisease.info is organized by Disease. When querying from Disease -> Gene, we can look up everything under that disease's disgenet.genes_related_to_disease section, which includes all of the information in the second screenshot.
When we want to query from Gene -> Disease, however, we need to match the gene ID, i.e. a specific record under the disgenet.genes_related_to_disease section. The query still retrieves everything under that section (not just the specific record that has that gene ID) because the data is structured by disease.
For example, POST this query starting with the Gene NDUFA1 (4694) to https://mydisease.info/v1/query?fields=disgenet.xrefs,disgenet.genes_related_to_disease:
{
"q": "4694",
"scopes": "disgenet.genes_related_to_disease.gene_id"
}
The response includes diseases where ONE of their objects matches the query, but it includes ALL of the genes related to those diseases rather than only the objects that have the matching gene...
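To make the problem concrete, here is a minimal sketch of the client-side filtering that would be needed to keep only the objects matching the queried gene. The helper names (`as_list`, `keep_matching_genes`), the placeholder `_id`, and the `score` values are assumptions for illustration only; just `disgenet.genes_related_to_disease` and `gene_id` come from the query above.

```python
from typing import Any


def as_list(value: Any) -> list:
    """Wrap a single object in a list so callers can always iterate."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def keep_matching_genes(hit: dict, gene_id: int) -> dict:
    """Return a copy of the hit whose gene list only holds the queried gene."""
    disgenet = dict(hit.get("disgenet", {}))
    genes = as_list(disgenet.get("genes_related_to_disease"))
    disgenet["genes_related_to_disease"] = [
        g for g in genes if str(g.get("gene_id")) == str(gene_id)
    ]
    return {**hit, "disgenet": disgenet}


# Hypothetical, heavily trimmed hit (placeholder ID and scores, not real data).
example_hit = {
    "_id": "EXAMPLE:0001",
    "disgenet": {
        "genes_related_to_disease": [
            {"gene_id": 4694, "score": 0.3},
            {"gene_id": 1234, "score": 0.1},
        ]
    },
}

filtered = keep_matching_genes(example_hit, 4694)
print(filtered["disgenet"]["genes_related_to_disease"])
```

The original hit is left untouched, so the unfiltered gene list is still available if another step needs it.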
I hit a similar problem when trying to make more specific queries to map to more specific biolink predicates (like marker/mechanism under MyDisease's CTD Disease-Chemical information). I describe another example in the notes here. Because I get all the objects under the disease back rather than the matching objects only, I cannot make more specific queries...
Returning to this: this is an inherent limitation from how these records are structured (and indexed and retrieved - the querying process). ~Going to propose closing this unless we plan to address it...~
After discussion with Andrew 12/6, we decided to keep this open as a non-critical thing....to discuss + maybe work on when there is time...
If this issue can be addressed through api_response_transform or elsewhere in records handling, we might now be in a better position to address this?
@tokebe The last time I talked about it with Andrew, it seemed kinda hard...
I think this is a limitation imposed by the document-structure / biothings querying ability itself.
This isn't an issue for "association-based" APIs, AKA where the structure is "one document per association" and all the info on the association is kept in a separate part of the document from the entity IDs.
As soon as a document has parts (like multiple associations in 1 document, each document represents 1 of the entity IDs)....this problem happens.
Hmm...this seems like it should be possible with post-processing in the transformer, but I agree that this would have to be basically on a per-API basis. We'd have to write new transformers for this, so it makes sense this should remain non-critical until we have more bandwidth.
@tokebe Can you explain how it could be done with post-processing in the transformer? I was looking into this issue a bit and it seems like when querying from Gene->Disease, the disgenet score/information is missing so there would need to be other queries done to retrieve this information again.
Also, would it be possible/practical to have something that says that mydisease should always be queried starting from Disease?
I was under the impression that the issue is that querying Gene->Disease returns the whole document, from which we currently don't have the logic to pull out the disgenet score/information? This would be in the `untransformedHits` prior to going through the transformer.
If this isn't the case, then yes, we'd have to come up with some other method of retrieving the additional information. I'm not sure exactly how practical it might be to specifically query mydisease Disease-first always, though it might be relatively doable with a custom query builder. This would still require a custom transformer, however, and some additional logic to ensure records are created in the correct direction.
The preference would definitely be to post-process `untransformedHits` in a new transformer over custom querying logic, if possible.
I did some more investigation and it doesn't grab the disgenet score/information at all when querying from Gene->Disease (the params pulled from the x-bte are completely different). I think what @colleenXu is saying above is this is a limitation of how the data is structured. So if we were to take the post-processing route, we would have to make a whole nother query to retrieve the disgenet score/info document, process that, then reincorporate it into the results.
Below are the query configs and the resulting unTransformedHits:
I suppose this makes the query-direction route more viable -- we'd need a separate query builder for mydisease that checks the subject/object semantic type and queries in reverse appropriately. It would have to somehow tag this such that the record is constructed in reverse of the query where appropriate as well.
Perhaps a `reverseAfterQuery` value that can be attached to the `query_info` object, which is passed to the transformer to instruct it to construct the record in reverse (anywhere else, such as reversing post-record-built, would cause issues with directionality down the line).
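A rough sketch of how such a flag might be consumed, written in Python for brevity rather than in the project's own language. Only the `reverseAfterQuery` and `query_info` names come from the proposal above; the `Record` class and the `build_record` function are hypothetical and do not reflect the actual transformer code.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    subject: str
    object: str
    attributes: dict = field(default_factory=dict)


def build_record(query_info: dict, input_id: str, output_id: str,
                 attributes: dict) -> Record:
    """Build a record, swapping direction when the query was run in reverse.

    input_id is the ID sent to the API and output_id the ID found in the
    response. By default the record runs input -> output.
    """
    subject, obj = input_id, output_id
    if query_info.get("reverseAfterQuery", False):
        # The API was queried in the opposite direction to the user's query,
        # so flip the IDs while the record is being built, not afterwards.
        subject, obj = obj, subject
    return Record(subject=subject, object=obj, attributes=attributes)


forward = build_record({}, "ID:A", "ID:B", {})
reverse = build_record({"reverseAfterQuery": True}, "ID:A", "ID:B", {})
print(forward.subject, "->", forward.object)
print(reverse.subject, "->", reverse.object)
```

Doing the swap at construction time keeps every later step working with a record whose direction already matches the user's query.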
Err....I was out when this convo started but perhaps some more explanation / my perspective can help.
I am in agreement with Jackson's points here, and that working with the unTransformedHits is better.
But to develop that code, one will have to mutate the smartapi specs or work with a custom version of the smartapi yaml where the fields are specified differently (to retrieve all the info available in forward querying, during a reverse query).
Notice the query I give in my original post. This query doesn't have the same "fields" specified as the query in the x-bte annotation right now, because we don't have the features to correctly process it (it would just be extra data to send over the internet / ignore while processing).
Noting a related old discussion (internal lab Slack link): besides the "reverse" issue here, there's an issue of not being able to get a subset of the response. This is a problem when we want to treat those subsets differently (ex: assigning different biolink predicates or edge-attributes for the TRAPI response).
Also pasted below:
colleenxu Oct 14 2021 is there a way to query and only get the part of the document that matches back? For example, I can query https://mydisease.info/v1/query?q=disgenet.variants_related_to_disease.source:CLINVAR&fields=disgenet.variants_related_to_disease,disgenet.xrefs but it'll return ALL the variants for the disease when maybe only 1-2 of those variants actually match source:CLINVAR...
Jerry Oct 15 2021 colleenxu not supported through biothings but if you can directly query es, you can use https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html
colleenxu Oct 15 2021 I don't think I'll do direct es...but I don't understand how highlighting would do what I specified above. This sounds a bit closer to what I wanted to do...https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html#post-filter ....
Jerry Oct 15 2021 Highlighting shows where the matches are, you do have to further process the result to use it. Post filter applies to the case when you use both search and aggregation, it's not what you're looking for. :) (edited)
colleenxu Oct 15 2021 I see now, further processing would be needed to "filter out" stuff that didn't have the highlighting...
Not clear if post-processing improvements (JQ-related, using biothings apis query abilities) will help overcome the issues here. Linking to #656, #489 and #521
JQ could be able to help. For this example, the following wrap filter could be used: .disgenet.variants_related_to_disease |= list_filter_any(["source:CLINVAR"]) (since there is only one filter being used list_filter_all and list_filter_any would do the same thing)
Noting that we previously decided list_filter could not be used to address the "reverses" issue: see the internal Slack discussion starting here. It's hard to paste the whole convo here, but I may do it later...
We still have issues with not being able to retrieve all the information on the association in "reverse" direction.
I was able to get MyChem aeolus count info to show in the reverse direction, by doing a non-batch query and using jmespath (only show the part of the json object that matches the starting ID).
I was also able to get MyChem Chembl treats reference info to show in the reverse direction (see commit).
But I wasn't able to use the same method to get the MyChem chembl drug-mechanism clinicaltrial info to show (`drugMechChemblEnsembl-rev`, `drugMechChemblUniprot-rev`):
* `chembl.drug_mechanisms.target_components.uniprot`, I get a 500 error (GET query version).
* `chembl.drug_mechanisms.target_components` section, when I want the entire `chembl.drug_mechanisms` section removed if `chembl.drug_mechanisms.target_components.uniprot` doesn't match the starting ID

```
curl --location --globoff "https://mychem.info/v1/query?size=1000&fields=chembl.molecule_chembl_id,chembl.drug_mechanisms&jmespath=chembl.drug_mechanisms.target_components|[?uniprot=='Q16602']" \
--header 'Content-Type: application/json' \
--data '{
    "q": ["Q16602"],
    "scopes": ["chembl.drug_mechanisms.target_components.uniprot"]
}'
```
And MyVariant's `civic-geneDisease-rev` has the same problem (example GET query with 500 error), but including all the info for the variant (rather than the variant-disease pair) would probably be more of a problem:
* `civic.evidence_items.disease` section rather than the whole `civic.evidence_items` section

```
curl --location --globoff "https://myvariant.info/v1/query?size=1000&fields=civic.entrez_id,civic.evidence_items&jmespath=civic.evidence_items.disease|[?doid=='DOID:9256']" \
--header 'Content-Type: application/json' \
--data '{
    "q": "DOID:9256",
    "scopes": "civic.evidence_items.disease.doid"
}'
```
Here's a list of the entity-based BioThings APIs (affected by the reverses issue):
These are the reverse operations where the forward direction has publication info that would be nice to retrieve. I organized by what seems doable now with jmespath (related to #733?)
* `BPToGene`, `MFToGene`, `CCToGene`
* `gene-disease`, `variant-disease`, `phenotype-disease`, `phenotype-disease2`, `chemical-disease`, `chemical-disease2`
* `drugMechChemblEnsembl-rev`, `drugMechChemblUniprot-rev` (see discussion above)
* `civic-geneDisease-rev` (see discussion above), `civic-variantDisease-rev` (haven't tried yet, but should have basically the same problem)
* `geneToDisease`: see post below

I found a different jmespath issue with MyGene `geneToDisease` while working on #803
Query:

```
curl --location 'https://mygene.info/v3/query?size=1000&fields=entrezgene%2Cclingen' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["MONDO:0100283"],
    "scopes": "clingen.clinical_validity.mondo"
}'
```

Example hits:
* first hit has 2 `clingen.clinical_validity` objects, where 1 matches the disease I queried.
* VS the second hit has 1 object

```
{
    "query": "MONDO:0100283",
    "_id": "10000",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": [
            {
                "classification": "definitive",
                "classification_date": "2021-07-29T21:34:39.431Z",
                "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0100283",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                "sop": "SOP7"
            },
            {
                "classification": "limited",
                "classification_date": "2021-10-26T15:00:30.155Z",
                "disease_label": "microcephaly",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0001149",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_6e3b524c-5d27-43d6-a0db-4f8f7cf1f872-2021-10-26T150030.155Z",
                "sop": "SOP8"
            }
        ]
    },
    "entrezgene": "10000"
},
{
    "query": "MONDO:0100283",
    "_id": "5296",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": {
            "classification": "definitive",
            "classification_date": "2021-07-29T21:36:16.452Z",
            "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
            "gcep": "Brain Malformations",
            "moi": "AD",
            "mondo": "MONDO:0100283",
            "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_fc9a451e-0e75-47d2-a090-a2409732c465-2021-07-29T213616.452Z",
            "sop": "SOP7"
        }
    },
    "entrezgene": "5296"
},
```
Query:

```
curl --location --globoff "https://mygene.info/v3/query?size=1000&fields=entrezgene,clingen&jmespath=clingen.clinical_validity|[?mondo=='MONDO:0100283']" \
--header 'Content-Type: application/json' \
--data '{
    "q": ["MONDO:0100283"],
    "scopes": "clingen.clinical_validity.mondo"
}'
```

Those same example hits:
* first hit looks how I expect: there's now 1 `clinical_validity` object that matches the disease queried (vs 2 before)
* VS the second hit now has `null`. But the `clinical_validity` object it had before matched the disease queried...

```
{
    "query": "MONDO:0100283",
    "_id": "10000",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": [
            {
                "classification": "definitive",
                "classification_date": "2021-07-29T21:34:39.431Z",
                "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0100283",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                "sop": "SOP7"
            }
        ]
    },
    "entrezgene": "10000"
},
{
    "query": "MONDO:0100283",
    "_id": "5296",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": null
    },
    "entrezgene": "5296"
},
```
I suspect jmespath is having an issue with the array (multiple `clinical_validity` objects) vs object (1 `clinical_validity` object) in the original document...
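The suspicion above is consistent with how JMESPath defines filter expressions: `[?...]` only applies to arrays and yields `null` for any other type. Below is a small pure-Python sketch that mimics that rule (it does not call the jmespath library or the BioThings server, and `filter_projection` / `as_list` are made-up helper names), showing how wrapping a lone object in a list before filtering would avoid the `null`. Whether the server actually behaves this way internally is an assumption.

```python
def filter_projection(value, predicate):
    """Mimic a JMESPath filter ([?...]): defined for arrays only, else None."""
    if not isinstance(value, list):
        return None
    return [item for item in value if predicate(item)]


def as_list(value):
    """Normalize 'one object' and 'array of objects' to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def is_match(clinical_validity):
    return clinical_validity.get("mondo") == "MONDO:0100283"


# Trimmed versions of the two hits: gene 10000 has an array, gene 5296 an object.
array_form = [{"mondo": "MONDO:0100283"}, {"mondo": "MONDO:0001149"}]
object_form = {"mondo": "MONDO:0100283"}

print(filter_projection(array_form, is_match))            # matching object kept
print(filter_projection(object_form, is_match))           # None, like the null hit
print(filter_projection(as_list(object_form), is_match))  # match kept after wrapping
```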
Made issues for the jmespath stuff I'm seeing:
* `geneToDisease` in https://github.com/biothings/biothings_explorer/issues/803

Potential breakthrough: using a new parameter `jmespath_exclude_empty: true` to remove hits that don't fit multiple criteria. Don't know if this is only live on MyChem or on all BioThings yet. See example https://github.com/biothings/biothings_explorer/issues/727#issuecomment-2046058611
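For comparison, a client-side analogue of dropping hits whose filtered section came back empty might look like the sketch below. This is only an illustration of the idea, not the server's implementation of `jmespath_exclude_empty`; the `get_path` and `drop_empty_hits` helpers are hypothetical, and the sample hits are trimmed from the MyGene example earlier in the thread.

```python
def get_path(doc, dotted_path):
    """Follow a dotted path through nested dicts; return None if it breaks."""
    for key in dotted_path.split("."):
        if not isinstance(doc, dict):
            return None
        doc = doc.get(key)
    return doc


def drop_empty_hits(hits, dotted_path):
    """Keep only hits whose value at dotted_path is non-empty (not None/[]/{})."""
    return [hit for hit in hits if get_path(hit, dotted_path)]


# Hits as they look after the jmespath filter in the MyGene example.
hits = [
    {"_id": "10000", "clingen": {"clinical_validity": [{"mondo": "MONDO:0100283"}]}},
    {"_id": "5296", "clingen": {"clinical_validity": None}},
]

kept = drop_empty_hits(hits, "clingen.clinical_validity")
print([hit["_id"] for hit in kept])
```

Note that in this sketch gene 5296 is dropped even though its original object matched the query, so this kind of cleanup would only be safe once the object-vs-array behavior is sorted out.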
Tentatively labeling this a bug, but it may be an inherent limitation.
This query produces this result:

![image](https://user-images.githubusercontent.com/2635409/136506894-5209102b-6433-409a-82a5-fabbe631b364.png)

But when I simply flip the `subject` and `object`, the result has more edge provenance.

Is there some inherent limitation in the smartAPI annotation on why this asymmetry has to exist?