biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://api.bte.ncats.io
Apache License 2.0

for entity-based record structures (BioThings APIs), "reverse" operations cannot retrieve the same information as "forward" operations #316

Open andrewsu opened 2 years ago

andrewsu commented 2 years ago

Tentatively labeling this a bug, but it may be an inherent limitation.

This query

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n2"
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "NCBIGene:2475"
                    ],
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "ids": [
                        "MONDO:0003406"
                    ],
                    "categories": [
                        "biolink:Disease",
                        "biolink:PhenotypicFeature"
                    ]
                }
            }
        }
    }
}

produces this result:

[image]

But when I simply flip the subject and object, the result has more edge provenance:

[image]

Is there some inherent limitation in the smartAPI annotation on why this asymmetry has to exist?

colleenXu commented 2 years ago

@andrewsu This is two parts:


An explanation of the second point:

For the core biothings APIs, the data is organized by entity so MyDisease.info is organized by Disease. When querying from Disease -> Gene, we can look up everything under that disease's disgenet.genes_related_to_disease section, which includes all of the information in the second screenshot.

However, when we want to query from Gene -> Disease, we need to match the gene ID AKA a specific record under the disgenet.genes_related_to_disease section. However, a query will retrieve everything under that section (not just the specific record that has that gene ID) because the data is structured by disease.

For example, POST this query starting with the Gene NDUFA1 (4694) to https://mydisease.info/v1/query?fields=disgenet.xrefs,disgenet.genes_related_to_disease:

{
    "q": "4694",
    "scopes": "disgenet.genes_related_to_disease.gene_id"
}

The response includes diseases where ONE of their objects matches the query, but it includes ALL of the genes related to those diseases rather than only the objects that have the matching gene...
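A minimal sketch of the client-side filtering this would require, assuming a hit shaped like the MyDisease response described above. Only the field names `disgenet.genes_related_to_disease` and `gene_id` come from the query; the sample hit, its `_id`, the `score` values, and the helper name `keep_matching_genes` are hypothetical.

```python
from copy import deepcopy


def keep_matching_genes(hit, gene_id):
    """Return a copy of `hit` keeping only the sub-records for `gene_id`."""
    filtered = deepcopy(hit)
    # Assumes the section is a list of objects, as in the sample below.
    records = filtered["disgenet"]["genes_related_to_disease"]
    filtered["disgenet"]["genes_related_to_disease"] = [
        r for r in records if str(r.get("gene_id")) == str(gene_id)
    ]
    return filtered


# Hypothetical hit: one disease document that lists several genes.
sample_hit = {
    "_id": "MONDO:0000001",
    "disgenet": {
        "genes_related_to_disease": [
            {"gene_id": 4694, "score": 0.3},
            {"gene_id": 2475, "score": 0.1},
        ]
    },
}

filtered_hit = keep_matching_genes(sample_hit, "4694")
print(filtered_hit["disgenet"]["genes_related_to_disease"])
```

The original hit is left untouched, so the unfiltered document stays available if other sub-records are needed later.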

I hit a similar problem when trying to make more specific queries to map to more specific biolink predicates (like marker/mechanism under MyDisease's CTD Disease-Chemical information). I describe another example in the notes here. Because I get all the objects under the disease back rather than the matching objects only, I cannot make more specific queries...

colleenXu commented 2 years ago

Returning to this: this is an inherent limitation from how these records are structured (and indexed and retrieved - the querying process). ~Going to propose closing this unless we plan to address it...~

After discussion with Andrew 12/6, we decided to keep this open as a non-critical thing....to discuss + maybe work on when there is time...

tokebe commented 2 years ago

If this issue can be addressed through api_response_transform or elsewhere in records handling, we might now be in a better position to address this?

colleenXu commented 2 years ago

@tokebe The last time I talked about it with Andrew, it seemed kinda hard...

I think this is a limitation imposed by the document-structure / biothings querying ability itself.

colleenXu commented 2 years ago

This isn't an issue for "association-based" APIs, AKA where the structure is "one document per association" and all the info on the association is kept in a separate part of the document from the entity IDs.

As soon as a document has parts (like multiple associations in 1 document, each document represents 1 of the entity IDs)....this problem happens.
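The difference between the two document structures can be sketched with toy data. Everything here is hypothetical (document shapes, field names, IDs other than the gene IDs mentioned in this thread, and scores); it only illustrates why a match on a nested association returns the whole entity document.

```python
# Hypothetical association-based documents: one document per gene-disease pair.
association_docs = [
    {"gene": "NCBIGene:4694", "disease": "MONDO:0000001", "score": 0.3},
    {"gene": "NCBIGene:2475", "disease": "MONDO:0000001", "score": 0.1},
]

# Hypothetical entity-based document: one disease holding all of its associations.
entity_docs = [
    {
        "disease": "MONDO:0000001",
        "genes": [
            {"gene": "NCBIGene:4694", "score": 0.3},
            {"gene": "NCBIGene:2475", "score": 0.1},
        ],
    }
]


def find_by_gene_association(docs, gene):
    """Matching documents are exactly the matching associations."""
    return [d for d in docs if d["gene"] == gene]


def find_by_gene_entity(docs, gene):
    """A document matches if any nested association matches, and comes back whole."""
    return [d for d in docs if any(g["gene"] == gene for g in d["genes"])]


assoc_hits = find_by_gene_association(association_docs, "NCBIGene:4694")
entity_hits = find_by_gene_entity(entity_docs, "NCBIGene:4694")

# One association comes back in the first case; in the second, the single
# matching disease document still carries both nested gene associations.
print(len(assoc_hits), len(entity_hits[0]["genes"]))
```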

tokebe commented 2 years ago

Hmm...this seems like it should be possible with post-processing in the transformer, but I agree that this would have to be basically on a per-API basis. We'd have to write new transformers for this, so it makes sense this should remain non-critical until we have more bandwidth.

ericz1803 commented 1 year ago

@tokebe Can you explain how it could be done with post-processing in the transformer? I was looking into this issue a bit and it seems like when querying from Gene->Disease, the disgenet score/information is missing so there would need to be other queries done to retrieve this information again.

Also, would it be possible/practical to have something that says that mydisease should always be queried starting from Disease?

tokebe commented 1 year ago

I was under the impression that the issue is that querying Gene->Disease returns the whole document, and that we currently don't have the logic to pull the disgenet score/information out of it? This would be in the untransformedHits prior to going through the transformer.

If this isn't the case, then yes, we'd have to come up with some other method of retrieving the additional information. I'm not sure exactly how practical it might be to specifically query mydisease Disease-first always, though it might be relatively doable with a custom query builder. This would still require a custom transformer, however, and some additional logic to ensure records are created in the correct direction.

The preference would definitely be to post-process untransformedHits in a new transformer over custom querying logic, if possible.

ericz1803 commented 1 year ago

I did some more investigation and it doesn't grab the disgenet score/information at all when querying from Gene->Disease (the params pulled from the x-bte are completely different). I think what @colleenXu is saying above is that this is a limitation of how the data is structured. So if we were to take the post-processing route, we would have to make a whole other query to retrieve the disgenet score/info document, process that, then reincorporate it into the results.

Below are the query configs and the resulting unTransformedHits:

[Screenshots: query configs and resulting unTransformedHits]

tokebe commented 1 year ago

I suppose this makes the query-direction route more viable -- we'd need a separate query builder for mydisease that checks the subject/object semantic type and queries in reverse appropriately. It would have to somehow tag this such that the record is constructed in reverse of the query where appropriate as well.

Perhaps a reverseAfterQuery value that can be attached to the query_info object, which is passed to the transformer to instruct it to construct the record in reverse (anywhere else, such as reversing post-record-built, would cause issues with directionality down the line)
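A rough sketch of how such a flag might be consumed, written in Python for illustration only (not BTE's actual transformer code). `reverseAfterQuery` is the name proposed above; the function `build_record`, its arguments, and the record shape are hypothetical.

```python
def build_record(query_info, queried_id, hit_id, attributes):
    """Build a subject/object record, flipping direction when flagged.

    `queried_id` is the ID sent to the API; `hit_id` is the ID found in the response.
    """
    if query_info.get("reverseAfterQuery", False):
        # The API was queried from the object side, so the returned ID is the subject.
        subject, obj = hit_id, queried_id
    else:
        subject, obj = queried_id, hit_id
    return {"subject": subject, "object": obj, "attributes": attributes}


# Gene -> Disease was requested, but MyDisease is queried Disease-first.
forward = build_record({}, "MONDO:0003406", "NCBIGene:2475", {})
reverse = build_record(
    {"reverseAfterQuery": True}, "MONDO:0003406", "NCBIGene:2475", {}
)
print(forward["subject"], reverse["subject"])
```

Flipping at record-construction time keeps every later step working with a record that already has the direction the query graph asked for.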

colleenXu commented 1 year ago

Err....I was out when this convo started but perhaps some more explanation / my perspective can help.

I am in agreement with Jackson's points here, and that working with the unTransformedHits is better.

But to develop that code, one will have to mutate the smartapi specs or work with a custom version of the smartapi yaml where the fields are specified differently (to retrieve all the info available in forward querying, during a reverse query).

Notice the query I give in my original post. This query doesn't have the same "fields" specified as the query in the x-bte annotation right now, because we don't have the features to correctly process it (it would just be extra data to send over the internet / ignore while processing).

colleenXu commented 1 year ago

Noting a related old discussion (internal lab Slack link): besides the "reverse" issue here, there's an issue of not being able to get a subset of the response. This is a problem when we want to treat those subsets differently (ex: assigning different biolink predicates or edge-attributes for the TRAPI response).

Also pasted below:

colleenxu Oct 14 2021 is there a way to query and only get the part of the document that matches back? For example, I can query https://mydisease.info/v1/query?q=disgenet.variants_related_to_disease.source:CLINVAR&fields=disgenet.variants_related_to_disease,disgenet.xrefs but it'll return ALL the variants for the disease when maybe only 1-2 of those variants actually match source:CLINVAR...

Jerry Oct 15 2021 colleenxu not supported through biothings but if you can directly query es, you can use https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html

colleenxu Oct 15 2021 I don't think I'll do direct es...but I don't understand how highlighting would do what I specified above. This sounds a bit closer to what I wanted to do...https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html#post-filter ....

Jerry Oct 15 2021 Highlighting shows where the matches are, you do have to further process the result to use it. Post filter applies to the case when you use both search and aggregation, it's not what you're looking for. :) (edited)

colleenxu Oct 15 2021 I see now, further processing would be needed to "filter out" stuff that didn't have the highlighting...

colleenXu commented 9 months ago

Not clear if post-processing improvements (JQ-related, using biothings apis query abilities) will help overcome the issues here. Linking to #656, #489 and #521

rjawesome commented 9 months ago

JQ could be able to help. For this example, the following wrap filter could be used: .disgenet.variants_related_to_disease |= list_filter_any(["source:CLINVAR"]) (since there is only one filter being used list_filter_all and list_filter_any would do the same thing)

colleenXu commented 8 months ago

Noting that we previously decided list_filter could not be used to address the "reverses" issue: see the internal Slack discussion starting here. It's hard to paste the whole convo here, but I may do it later...

colleenXu commented 3 months ago

Update

We still have issues with not being able to retrieve all the information on the association in "reverse" direction.

I was able to get MyChem aeolus count info to show in the reverse direction, by doing a non-batch query and using jmespath (only show the part of the json object that matches the starting ID).

I was also able to get MyChem Chembl treats reference info to show in the reverse direction (see commit).


But I wasn't able to use the same method to get the MyChem chembl drug-mechanism clinicaltrial info to show (drugMechChemblEnsembl-rev, drugMechChemblUniprot-rev):

POST query version

```
curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=chembl.molecule_chembl_id,chembl.drug_mechanisms&jmespath=chembl.drug_mechanisms.target_components|[?uniprot=='Q16602']' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["Q16602"],
    "scopes": ["chembl.drug_mechanisms.target_components.uniprot"]
}'
```

And MyVariant's civic-geneDisease-rev has the same problem (example GET query with 500 error), but including all the info for the variant (rather than the variant-disease pair) would probably be more of a problem:

POST query where I get the error

```
curl --location --globoff 'https://myvariant.info/v1/query?size=1000&fields=civic.entrez_id,civic.evidence_items&jmespath=civic.evidence_items.disease|[?doid=='DOID:9256']' \
--header 'Content-Type: application/json' \
--data '{
    "q": "DOID:9256",
    "scopes": "civic.evidence_items.disease.doid"
}'
```

colleenXu commented 3 months ago

Here's a list of the entity-based BioThings APIs (affected by the reverses issue):

colleenXu commented 3 months ago

These are the reverse operations where the forward direction has publication info that would be nice to retrieve. I organized by what seems doable now with jmespath (related to #733?)

colleenXu commented 3 months ago

I found a different jmespath issue with MyGene geneToDisease while working on #803

If I do this query, I get genes that match the disease, but I want to only keep the `clingen.clinical_validity` objects that have the matching disease.

Query:

```
curl --location 'https://mygene.info/v3/query?size=1000&fields=entrezgene%2Cclingen' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["MONDO:0100283"],
    "scopes": "clingen.clinical_validity.mondo"
}'
```

Example hits:

* first hit has 2 clingen.clinical_validity objects, where 1 matches the disease I queried.
* VS the second hit has 1 object

```
{
    "query": "MONDO:0100283",
    "_id": "10000",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": [
            {
                "classification": "definitive",
                "classification_date": "2021-07-29T21:34:39.431Z",
                "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0100283",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                "sop": "SOP7"
            },
            {
                "classification": "limited",
                "classification_date": "2021-10-26T15:00:30.155Z",
                "disease_label": "microcephaly",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0001149",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_6e3b524c-5d27-43d6-a0db-4f8f7cf1f872-2021-10-26T150030.155Z",
                "sop": "SOP8"
            }
        ]
    },
    "entrezgene": "10000"
},
{
    "query": "MONDO:0100283",
    "_id": "5296",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": {
            "classification": "definitive",
            "classification_date": "2021-07-29T21:36:16.452Z",
            "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
            "gcep": "Brain Malformations",
            "moi": "AD",
            "mondo": "MONDO:0100283",
            "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_fc9a451e-0e75-47d2-a090-a2409732c465-2021-07-29T213616.452Z",
            "sop": "SOP7"
        }
    },
    "entrezgene": "5296"
},
```

But when I add jmespath, the hits that had one clinical_validity object (with the matching disease) become null.

Query:

```
curl --location --globoff 'https://mygene.info/v3/query?size=1000&fields=entrezgene,clingen&jmespath=clingen.clinical_validity|[?mondo=='MONDO:0100283']' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["MONDO:0100283"],
    "scopes": "clingen.clinical_validity.mondo"
}'
```

Those same example hits:

* first hit looks how I expect: there's now 1 clinical_validity object that matches the disease queried (vs 2 before)
* VS the second hit now has `null`. But the clinical_validity object it had before matched the disease queried...

```
{
    "query": "MONDO:0100283",
    "_id": "10000",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": [
            {
                "classification": "definitive",
                "classification_date": "2021-07-29T21:34:39.431Z",
                "disease_label": "overgrowth syndrome and/or cerebral malformations due to abnormalities in MTOR pathway genes",
                "gcep": "Brain Malformations",
                "moi": "AD",
                "mondo": "MONDO:0100283",
                "online_report": "https://search.clinicalgenome.org/kb/gene-validity/CGGV:assertion_52b1df18-387f-4c38-a655-682e4d2eb378-2021-07-29T213439.431Z",
                "sop": "SOP7"
            }
        ]
    },
    "entrezgene": "10000"
},
{
    "query": "MONDO:0100283",
    "_id": "5296",
    "_score": 10.205105,
    "clingen": {
        "_license": "https://www.clinicalgenome.org/docs/terms-of-use/",
        "clinical_validity": null
    },
    "entrezgene": "5296"
},
```

I suspect jmespath is having an issue with the array (multiple clinical_validity objects) vs object (1 clinical_validity object) in the original document...
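That suspicion would be consistent with how JMESPath filter expressions (`[?...]`) are specified: as far as the spec goes, they are only defined for arrays and evaluate to `null` on any other type. A client-side workaround could be sketched as follows, normalizing the field to a list before filtering. The helper names `as_list` and `filter_clinical_validity` are hypothetical, and the sample hits are trimmed versions of the two hits above.

```python
def as_list(value):
    """Wrap a single object in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def filter_clinical_validity(hit, mondo_id):
    """Keep only the clinical_validity objects whose `mondo` matches."""
    records = as_list(hit.get("clingen", {}).get("clinical_validity"))
    return [r for r in records if r.get("mondo") == mondo_id]


# Trimmed versions of the two hits above: one list-valued, one object-valued.
hit_list = {
    "_id": "10000",
    "clingen": {
        "clinical_validity": [
            {"classification": "definitive", "mondo": "MONDO:0100283"},
            {"classification": "limited", "mondo": "MONDO:0001149"},
        ]
    },
}
hit_object = {
    "_id": "5296",
    "clingen": {
        "clinical_validity": {
            "classification": "definitive",
            "mondo": "MONDO:0100283",
        }
    },
}

from_list = filter_clinical_validity(hit_list, "MONDO:0100283")
from_object = filter_clinical_validity(hit_object, "MONDO:0100283")
print(len(from_list), len(from_object))
```

With the normalization step, both hits yield exactly one matching object, regardless of whether the source document stored the field as an array or as a single object.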

colleenXu commented 3 months ago

Made issues for the jmespath stuff I'm seeing:

colleenXu commented 1 month ago

Potential breakthrough: using a new parameter jmespath_exclude_empty: true to remove hits that don't fit multiple criteria. Don't know if this is only live on MyChem or on all BioThings yet. See example https://github.com/biothings/biothings_explorer/issues/727#issuecomment-2046058611