PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License
6 stars 5 forks source link

A query bug that is probably related to ID mapping #296

Closed ozgunbabur closed 6 years ago

ozgunbabur commented 6 years ago

I am getting inconsistent results from commonstream query. I want to search the common downstream of CDK4 and CDKN2A, but the below query cannot find anything.

http://www.pathwaycommons.org/pc2/graph?source=CDK4&source=CDKN2A&kind=commonstream&direction=DOWNSTREAM

If I provide IDs of specific Protein objects of CDK4 and CDKN2A though, I get results.

http://www.pathwaycommons.org/pc2/graph?source=http://pathwaycommons.org/pc2/Protein_0278d36ac4ae5415130e34a7f06e2072&source=http://pathwaycommons.org/pc2/Protein_e9a63188fda151e43d6ca8e989161dc6&kind=commonstream&direction=DOWNSTREAM

In theory, the second query should return a subset of the first one. There is probably an ID mapping problem from gene symbols to these Proteins. Interestingly, the neighborhood query seems working fine, so the bug may be specific to commonstream.

IgorRodchenkov commented 6 years ago

Good catch, but -

@ozgunbabur I think, it's not an id-mapping problem (mapping is the same for all graph queries) but rather - the more seeds the less chance of finding a common stream sub-model.

How about we disallow using IDs - accept only URIs - for all graph query types but neighborhood (or at least let's do this for commonstream)?

ozgunbabur commented 6 years ago

Good point. I was thinking there are only two seeds in the query because commonstream actually handles "multiple Proteins per ProteinReference" problem and treats the Proteins that belong to a ProteinReference as if one source node.

When you provide two ProteinReference to the runCommonStream method of QueryExecuter of Paxtools, it generates a set that contains two sets of Proteins, and the query looks for the common stream that connect to at least one element from each set. What I mean is, if the query is executed with two source ProteinReference objects, then it works fine.

I don't know the exact source set used when we give CDK4 and CDKN2A as source symbols. Do you think it can be more than two ProteinReference objects?

IgorRodchenkov commented 6 years ago

Yes, Ozgun, can be more than two PRs, and in fact DnaReference, RnaReference, etc. too - also due to there are normalized (UniProt) PRs and not-normalized (due to ambiguous id-mappind) protein references, from different datasources.

IgorRodchenkov commented 6 years ago

By the way, don't forget that PC9 graph queries, if the source or target are not full URIs, finds PhysicalEntity and Gene object URIs first (using its special full-text index and id-mapping) and uses them as seeds for the paxtools graph query executor. See, we do not search for ProteinReferences (or ERs), because this way we'd miss some (perhaps much) data, where PE or Gene objects have no entity reference but just xrefs.

On Mon, May 7, 2018 at 3:56 PM, Igo R rodche@gmail.com wrote:

Yes, Ozgun, can be more than two PRs, and in fact DnaReference, RnaReference, etc. too - also due to there are normalized (UniProt) PRs and not-normalized (due to ambiguous id-mappind) protein references, from different datasources.

ozgunbabur commented 6 years ago

Sorry I could not respond earlier. I think there is a relatively easy fix to this problem. Current graph query architecture assumes that a user-provided query term will bring only one ER. But this is not the case, hence the buggy results.

Here is my suggestion: I will change the QueryExecuter methods in Paxtools to accept Set of Sets (currently it is just one set). When I do this, the paths-between query will look for paths between the given sets, not only between the given elements. That will similarly modify common stream query.

When the Paxtools part is done, cpath2 can use it with this modification: each user query term, when they are not URIs but keywords, generates a set of elements. So when user provides two terms, cpath2 constructs a set with two sets inside, and each inner set contains all the elements related to its corresponding query term. For consistency, URIs can generate a set with single element.

This should fix the faulty queries. Neighborhood and paths-from-to queries are not affected from this design bug. But paths-between and common-stream need this fix. What do you think Igor?

IgorRodchenkov commented 6 years ago

Sounds good.

If this change won't break existing Paxtools QueryExecuter public API, but simply adds new methods, then go ahead. I guess, would then simply replace this line (and similar one in commonstream) with a loop over sources, creating a set of URIs for each one, correct?

Otherwise, we have to bump Paxtools version to 6.0.0-SNAPSHOT, release it to OSSRH, and probably even wait with using this untill PC11...

ozgunbabur commented 6 years ago

OK, I added new public methods to the QueryExecuter.

runNeighborhoodMultiSet runPathsBetweenMultiSet runPathsFromToMultiSet runCommonStreamWithPOIMultiSet

Igor, just as you said above, changing Set<BioPAXElement> with Set<Set<BioPAXElement> and using new methods instead of the originals should fix the problem.

IgorRodchenkov commented 6 years ago

@ozgunbabur you said "Neighborhood and paths-from-to queries are not affected..."; so why to add new methods for those two?

ozgunbabur commented 6 years ago

Only for consistency. I thought you would not like the idea of generating Set<BioPAXElement> for one type of queries and Set<Set<BioPAXElement>> for other types. You can totally leave neighborhood and paths-from-to unchanged, but it is also possible to consistently use Set<Set<BioPAXElement>> for all query types.

IgorRodchenkov commented 6 years ago

Alright then. One more problem - what if query source (or target) contains several IDs that map to the same entity reference (i.e., - to the same primary ID and thus same set of entities in PC model)? I suspect this fix won't work then. Looks, we have to cluster query IDs first, and still, use these new methods.. Aren't we overthinking here?...

ozgunbabur commented 6 years ago

You have a point there. If the user provides keywords that map to the same ER, then the paths-between query will bring paths between PEs of that ER. Maybe we can put out a warning when the user does that, but that is not practical with the web service API. Well, at least, this time it is user error and not an inherent bug.

IgorRodchenkov commented 6 years ago

Fair enough.