biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Genome Interval Queries Responses don't seem to be consistent #102

Closed tomkp75 closed 3 years ago

tomkp75 commented 3 years ago

Hello,

Thanks a lot for this great API.

I figured that performing the following request would return different results upon refreshing the page: https://mygene.info/v3/query/?q=chr1:11869-14409&species=human&limit=1

In this particular case it switches between WASH7P and DDX11L1.

Thanks,

Tom

namespacestd0 commented 3 years ago

Thank you for contacting us. That's a very acute observation, and most likely a result of the internal workings of the distributed database system we use. Each request may reach a different server replica, which computes the score of a query on its own using server-specific statistics, and in rare circumstances this can result in inconsistent scoring. The query endpoint's normal presentation is intended more or less for data exploration. Once you have decided on a specific query, you can use the fetch all feature to lock onto a frozen view of the data for consistent retrieval (with pagination).
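For readers who want to try the fetch all approach, here is a minimal sketch of paging through one frozen view. It assumes the response carries a `_scroll_id` field and that follow-up pages are requested with a `scroll_id` parameter; the function names and the injectable `get_json` argument are illustrative, not part of any official client. As discussed later in this thread, this only keeps results consistent within one scroll session, not across separate fetch all calls.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://mygene.info/v3/query"


def http_get_json(url):
    """Fetch a URL and decode the JSON body (performs a real network call)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def fetch_all_hits(query, species="human", get_json=http_get_json):
    """Yield every hit for `query` by following the scroll cursor.

    Assumption: the first response includes `_scroll_id`, and the server
    stops returning a non-empty `hits` list once the scroll is exhausted.
    """
    params = {"q": query, "species": species, "fetch_all": "TRUE"}
    page = get_json(BASE_URL + "?" + urllib.parse.urlencode(params))
    while page.get("hits"):
        yield from page["hits"]
        scroll_id = page.get("_scroll_id")
        if not scroll_id:
            break  # no cursor to follow, so stop after this page
        next_params = urllib.parse.urlencode({"scroll_id": scroll_id})
        page = get_json(BASE_URL + "?" + next_params)
```

Usage would look like `for hit in fetch_all_hits("chr1:11869-14409"): ...`. Passing a stub for `get_json` makes the paging logic testable without network access.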

tomkp75 commented 3 years ago

Thanks @namespacestd0. I believe your suggestion is to use the parameter fetch_all=TRUE, is that correct? If so, the issue remains. For example: https://mygene.info/v3/query/?q=chr1:11869-14409&species=human&fetch_all=TRUE

namespacestd0 commented 3 years ago

Do you mean that the results across fetch all calls are inconsistent, or that the scroll ID provided by a fetch all call cannot consistently retrieve information through pagination?

tomkp75 commented 3 years ago

Results across the fetch all calls are inconsistent. I'm not using pagination in this example.

namespacestd0 commented 3 years ago

That's expected. What I meant was that if scoring prevented you from getting all the results through pagination, you can use fetch all to lock onto one version and page through it. We'll leave this issue open and evaluate the cost of ensuring consistent scoring to determine whether we can introduce this additional guarantee in the future.

tomkp75 commented 3 years ago

I understand now. That wouldn't work for me, as I'm implementing an automated process and the top match could be wrong in the example I provided. But I also see you mentioned that "the query endpoint's normal presentation is more or less for data exploration."

namespacestd0 commented 3 years ago

I see. We'll continue to explore the possibility of providing stable scoring. Meanwhile, another option is to add a custom sort parameter: https://mygene.info/v3/query/?q=chr1:11869-14409&species=human&sort=entrezgene. I think this could make sense for interval queries, but I understand it may or may not be practical in your automated process, depending on other factors.

newgene commented 3 years ago

Or, if you'd like to sort genes by their genomic positions:

https://mygene.info/v3/query/?q=chr1:11869-14409&species=human&sort=genomic_pos.start
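For an automated process, the sort suggestion could be wired up roughly as follows. This is a sketch only: `interval_query_url`, `start_of`, and `pick_first` are hypothetical helper names, the `fields` parameter is assumed to control which fields come back in each hit, and the shape of `genomic_pos` (a dict, or a list of dicts when a gene has several positions) is also an assumption.

```python
import urllib.parse

BASE_URL = "https://mygene.info/v3/query/"


def interval_query_url(interval, species="human", sort=None, fields=None, limit=None):
    """Build a query URL; `sort` pins the ordering instead of relying on score."""
    params = {"q": interval, "species": species}
    if sort is not None:
        params["sort"] = sort
    if fields is not None:
        params["fields"] = fields
    if limit is not None:
        params["limit"] = limit
    # Keep ':' and ',' unescaped so the URL reads like the ones in this thread.
    return BASE_URL + "?" + urllib.parse.urlencode(params, safe=":,")


def start_of(hit):
    """Smallest start coordinate of a hit (genomic_pos may be a dict or a list)."""
    pos = hit["genomic_pos"]
    positions = pos if isinstance(pos, list) else [pos]
    return min(p["start"] for p in positions)


def pick_first(hits):
    """Client-side fallback: order by start position, break ties on _id."""
    return min(hits, key=lambda h: (start_of(h), str(h["_id"])))
```

Calling `interval_query_url("chr1:11869-14409", sort="genomic_pos.start", limit=1)` reproduces the URL above with `limit=1` appended. `pick_first` is a deterministic tie-break you can apply yourself to the hits, provided the request asked for `genomic_pos` in `fields`.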

namespacestd0 commented 3 years ago

That probably makes more sense, considering entrezgene is a string field and so not suitable for sorting.
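To illustrate why a string field is awkward as a sort key: strings compare character by character, so numeric-looking IDs do not come out in numeric order. The values below are made up for illustration and are not real gene IDs.

```python
# Illustrative ID-like values stored as strings (not real gene IDs).
ids_as_strings = ["9", "10", "100", "25"]

# Plain string sort compares character by character.
lexicographic = sorted(ids_as_strings)

# Converting to int for the sort key gives the numeric order.
numeric = sorted(ids_as_strings, key=int)

print(lexicographic)  # ['10', '100', '25', '9']
print(numeric)        # ['9', '10', '25', '100']
```

A string sort is still deterministic, so it would stabilise the result order; it just will not match numeric order.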

namespacestd0 commented 3 years ago

The above solutions should be practical enough to address this issue.