ModelSEED / ModelSEED-UI

ModelSEED UI (beta)
MIT License

Problems in search #30

Closed mendessoares closed 5 years ago

mendessoares commented 8 years ago

Here are some issues I noticed when doing searches in the microbe genomes:

mmundy42 commented 8 years ago

Another example is a search for "mycobacterium tuberculosis". The results show 2299 items, which doesn't help you find the genomes for just M. tuberculosis.

[Screenshot: screen shot 2015-09-28 at 9 29 09 am]

nconrad commented 8 years ago

Thanks for reporting. The first issue is related to https://github.com/PATRIC3/patric3_website/issues/347. I am going to write some tests against the data API to hopefully get this resolved sooner rather than later.

For the second issue, I'm currently not sure how to do a proper search with spaces.

samseaver commented 8 years ago

I'd leave the space in the search string itself. From experience, people are generally trying to find one thing. If we want to allow people to search for multiple entities, we can use AND/OR operators or allow them to add search boxes in order to search multiple strings at the same time.

nconrad commented 8 years ago

Right. I agree. Once the backend handles special chars in some manner, I plan on adding special operators and/or an advanced section.

Just to be clear, by "I'm currently not sure how to do a proper search with spaces.", I meant that I don't know how to make the data API search for exact strings with spaces in them. None of the following work:

Encoded space: https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&or(eq(genome_name,*mycobacterium%20tuberculosis*),eq(genome_id,*mycobacterium%20tuberculosis*),eq(genus,*mycobacterium%20tuberculosis*),eq(taxon_id,*mycobacterium%20tuberculosis*),eq(contigs,*mycobacterium%20tuberculosis*))

Encoded "+": https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&or(eq(genome_name,*mycobacterium%2Btuberculosis*),eq(genome_id,*mycobacterium%2Btuberculosis*),eq(genus,*mycobacterium%2Btuberculosis*),eq(taxon_id,*mycobacterium%2Btuberculosis*),eq(contigs,*mycobacterium%2Btuberculosis*))

Using a "+": https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&or(eq(genome_name,*mycobacterium+tuberculosis*),eq(genome_id,*mycobacterium+tuberculosis*),eq(genus,*mycobacterium+tuberculosis*),eq(taxon_id,*mycobacterium+tuberculosis*),eq(contigs,*mycobacterium+tuberculosis*))

So... I'm out of ideas. :(
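For reference, the three attempts above differ only in how the space is written into the URL. The following is a minimal Python sketch, using only the standard library, of what each encoding produces. It does not call the PATRIC API, and it says nothing about how the server interprets each form.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

term = "mycobacterium tuberculosis"

# Percent-encoding writes the space as %20.
encoded_space = quote(term)

# Form-style encoding writes the space as a literal '+'.
plus_for_space = quote_plus(term)

# Percent-encoding a literal '+' gives %2B, which decodes back to '+',
# not to a space.
encoded_plus = quote("mycobacterium+tuberculosis")

print(encoded_space)                 # mycobacterium%20tuberculosis
print(plus_for_space)                # mycobacterium+tuberculosis
print(encoded_plus)                  # mycobacterium%2Btuberculosis
print(unquote(encoded_plus))         # mycobacterium+tuberculosis
print(unquote_plus(plus_for_space))  # mycobacterium tuberculosis
```

Whether the data API treats an unencoded "+" as a space or as an operator is a server-side decision that this sketch cannot show.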

dmachi commented 8 years ago

Hmm, spaces seem to work for me encoded either as %20 or with an unencoded +.

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&eq(genome_name,mycobacterium+tuberculosis)

{"responseHeader":{"status":0,"QTime":1,"params":{"q":"genome_name:mycobacterium+tuberculosis","fl":"genome_name,genome_id,genus,taxon_id,contigs","sort":"genome_name asc","fq":"public:true","rows":"25","wt":"json"}},"response":{"numFound":1946,"start":0,"docs":[{"contigs":16,"genome_id":"555461.3","genome_name":"Mycobacterium tuberculosis '98-R604 INH-RIF-EM'","taxon_id":555461,"genus":"Mycobacterium"},{"contigs":181,"genome_id":"515616.6","genome_name":"Mycobacterium tuberculosis 02_1987","taxon_id":515616,"genus":"Mycobacterium"},{"contigs":58,"genome_id":"515616.7","genome_name":"Mycobacterium tuberculosis 02_1987 [PRJNA238921]","taxon_id":515616,"genus":"Mycobacterium"},{"taxon_id":1773,"contigs":102,"genome_name":"Mycobacterium tuberculosis 10107-01","genome_id":"1773.366","genus":"Mycobacterium"},{"contigs":100,"genome_id":"1438833.3","genome_name":"Mycobacterium tuberculosis 1010SM","taxon_id":1438833,"genus":"Mycobacterium"},{"contigs":267,"genome_id":"1279039.3","genome_name":"Mycobacterium tuberculosis 1034","taxon_id":1279039,"genus":"Mycobacterium"},{"taxon_id":1773,"contigs":147,"genome_name":"Mycobacterium tuberculosis 10507-09","genome_id":"1773.266","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":157,"genome_name":"Mycobacterium tuberculosis 10529-05","genome_id":"1773.365","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":158,"genome_name":"Mycobacterium tuberculosis 10530-05","genome_id":"1773.287","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":178,"genome_name":"Mycobacterium tuberculosis 10721-03","genome_id":"1773.301","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":185,"genome_name":"Mycobacterium tuberculosis 10734-04","genome_id":"1773.275","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":168,"genome_name":"Mycobacterium tuberculosis 10735-04","genome_id":"1773.331","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":158,"genome_name":"Mycobacterium tuberculosis 
10737-02","genome_id":"1773.322","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":140,"genome_name":"Mycobacterium tuberculosis 10812-03","genome_id":"1773.307","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":156,"genome_name":"Mycobacterium tuberculosis 10836-09","genome_id":"1773.314","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":158,"genome_name":"Mycobacterium tuberculosis 11251-09","genome_id":"1773.350","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":181,"genome_name":"Mycobacterium tuberculosis 11458-02","genome_id":"1773.283","genus":"Mycobacterium"},{"contigs":87,"genome_id":"1438835.3","genome_name":"Mycobacterium tuberculosis 1173CS","taxon_id":1438835,"genus":"Mycobacterium"},{"taxon_id":1773,"contigs":166,"genome_name":"Mycobacterium tuberculosis 1232-02","genome_id":"1773.323","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":117,"genome_name":"Mycobacterium tuberculosis 12448-03","genome_id":"1773.296","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":171,"genome_name":"Mycobacterium tuberculosis 12615-95","genome_id":"1773.312","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":163,"genome_name":"Mycobacterium tuberculosis 1312-05","genome_id":"1773.344","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":176,"genome_name":"Mycobacterium tuberculosis 1314-04","genome_id":"1773.268","genus":"Mycobacterium"},{"taxon_id":1773,"contigs":161,"genome_name":"Mycobacterium tuberculosis 1339-07","genome_id":"1773.320","genus":"Mycobacterium"},{"contigs":84,"genome_id":"1438837.3","genome_name":"Mycobacterium tuberculosis 1429BH","taxon_id":1438837,"genus":"Mycobacterium"}]}}
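To compare result counts between query variants, it can help to pull out just the Solr query and the hit count from a response like the one above. This is a rough Python sketch that parses a trimmed copy of that response (same structure, only the first document kept). The function name `summarize` is made up for illustration.

```python
import json

# Trimmed copy of the response above: only the first document is kept.
raw = r'''
{"responseHeader": {"status": 0, "QTime": 1,
  "params": {"q": "genome_name:mycobacterium+tuberculosis",
             "fl": "genome_name,genome_id,genus,taxon_id,contigs",
             "sort": "genome_name asc", "fq": "public:true",
             "rows": "25", "wt": "json"}},
 "response": {"numFound": 1946, "start": 0,
  "docs": [{"contigs": 16, "genome_id": "555461.3",
            "genome_name": "Mycobacterium tuberculosis '98-R604 INH-RIF-EM'",
            "taxon_id": 555461, "genus": "Mycobacterium"}]}}
'''


def summarize(raw_json):
    """Return the Solr query, total hit count, and genome names from a response."""
    payload = json.loads(raw_json)
    params = payload["responseHeader"]["params"]
    response = payload["response"]
    return {
        "solr_query": params["q"],
        "num_found": response["numFound"],
        "names": [doc["genome_name"] for doc in response["docs"]],
    }


summary = summarize(raw)
print(summary["solr_query"])
print(summary["num_found"], "matches in total")
```

The `q` value is the query Solr actually received, which is the most direct way to see how the data API translated a given URL.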

nconrad commented 8 years ago

Ah, ok, yeah that does work. How would you do a union of searches on particular fields?

I.e.,

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&or(eq(genome_name,mycobacterium%20tuberculosis),eq(genome_id,mycobacterium%20tuberculosis),eq(genus,mycobacterium%20tuberculosis),eq(taxon_id,mycobacterium%20tuberculosis),eq(contigs,mycobacterium%20tuberculosis))

dmachi commented 8 years ago

You were doing it correctly; however, I think the issue is that a couple of those fields (taxon_id, contigs) are integers and you can't do a string query against them.

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&or(eq(genome_name,mycobacterium+tuberculosis),eq(genome_id,mycobacterium+tuberculosis),eq(genus,mycobacterium+tuberculosis))
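One way a client could build that `or(...)` clause is to filter the displayed columns by type before generating the `eq(...)` terms. The sketch below assumes a hand-written field-type map; `FIELD_TYPES` and `build_or_query` are hypothetical names, and the real types are defined by the Solr core, not by this code.

```python
# Hypothetical field-type map for illustration only. The string/int split
# follows the comment above (taxon_id and contigs are integers).
FIELD_TYPES = {
    "genome_name": "string",
    "genome_id": "string",
    "genus": "string",
    "taxon_id": "int",
    "contigs": "int",
}


def build_or_query(term, field_types=FIELD_TYPES):
    """Build an or(eq(...),...) clause over string fields only."""
    # Spaces are written as '+', as in the URL above.
    encoded = term.replace(" ", "+")
    clauses = [
        f"eq({field},{encoded})"
        for field, kind in field_types.items()
        if kind == "string"
    ]
    return "or(" + ",".join(clauses) + ")"


print(build_or_query("mycobacterium tuberculosis"))
```

For the term "mycobacterium tuberculosis" this yields the same three-field clause as the URL above, with `taxon_id` and `contigs` left out.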

nconrad commented 8 years ago

ah! :( My thought behind doing this was that the general table search would only work on columns displayed to the user.

nconrad commented 8 years ago

Ok, so I talked to @dmachi and he will see if there is a way to search against strings and ints at the same time. In the meantime, I'm going to deploy a temp "fix" for the second issue (the space issue) by searching only strings.

nconrad commented 8 years ago

@dmachi, I'm still having problems understanding the API with regards to substring matching. I think what we want (by default) is exact substring matching. In the above example we have this, which appears to work at first glance:

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&eq(genome_name,mycobacterium+tuberculosis)

Note: the result is actually different if you use a "%20" instead of "+": 2299 vs. 1946 results, and I'm not entirely sure what query is being made for either of them. You'll see that with "%20" you get "Mycobacterium africanum GM041182". My guess is that "taxon_lineage_names" is being searched in addition to "genome_name"?

Another issue is that the following doesn't work for substring matching the word "mycobacterium tuberculosis":

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&eq(genome_name,mycobacterium+tub)

I think you mentioned that quotes may work, but I'm unsure how?

Is it like this?

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&eq(genome_name,%27*mycobacterium%20tub*%27)

That would be fine, but it produces the same issue as above. Perhaps it is searching other fields?

nconrad commented 8 years ago

I tried many variants, like this... but I really don't know what I'm doing.

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(25)&sort(+genome_name)&select(genome_name,genome_id,genus,taxon_id,contigs)&eq(genome_name,"*mycobacterium%20tub*")

dmachi commented 8 years ago

It's not clear to me what you actually need to work though. The exact meaning of these comparisons depends on the underlying solr config for that core. I don't know that all types allow substring matching. There is also the text matching, which basically can match against any of the textual components of the record and is often what is needed.

dmachi commented 8 years ago

This seems to work fine for substring matching:

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(2500)&sort(+genome_name)&select(genome_name,genome_id)&eq(genome_name,mycobacterium%20african*)

This gets everything that starts with "mycobacterium african".

When you use a plus, it does an exact match for both substrings to ensure they are both in the result:

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(5000)&sort(+genome_name)&select(genome_name,genome_id)&eq(genome_name,mycobacterium+tuberculosis)

That is to say, the results include both "tuberculosis" and "mycobacterium" in the genome name.

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(5000)&sort(+genome_name)&select(genome_name,genome_id)&eq(genome_name,mycobacterium+tubercul)

This returns 0 results, because there is a match for "mycobacterium" but no exact substring match for "tubercul".

nconrad commented 8 years ago

ok, then any idea what is happening here?

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(5000)&sort(+genome_name)&select(genome_name,genome_id)&eq(genome_name,mycobacterium%20tub*)

This sort of substring matching appears to be an issue on the PATRIC site as well. I think that when you search for "mycobacterium tuberculosis" on the model reconstruct page, you shouldn't get results with Mycobacterium bovis*

nconrad commented 8 years ago

I want to do the things you describe :). But something else is happening :(

dmachi commented 8 years ago

All that said, there are things I can't understand in this still. That comes down to how solr is interpreting the queries (you can see the solr version of the query in the result content). We'll probably need to talk to Maulik or Harry and figure out how we need to query solr to get what you want, and then I can figure out how to do that (or add the ability if needed) from the data api.

nconrad commented 8 years ago

I want everything, lol. But you are right, I'll start a thread with Maulik and Harry and go from there.

dmachi commented 8 years ago

{"q":"genome_name:mycobacterium tub*","fl":"genome_name,genome_id","sort":"genome_name asc","fq":"public:true","rows":"5000","wt":"json"}

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(5000)&sort(+genome_name)&select(genome_name,genome_id)&and(eq(genome_name,mycobacterium),eq(genome_name,bovis))

On Oct 14, 2015, at 7:27 PM, nconrad wrote:

ok, then any idea what is happening here?

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(5000)&sort(+genome_name)&select(genome_name,genome_id)&eq(genome_name,mycobacterium%20tub*)

{"q":"genome_name:mycobacterium tub*","fl":"genome_name,genome_id","sort":"genome_name asc","fq":"public:true","rows":"5000","wt":"json"}

So that's what that query gets converted into. I guess that maybe it is returning records with "mycobacterium" somewhere in the genome_name OR with "tub*" in any text field.

This sort of substring matching appears to be an issue on the PATRIC site as well. I think that when you search for "mycobacterium tuberculosis" on the model reconstruct page, you shouldn't get results with Mycobacterium bovis*

https://www.patricbrc.org/api/genome/?http_accept=application/solr+json&limit(5000)&sort(+genome_name)&select(genome_name,genome_id)&eq(genome_name,%22mycobacterium%20tuberculosis%22)

This query with quotes ends up

{"q":"genome_name:\"mycobacterium tuberculosis\"","fl":"genome_name,genome_id","sort":"genome_name asc","fq":"public:true","rows":"5000","wt":"json"}

and returns anything with that exact full string in it.
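As a sketch of how a client might produce the quoted form, the Python below wraps the phrase in double quotes and percent-encodes it before placing it in an `eq(...)` clause. The helper name `phrase_eq` and the `BASE` constant are illustrative only; the base URL is copied from the query above.

```python
from urllib.parse import quote

# Base URL copied from the example query above.
BASE = (
    "https://www.patricbrc.org/api/genome/"
    "?http_accept=application/solr+json"
    "&limit(5000)&sort(+genome_name)&select(genome_name,genome_id)"
)


def phrase_eq(field, phrase):
    """Build eq(field,"phrase") with the quotes and spaces percent-encoded."""
    # The double quotes become %22 and the space becomes %20.
    quoted = quote(f'"{phrase}"')
    return f"eq({field},{quoted})"


url = BASE + "&" + phrase_eq("genome_name", "mycobacterium tuberculosis")
print(url)
```

This reproduces the URL shown above. Per the converted query, the quotes reach Solr as a phrase query on `genome_name`.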

nconrad commented 8 years ago

Update: thanks to @dmachi's work/help, half of this has been improved. The second half (special chars) is on me still.

Note: searches happen by splitting the query string into words and searching for the AND of those words. Ex: the query "foo ba" turns into a search for foo* AND ba*, matching "foo bar". However, this could also match other things, such as "foo cool bar". The search runs against the genome name. Users can also search by ID.
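The splitting behavior described in the note can be illustrated with a small Python sketch. This is not the UI's actual code: `to_prefix_query` and `matches` are hypothetical helpers, the `and(eq(...),...)` form is assumed by analogy with the URLs earlier in this thread, and `matches` only simulates the word-prefix logic locally.

```python
def to_prefix_query(query, field="genome_name"):
    """Split the query into words and AND together one prefix clause per word."""
    clauses = [f"eq({field},{word}*)" for word in query.split()]
    if len(clauses) == 1:
        return clauses[0]
    return "and(" + ",".join(clauses) + ")"


def matches(query, name):
    """Local simulation: every query word must prefix some word in the name."""
    tokens = name.lower().split()
    return all(
        any(token.startswith(word) for token in tokens)
        for word in query.lower().split()
    )


print(to_prefix_query("foo ba"))
print(matches("foo ba", "foo bar"))       # True
print(matches("foo ba", "foo cool bar"))  # True: word order and gaps are ignored
print(matches("foo ba", "food"))          # False: nothing starts with "ba"
```

The second `True` is the caveat from the note: ANDing prefixes does not require the words to be adjacent, so the result set is broader than an exact phrase match.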

samseaver commented 5 years ago

We've addressed a few special characters, so search is working as well as it could, barring any unseen issues.