Issues with generate-names.pl on large data file

cmdcolin commented 9 years ago

An issue was reported with a large (3.9 GB) GFF file: the generate-names.pl script apparently fails to index some of its features.

Thread http://gmod.827538.n3.nabble.com/Gmod-ajax-generate-names-pl-not-indexing-everything-td4050318.html

To test, you can use the maker2jbrowse pipeline

gunzip CcClean_annotated2.gff.gz
bin/prepare-refseqs.pl --gff CcClean_annotated2.gff
bin/maker2jbrowse CcClean_annotated2.gff
bin/generate-names.pl

This will automatically load the GFF as separate maker tracks and create the names store. Note that maker2jbrowse already creates a names store, but we can just re-run generate-names.pl for testing.

Then open the browser and observe that things like "chitinase" are not loaded.

Email for data

cmdcolin commented 9 years ago

I don't have any ideas for what could be causing this, but I was thinking that if the problem was for some reason related to hashing (current algorithm is crc32 if I remember correctly) then maybe using a new hashing algorithm might help? Example: a fast hash algorithm https://code.google.com/p/xxhash/

cmdcolin commented 9 years ago

Might want to consider a sort of new approach like full text search if people want to index longer feature descriptions

rdhayes commented 9 years ago

Having skimmed briefly and barely starting to look into this, Bio::JBrowse::HashStore does use Digest::Crc32. I see that Digest::CRC has a base64 option.

I don't have much experience with what these are actually doing mathematics-wise, but I could probably devote some time next week to testing this out. If it produces a working names index for higher "hash bits" settings than we get now, this might be a useful secondary/optional hashstore.

Richard D. Hayes, Ph.D. Joint Genome Institute / Lawrence Berkeley National Lab http://phytozome.jgi.doe.gov

cmdcolin commented 9 years ago

I sent you a link for the file in question! The GFF is about 3.9 GB and it has embedded fasta data but even the GFF data by itself is over 3GB.

Unfortunately, even specifying a higher --hashBits argument was failing (with the error "Error: names store not found" or similar, if I recall).

Feel free to try!

rdhayes commented 9 years ago

What was the highest hashBit argument that produced results (apparently incomplete)?

Who originally reported this? It would be great to have a small set of known queries that should be working, but were found to be missing, for testing purposes.
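
One way to assemble such a query list is to pull candidate names straight out of the GFF and then search for each one in the browser. The sketch below is only an illustration: it assumes GFF3-style attributes in column 9, and the choice of the Name, Alias and Note keys is an assumption, not necessarily the set of fields that generate-names.pl indexes. The function names are made up for this example.

```python
from urllib.parse import unquote

# Attribute keys treated as searchable names (an assumption for illustration).
NAME_KEYS = ("Name", "Alias", "Note")


def names_from_gff_line(line):
    """Return the searchable names found in column 9 of one GFF3 feature line."""
    if line.startswith("#") or not line.strip():
        return []
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return []
    names = []
    for pair in columns[8].split(";"):
        key, sep, value = pair.partition("=")
        if sep and key.strip() in NAME_KEYS:
            # GFF3 allows comma-separated, URL-escaped values.
            names.extend(unquote(v) for v in value.split(",") if v)
    return names


def find_names(lines, needle):
    """Collect the distinct names containing `needle` (case-insensitive)."""
    found = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        for name in names_from_gff_line(line):
            if needle.lower() in name.lower():
                found.add(name)
    return sorted(found)
```

Called as `find_names(open("CcClean_annotated2.gff"), "chitinase")`, this would give a list of names that should be findable in the search box once indexing works.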

cmdcolin commented 9 years ago

The original thread was here http://gmod.827538.n3.nabble.com/Gmod-ajax-generate-names-pl-not-indexing-everything-td4050318.html

I think the issue is mostly outlined there, and it should be pretty easy to reproduce with the parameters given in that thread.

thomasvangurp commented 9 years ago

I don't think we need a new hashing algorithm; the current one works just fine. The location of the relevant JSON file is determined by taking the crc32 value of the keyword being typed, e.g. for "feature" this would be 534213990. If hash bits is set to 12, then we take the modulo: 534213990 % 12 = 6. In hexadecimal notation this is "x06", so the entry will be in /names/6.json.
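
The lookup described above can be sketched in a few lines of Python. This is an illustration only, not the Bio::JBrowse::HashStore code: zlib.crc32 stands in for Digest::Crc32, the function names are made up, and the path layout is reduced to a single file name. One caveat: a setting called "hash bits" often means keeping the low N bits of the hash (a modulo of 2**N rather than of N), so the sketch shows both readings. Which one the store actually uses should be checked against its source.

```python
import zlib


def bucket_by_modulo(key, hash_bits):
    """Bucket as described in the comment above: crc32(key) % hash_bits."""
    return zlib.crc32(key.encode("utf-8")) % hash_bits


def bucket_by_mask(key, hash_bits):
    """Alternative reading (an assumption): keep the low `hash_bits` bits."""
    return zlib.crc32(key.encode("utf-8")) & ((1 << hash_bits) - 1)


def bucket_file(bucket):
    """Hypothetical file name: the bucket number in hex plus '.json'."""
    return f"{bucket:x}.json"


# Using the CRC value quoted above: the literal modulo gives bucket 6,
# while masking to 12 bits selects one of 4096 possible buckets.
crc = 534213990
print(crc % 12)          # 6
print(hex(crc & 0xFFF))  # 0x566
```

The difference matters for large inputs: with only 12 buckets every file grows very large, whereas 4096 buckets spread the same entries much more thinly.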

The problem is the number of entries you get for really long names. E.g., for a feature named "the long feature" the entries will be: "t", "th", "the", "the ", "the l", "the lo", ..., "the long feature".

At some point an entry will have only one partial match, e.g. {"the lo":{"prefix":["the long feature"],"exact":[]}}. At this point the user might still type an extra "n" and "g", but I don't think they will type "feature" once they see that the entry is auto-completed all the way after having typed "the lo". So I would suggest not adding autocomplete entries all the way to the end of the keyword, but stopping at the end of the word that yields a unique result. I have a new script that takes care of this, and also handles matching of words in the middle of entries.
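
To make the size difference concrete, here is a minimal Python sketch of both schemes. It is not the script mentioned above and not what generate-names.pl does internally: the function names are hypothetical, and the index is simplified to a plain prefix-to-names mapping without the separate "prefix" and "exact" lists.

```python
from collections import defaultdict


def all_prefixes(name):
    """Every prefix of `name`, as in the current scheme described above."""
    return [name[:i] for i in range(1, len(name) + 1)]


def build_prefix_index(names, stop_when_unique=False):
    """Map each prefix to the sorted names that start with it.

    With stop_when_unique=True, a name stops contributing prefixes at the
    end of the first word whose prefix matches that name alone.
    """
    full = defaultdict(set)
    for name in names:
        for prefix in all_prefixes(name):
            full[prefix].add(name)
    if not stop_when_unique:
        return {p: sorted(ns) for p, ns in full.items()}

    pruned = {}
    for name in names:
        unique = False
        for i in range(1, len(name) + 1):
            prefix = name[:i]
            pruned[prefix] = sorted(full[prefix])
            unique = unique or len(full[prefix]) == 1
            at_word_end = i == len(name) or name[i] == " "
            if unique and at_word_end:
                break  # the rest of the name adds nothing new
    return pruned


names = ["the long feature", "the short feature"]
print(len(build_prefix_index(names)))                         # 29
print(len(build_prefix_index(names, stop_when_unique=True)))  # 13
```

For these two names the pruned index keeps "the long" and "the short" as its longest keys, which is fewer than half of the original entries. The saving grows with the length of the description stored in the name field.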

My only problem at this point is that the results coming back from the name store are sorted somewhere, and I did not find a way to disable that. This is a problem when searching for "wi" if there is an entry called "feature with name": the current algorithm will replace "wi" with "fe". This can be prevented if the array returned from the name store is not sorted. In that case the array being returned can look like this: "wi":{"partial":["wi","feature with name"],"exact":[]}. If the array is not sorted, autocomplete will replace "wi" with the first entry in the "partial" array, yielding "wi".
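
The effect of that sort can be shown with a small Python example. It only mimics the behaviour described above (the first entry of the returned array wins); where, or whether, the JBrowse client sorts is not established in this thread, and the function name is made up.

```python
def first_suggestion(matches, sort_results):
    """Return the entry an autocomplete box would show first."""
    ordered = sorted(matches) if sort_results else list(matches)
    return ordered[0] if ordered else None


matches = ["wi", "feature with name"]
print(first_suggestion(matches, sort_results=True))   # feature with name
print(first_suggestion(matches, sort_results=False))  # wi
```

Keeping the stored order lets the indexing script decide which entry is offered first, which is what matching on words in the middle of a name needs.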

Any pointers on disabling the (JavaScript?) sort applied to the entries returned from the name store would be appreciated. I can then submit my new (Python) script with the changes outlined above. This might just work for the 3.9 GB GFF file.

cmdcolin commented 8 years ago

There might be something particular about this dataset that causes problems.

We have run generate-names with default parameters on a fairly large number of tracks and features.

Have generated a 65GB name store(!)

Might have to do with these weird fulltext things being stored in name field. As mentioned here...jbrowse's generate-names isn't really designed for a full text search

thomasvangurp commented 8 years ago

What about http://loopj.com/jquery-tokeninput/ ? This would make searches a lot more intuitive..

cmdcolin commented 8 years ago

@thomasvangurp the link that you suggest seems to have a server side component to narrow down the results (with some database backend)

The jbrowse search box can use something similar using the "JBrowse Names REST API" (http://gmod.org/wiki/JBrowse_Configuration_Guide#JBrowse_REST_Names_API)

This jbrowse names REST API can point to a custom server side script that does full text searches for example
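
As a rough idea of what such a script could look like, here is a minimal sketch using only the Python standard library. The `equals` and `startswith` query parameters and the name/location response shape are recalled from the configuration guide linked above and should be checked against it. The record contents, the word-level matching rule and the port are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Toy in-memory records; a real backend would query a database instead.
RECORDS = [
    {"name": "chitinase class I",
     "location": {"ref": "ctg1", "start": 100, "end": 900}},
    {"name": "putative chitinase",
     "location": {"ref": "ctg2", "start": 50, "end": 400}},
]


def search(records, equals=None, startswith=None):
    """Exact match for `equals`; word-level prefix match for `startswith`."""
    if equals is not None:
        return [r for r in records if r["name"].lower() == equals.lower()]
    if startswith:
        needle = startswith.lower()
        # Match the start of any word, so text in the middle of a name is found.
        return [r for r in records
                if any(w.startswith(needle) for w in r["name"].lower().split())]
    return []


class NamesHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        hits = search(RECORDS,
                      equals=query.get("equals", [None])[0],
                      startswith=query.get("startswith", [None])[0])
        body = json.dumps(hits).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the toy server until interrupted (the port is arbitrary)."""
    HTTPServer(("localhost", port), NamesHandler).serve_forever()
```

After calling `serve()`, a request with `?startswith=chit` would return both records, including the one where "chitinase" is not the first word. That is the kind of match the prefix-based names store cannot provide.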

cmdcolin commented 8 years ago

Alternatively a plugin that actually uses that script could be made too :)

I just think that the server side is sort of important since there is just a lot of data involved in the search

thomasvangurp commented 8 years ago

All right, that sounds good. Would it be possible to populate the search box with the keywords used, as is done in the loopj example? I could certainly write a back-end script with database connectivity. @cmdcolin could you do the front-end integration once the back end works?

cmdcolin commented 8 years ago

I am not an official jbrowse dev. I just comment overactively (I work on http://github.com/gmod/apollo). @enuggetry might be able to comment!

enuggetry commented 8 years ago

And thanks for commenting, @cmdcolin. @thomasvangurp, I thought the existing search widget was a bit clunky too. I like the idea and I'd like to improve it. I need to look into the client end a little more, but I don't think I'll get to it for another couple of weeks. I like the JBrowse plugin idea. I need to investigate jQuery integration in a JBrowse plugin, etc. On the surface, it looks pretty straightforward. I'd probably follow the JBrowse Names REST API query format.

thomasvangurp commented 8 years ago

@cmdcolin neither am I, but I will contribute a plugin. @enuggetry I'll look into making a plugin once I have more time.

thomasvangurp commented 8 years ago

@enuggetry and @cmdcolin : any progress on this?

cmdcolin commented 8 years ago

@thomasvangurp I made a plugin that could help https://github.com/cmdcolin/jbrowse_elasticsearch

It is quite similar to the default "generate-names.pl" script, except it just loads data into an "elasticsearch" database on the backend, and there is a small plugin for the frontend.

I was gonna mention it before but forgot !

cmdcolin commented 7 years ago

The original case that this issue was opened for was sort of pathological, so I might just close this for now. Refer to #634 for other search features!

GMOD / jbrowse

Issues with generate-names.pl on large data file #626