GMOD / jbrowse

JBrowse 1, a full-featured genome browser built with JavaScript and HTML5. For JBrowse 2, see https://github.com/GMOD/jbrowse-components.
http://jbrowse.org

Trix index support #1035

Open Yating-L opened 6 years ago

Yating-L commented 6 years ago

It would be nice to support a Trix index (ixx) for the name index. The current generate-names.pl creates a large number of files, which can cause problems when transferring, downloading, and storing JBrowse data.

UCSC uses a Trix index for fast free-text lookup: https://genome.ucsc.edu/goldenpath/help/trix.html Utility: ixIxx, which creates the indices from a simple line-oriented text file. It can be downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/

rbuels commented 6 years ago

Implementation sketch:

The primary advantage over the current Hash implementation is that the index consists of only 2 files, instead of a big file tree.
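To make the two-file idea concrete, here is a rough sketch of how a lookup against an .ix/.ixx pair could work. It assumes that each .ixx line is a 5-character key prefix followed by a 10-digit hexadecimal byte offset into the .ix file, and that the .ix file is sorted by lowercased word. Both are assumptions on my part and should be checked against the UCSC trix documentation. The function names are hypothetical and are not part of JBrowse or the UCSC tools.

```python
import io

PREFIX_SIZE = 5     # assumed width of the key prefix on each .ixx line
OFFSET_DIGITS = 10  # assumed width of the hexadecimal offset on each .ixx line


def parse_ixx(ixx_text):
    """Return a list of (prefix, byte_offset) pairs parsed from .ixx text."""
    entries = []
    for line in ixx_text.splitlines():
        if not line:
            continue
        prefix = line[:PREFIX_SIZE]
        offset = int(line[PREFIX_SIZE:PREFIX_SIZE + OFFSET_DIGITS], 16)
        entries.append((prefix, offset))
    return entries


def find_start_offset(entries, word):
    """Offset of the last .ixx entry whose prefix sorts at or before the word's prefix."""
    word_prefix = word[:PREFIX_SIZE].ljust(PREFIX_SIZE)
    start = 0
    for prefix, offset in entries:
        if prefix > word_prefix:
            break
        start = offset
    return start


def lookup(ix_bytes, entries, word):
    """Scan the .ix data from the chosen offset and return the hits for an exact word."""
    word = word.lower()
    stream = io.BytesIO(ix_bytes)
    stream.seek(find_start_offset(entries, word))
    for raw in stream:
        key, _, rest = raw.decode().rstrip("\n").partition(" ")
        if key == word:
            return rest.split()
        if key > word:
            break  # the file is sorted, so the word cannot appear later
    return []
```

In a browser, the seek would presumably become an HTTP range request against the .ix file, so only the small .ixx file has to be fetched in full.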

cmdcolin commented 6 years ago

A trix store is implemented by the Dalliance browser, so that is an interesting point of reference.

In their sample browser, the data in the trix index associates a name with other feature identifiers, like this:

eif4a1 ENST00000293831.8,1 ENST00000380512.5,1 ENST00000396527.3,1 ENST00000577269.1,1 ENST00000577731.1,1 ENST00000577738.1,1 ENST00000577929.1,1 ENST00000578324.1,1 ENST00000578476.1,1 ENST00000578495.1,1 ENST00000578569.1,1 ENST00000578754.1,1 ENST00000579085.1,1 ENST00000579139.1,1 ENST00000580461.1,1 ENST00000580886.1,1 ENST00000580888.1,1 ENST00000581384.1,1 ENST00000581544.1,1 ENST00000581770.1,1 ENST00000581808.1,1 ENST00000581841.1,1 ENST00000582050.1,1 ENST00000582169.1,1 ENST00000582213.1,1 ENST00000582746.1,1 ENST00000582848.1,1 ENST00000583217.1,1 ENST00000583389.1,1 ENST00000583802.1,1 ENST00000583899.1,1 ENST00000584054.1,1 ENST00000584712.1,1 ENST00000584784.1,1 ENST00000584798.1,1 ENST00000584860.1,1 ENST00000584901.1,1 ENST00000585024.1,1
eif4a1p1 ENST00000420241.1,1
eif4a1p10 ENST00000428832.2,1
eif4a1p11 ENST00000451239.1,1
eif4a1p12 ENST00000551910.1,1
eif4a1p13 ENST00000415667.1,1
eif4a1p2 ENST00000422633.1,1 ENST00000545933.1,1
eif4a1p3 ENST00000411521.1,1
eif4a1p5 ENST00000428062.1,1
eif4a1p6 ENST00000448133.1,1
eif4a1p7 ENST00000421800.1,1

That's in the .ix file. After obtaining this "match" from the trix index, the browser goes back to the bigBed file and uses what they call the "extra index" (see BBIExtraIndex.prototype.lookup in their codebase) to find the feature location.

The UCSC documentation also describes extra indexes, and allows them on arbitrary fields:
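As a small illustration of the .ix lines shown above, here is a sketch that splits one line into the indexed word and its list of hits. It assumes the number after each comma is the word's position within the indexed text for that identifier. That reading and the names `Hit` and `parse_ix_line` are my own, not taken from Dalliance or UCSC code.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    item_id: str   # identifier the word was found in, e.g. a transcript ID
    word_pos: int  # assumed: position of the word in that item's text


def parse_ix_line(line):
    """Split an .ix line into (word, [Hit, ...])."""
    fields = line.split()
    word, raw_hits = fields[0], fields[1:]
    hits = []
    for raw in raw_hits:
        # Split on the last comma so identifiers containing dots or commas survive.
        item_id, _, pos = raw.rpartition(",")
        hits.append(Hit(item_id, int(pos)))
    return word, hits
```

The identifiers returned here carry no coordinates, which is why a second lookup (the bigBed extra index in Dalliance's case) is still needed to get a location.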

https://genome.ucsc.edu/goldenpath/help/bigBed.html

I guess I just wanted to show that because it raises the question of whether we want to "wrap trix", like we talked about before, so that the location data lives in the trix file itself.

cmdcolin commented 6 years ago

Also see this thread about the concept of partial match searches https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome-mirror/loZy2Ps7sDU
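One simple reading of "partial match" is prefix search. Since the words in an index like this are sorted, a prefix query can be sketched as a binary search followed by a short scan. This is only an illustration over an in-memory list of words; `prefix_matches` is a hypothetical helper, and it is not meant to describe what the linked thread proposes or what UCSC implements.

```python
import bisect


def prefix_matches(sorted_words, prefix):
    """Return all words in a sorted list that start with the given prefix."""
    prefix = prefix.lower()
    # First position where a word >= prefix could be; matches are contiguous from there.
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches
```

With the sample names above, a query for "eif4a1p1" would return both eif4a1p1 and eif4a1p10, which is the kind of behavior a search box with autocompletion would want.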