Simplified file sets for data generated by flatfile-to-json.pl

keiranmraine commented 8 years ago

@billzt, @cmdcolin, relates to #780.

Currently, when a GFF3 file is converted to a gene/transcript track with flatfile-to-json.pl, a folder and a minimum of 2 data files are generated per chromosome. For a human GRCh37 gene/transcript track with decoy and scaffolds, that comes to 984 lf-*.jsonz and 99 hist-*.jsonz files.

Have you thought about using tabix in a more novel way?

We use tabix to make pre-generated data structures easily accessible, specifically for gene data (everything after the first 3 columns is custom; the last column contains a Perl data structure for the transcript):

1       29553   31097   ENST00000473358 MIR1302-10      712     $VAR1 = bless( {'_genomicminpos' => 29554,'_accversion' => 1,'_ccds' => undef,'_dbvers
1       30266   31109   ENST00000469289 MIR1302-10      535     $VAR1 = bless( {'_genomicminpos' => 30267,'_accversion' => 1,'_ccds' => undef,'_dbvers

You could build a standard JSON structure for each gene but write it to file as

chr 1-start 1-end JSON

1 line per gene, and then bgzip and index with tabix:

bgzip lf.json
tabix -s 1 -b 2 -e 3 lf.json.gz
bgzip hist.json
tabix -s 1 -b 2 -e 3 hist.json.gz
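
As a rough sketch of the writing side, assuming each gene has already been reduced to a plain dictionary: the helper name `write_tabix_ready` and the keys `chr`, `start`, `end`, `name` and `id` are hypothetical and not part of flatfile-to-json.pl. tabix needs its input sorted by sequence name and start position, so the sketch sorts before writing. Python is used here only for illustration.

```python
import io
import json


def write_tabix_ready(features, out):
    """Write one 'chr<TAB>start<TAB>end<TAB>JSON' line per feature.

    `features` is an iterable of dicts with the hypothetical keys
    'chr', 'start' (1-based) and 'end'; the whole dict is stored as JSON.
    """
    # tabix expects rows sorted by sequence name, then by start position.
    ordered = sorted(features, key=lambda f: (str(f["chr"]), f["start"], f["end"]))
    for f in ordered:
        # json.dumps escapes tabs and newlines inside strings, so every
        # record stays on a single line with three leading columns.
        payload = json.dumps(f, separators=(",", ":"), sort_keys=True)
        out.write(f"{f['chr']}\t{f['start']}\t{f['end']}\t{payload}\n")


# Illustrative records only, loosely based on the two transcripts shown earlier.
genes = [
    {"chr": "1", "start": 30267, "end": 31109, "name": "MIR1302-10", "id": "ENST00000469289"},
    {"chr": "1", "start": 29554, "end": 31097, "name": "MIR1302-10", "id": "ENST00000473358"},
]
buf = io.StringIO()
write_tabix_ready(genes, buf)
lines = buf.getvalue().splitlines()
```

Written to a real file instead of a `StringIO` buffer, the output could go through the bgzip and tabix commands above unchanged.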

This would replace the 1000+ files with 4 for the whole genome: lf.json.gz and hist.json.gz, plus their indexes lf.json.gz.tbi and hist.json.gz.tbi.

Even if one file is maintained per chromosome, this would still reduce the count to 184 (46 chr * 4).

cmdcolin commented 8 years ago

Technically, what you propose is like a BED file with a bunch of info encoded in the 4th column. Since BED and BEDTabix are now mainline, there's nothing blocking the use of a format like this; it just needs some code to convert data into this format and interpret it.
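
For the "interpret it" half, a minimal sketch of decoding one row, assuming the layout proposed above (three coordinate columns followed by a JSON payload in the 4th). The function name `parse_row` and the JSON keys are hypothetical. An actual browser-side implementation would be in JavaScript; Python is used here only to illustrate the idea.

```python
import json


def parse_row(line):
    """Split a 'chr<TAB>start<TAB>end<TAB>JSON' row into coordinates and payload."""
    # maxsplit=3 keeps the JSON column in one piece whatever it contains.
    chrom, start, end, payload = line.rstrip("\n").split("\t", 3)
    return chrom, int(start), int(end), json.loads(payload)


# Illustrative row only; the values are not taken from a real track.
row = '1\t29554\t31097\t{"id":"ENST00000473358","name":"MIR1302-10"}\n'
chrom, start, end, feature = parse_row(row)
```

Because the coordinates sit in plain columns, a region query can be answered from the tabix index alone, and the JSON only has to be parsed for the rows that are actually returned.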

rbuels commented 6 years ago

Any implementation of this would need to be careful that the "old" format made by flatfile-to-json.pl still worked in the browser.

GMOD / jbrowse

Simplified file sets for data generated by flatfile-to-json.pl #785