GMOD / jbrowse

JBrowse 1, a full-featured genome browser built with JavaScript and HTML5. For JBrowse 2, see https://github.com/GMOD/jbrowse-components.
http://jbrowse.org

flatfile-to-json.pl - insufficient data yields incomplete histograms #612

Closed enuggetry closed 7 years ago

enuggetry commented 9 years ago

When loading flat genome data through “flatfile-to-json.pl”, JBrowse uses some form of heuristic method to generate histograms. Unfortunately, in the absence of sufficient data, the method outputs either incomplete or non-existent histograms. In the worst-case scenario, users of the application will wind up at an “infinite” loading box for feature histograms or may even encounter large blank spaces on a given chromosome within the JBrowse genome viewer, which [falsely] suggests that a given track does not have data.

Submitted by Mary Shimoyama

cmdcolin commented 9 years ago

I discussed this bug via email with Aurash a long time ago. Essentially, I think they just had a bad track configuration.

They said in their email that they also had this error:

From the console
Warning: "Unable to determine an appropriate data store to use with track 'undefined', please explicitly specify a storeClass in the configuration." (dojo.js:1050)

From the JavaScript console
Error: "TypeError: a is not a constructor" (dojo.js:32)

You can search for this warning in the JBrowse source code, and it looks like it indicates a track config problem. I think it would probably also lead to the TypeError after that, and when you have a fatal JavaScript TypeError, the code that was running stops, so you get weird things like "infinite loading bars".

Therefore, I think the issue with the histograms is just a red herring for the bad track config.
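
One way to check for the kind of configuration problem described above is to scan the track list for entries that leave JBrowse guessing the store. The sketch below is illustrative only: it assumes the configuration is a JSON document with a top-level "tracks" array whose entries carry "label" and "storeClass" keys, and the function name is made up for this example. A missing storeClass is not always an error, since JBrowse tries to infer one, but these are the entries that can trigger the warning quoted above.

```python
import json

def tracks_missing_store_class(track_list_json):
    """Return labels of track entries that do not set a storeClass."""
    config = json.loads(track_list_json)
    missing = []
    for track in config.get("tracks", []):
        if not track.get("storeClass"):
            # Fall back to a placeholder when the label is absent as well.
            missing.append(track.get("label", "<no label>"))
    return missing

# Hypothetical two-track configuration, used only for demonstration.
example = json.dumps({
    "tracks": [
        {"label": "genes", "storeClass": "JBrowse/Store/SeqFeature/NCList"},
        {"label": "DEBUG_25"},
    ]
})
print(tracks_missing_store_class(example))
```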

cmdcolin commented 9 years ago

Full text of emails: http://pastebin.com/eR6iuKvf. If they have a live example showing the bug, that would help.

enuggetry commented 9 years ago

Thanks. Noted.

halfwayBraindead commented 9 years ago

Gents,

I do not think this issue stems from an allegedly faulty track configuration, as that would [only] hamper actual display of track features within the genome browser. Even when JBrowse attempts to write the actual histogram files following track insertion via "flatfile-to-json.pl", it fails to do so--see below:

-bash-3.2$ echo && ls -lhv ./tracks/DEBUG_25/*

./tracks/DEBUG_25/Chr1: total 664K -rw-r--r-- 1 rgdpub rgdpub 120 Jul 20 18:18 hist-5000000-0.json -rw-r--r-- 1 rgdpub rgdpub 48K Jul 20 18:18 lf-1.json -rw-r--r-- 1 rgdpub rgdpub 44K Jul 20 18:18 lf-2.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-3.json -rw-r--r-- 1 rgdpub rgdpub 48K Jul 20 18:18 lf-4.json -rw-r--r-- 1 rgdpub rgdpub 49K Jul 20 18:18 lf-5.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-6.json -rw-r--r-- 1 rgdpub rgdpub 40K Jul 20 18:18 lf-7.json -rw-r--r-- 1 rgdpub rgdpub 49K Jul 20 18:18 lf-8.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-9.json -rw-r--r-- 1 rgdpub rgdpub 41K Jul 20 18:18 lf-10.json -rw-r--r-- 1 rgdpub rgdpub 32K Jul 20 18:18 lf-11.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-12.json -rw-r--r-- 1 rgdpub rgdpub 39K Jul 20 18:18 lf-13.json -rw-r--r-- 1 rgdpub rgdpub 44K Jul 20 18:18 lf-14.json -rw-r--r-- 1 rgdpub rgdpub 19K Jul 20 18:18 names.txt -rw-r--r-- 1 rgdpub rgdpub 2.6K Jul 20 18:18 trackData.json

./tracks/DEBUG_25/Chr2: total 340K -rw-r--r-- 1 rgdpub rgdpub 116 Jul 20 18:18 hist-5000000-0.json -rw-r--r-- 1 rgdpub rgdpub 48K Jul 20 18:18 lf-1.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-2.json -rw-r--r-- 1 rgdpub rgdpub 48K Jul 20 18:18 lf-3.json -rw-r--r-- 1 rgdpub rgdpub 40K Jul 20 18:18 lf-4.json -rw-r--r-- 1 rgdpub rgdpub 16K Jul 20 18:18 lf-5.json -rw-r--r-- 1 rgdpub rgdpub 48K Jul 20 18:18 lf-6.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-7.json -rw-r--r-- 1 rgdpub rgdpub 22K Jul 20 18:18 lf-8.json -rw-r--r-- 1 rgdpub rgdpub 10K Jul 20 18:18 names.txt -rw-r--r-- 1 rgdpub rgdpub 2.5K Jul 20 18:18 trackData.json

./tracks/DEBUG_25/Chr3: total 404K -rw-r--r-- 1 rgdpub rgdpub 29K Jul 20 18:18 lf-1.json -rw-r--r-- 1 rgdpub rgdpub 77K Jul 20 18:18 lf-2.json -rw-r--r-- 1 rgdpub rgdpub 35K Jul 20 18:18 lf-3.json -rw-r--r-- 1 rgdpub rgdpub 32K Jul 20 18:18 lf-4.json -rw-r--r-- 1 rgdpub rgdpub 29K Jul 20 18:18 lf-5.json -rw-r--r-- 1 rgdpub rgdpub 49K Jul 20 18:18 lf-6.json -rw-r--r-- 1 rgdpub rgdpub 48K Jul 20 18:18 lf-7.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-8.json -rw-r--r-- 1 rgdpub rgdpub 30K Jul 20 18:18 lf-9.json -rw-r--r-- 1 rgdpub rgdpub 7.7K Jul 20 18:18 names.txt -rw-r--r-- 1 rgdpub rgdpub 2.3K Jul 20 18:18 trackData.json

./tracks/DEBUG_25/Chr4: total 324K -rw-r--r-- 1 rgdpub rgdpub 46K Jul 20 18:18 lf-1.json -rw-r--r-- 1 rgdpub rgdpub 49K Jul 20 18:18 lf-2.json -rw-r--r-- 1 rgdpub rgdpub 33K Jul 20 18:18 lf-3.json -rw-r--r-- 1 rgdpub rgdpub 39K Jul 20 18:18 lf-4.json -rw-r--r-- 1 rgdpub rgdpub 44K Jul 20 18:18 lf-5.json -rw-r--r-- 1 rgdpub rgdpub 49K Jul 20 18:18 lf-6.json -rw-r--r-- 1 rgdpub rgdpub 40K Jul 20 18:18 lf-7.json -rw-r--r-- 1 rgdpub rgdpub 6.9K Jul 20 18:18 names.txt -rw-r--r-- 1 rgdpub rgdpub 2.3K Jul 20 18:18 trackData.json

./tracks/DEBUG_25/Chr5: total 356K -rw-r--r-- 1 rgdpub rgdpub 38K Jul 20 18:18 lf-1.json -rw-r--r-- 1 rgdpub rgdpub 46K Jul 20 18:18 lf-2.json -rw-r--r-- 1 rgdpub rgdpub 40K Jul 20 18:18 lf-3.json -rw-r--r-- 1 rgdpub rgdpub 43K Jul 20 18:18 lf-4.json -rw-r--r-- 1 rgdpub rgdpub 49K Jul 20 18:18 lf-5.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-6.json -rw-r--r-- 1 rgdpub rgdpub 47K Jul 20 18:18 lf-7.json -rw-r--r-- 1 rgdpub rgdpub 20K Jul 20 18:18 lf-8.json -rw-r--r-- 1 rgdpub rgdpub 8.9K Jul 20 18:18 names.txt -rw-r--r-- 1 rgdpub rgdpub 2.3K Jul 20 18:18 trackData.json

Please take special note of how chromosomes 3 through 5 lack histograms entirely. Our team suspects this stems from a logic issue (or set of issues) within the "flatfile-to-json.pl" script: it is somehow failing to generate histograms on those chromosomes, despite actual features being present.

Further, "flatfile-to-json.pl" is not throwing any warnings or errors while loading these tracks; it gives every indication of a successful track load.
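
Because the script reports nothing, the gap has to be detected from the output directory itself. The following is a minimal sketch of that check, assuming the layout shown in the listing (one subdirectory per reference sequence, with histogram files named "hist-*.json"); the function name and the throwaway demo directory are made up for illustration.

```python
from pathlib import Path
import tempfile

def refseqs_without_histograms(track_dir):
    """Return names of reference-sequence subdirectories with no hist-*.json file."""
    missing = []
    for refseq_dir in sorted(Path(track_dir).iterdir()):
        if refseq_dir.is_dir() and not any(refseq_dir.glob("hist-*.json")):
            missing.append(refseq_dir.name)
    return missing

# Demo on a temporary directory that mirrors a small part of the listing.
with tempfile.TemporaryDirectory() as tmp:
    layout = {
        "Chr1": ["hist-5000000-0.json", "lf-1.json"],
        "Chr3": ["lf-1.json"],
    }
    for refseq, files in layout.items():
        refseq_path = Path(tmp) / "DEBUG_25" / refseq
        refseq_path.mkdir(parents=True)
        for name in files:
            (refseq_path / name).write_text("{}")
    demo_result = refseqs_without_histograms(Path(tmp) / "DEBUG_25")
    print(demo_result)
```

Pointed at a real track directory such as ./tracks/DEBUG_25, the same function would list the chromosomes whose histograms were never written.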

So, what does this actually look like in our development instance of JBrowse?

[Image: missinghistograms]

The top-most track is the genes and transcripts track for the rn5 genome, at full feature density. The "DEBUG_10/25/50" tracks represent a 10x, 25x, and 50x reduction in feature density relative to the top-most track, respectively, so the bottom-most track retains only 2% of the original feature density.

As can be seen, the histograms predictably grow sparser and coarser as feature density diminishes, until they drop off completely in "DEBUG_50". But is that track truly devoid of gene features?

[Image: missinghistograms_2]

No, it is not empty, and JBrowse should ideally still be generating histograms for this track.

GZipped copies of the GFF3s used in this test can be found below:

DEBUG_10 DEBUG_25 DEBUG_50

The genome assembly used as the reference sequence for this test was Rnor v5.0 (i.e., Rat 5) from NCBI.

Finally, please note that this test was performed on the latest public release of JBrowse, version 1.11.6.

enuggetry commented 9 years ago

Thanks for the clear illustration, @halfwayBraindead, I see what you're saying.

cmdcolin commented 9 years ago

I suppose this is indeed a confirmed bug. Good test data. Let me know if I can help fix it; I think I can reproduce the failure case.

halfwayBraindead commented 9 years ago

I've been spending some time trying to work around this histogram problem, and I generated BigWigs from the GFF3s in question. However, the NCList storeClass of JBrowse does not appear (from my use cases in 1.11.6) to support custom histogram specification.

It is well known that the BAM storeClass allows for custom user specification of histograms, such as BigWigs, and to my delight, the GFF3 storeClass (enabling direct reading of GFF3s without needing to use "flatfile-to-json.pl" a priori) also supports custom histogram specification!

Unfortunately, after attempting to load a production-grade GFF3 (on the order of hundreds of Megabytes), the genome browser crashed even more rapidly than it does with poorly configured BAM data--in other words, the GFF3 parser of JBrowse is non-ideal. It's pretty slow, and even the source commenting of its main method declares that it requires significant refactoring.

In any case, here's an idea that could enable a workaround solution without needing too much effort from the JBrowse team: why not enable custom histogram specification for the NCList storeClass? This sort of code already exists for BAM and GFF3 storeClasses, so why not port this code over to the NCList storeClass as an optional user definition?

That way, users can generate their own Bedgraphs and Wiggles; they can define and take responsibility for their own histograms without requiring an extensive examination and/or re-write of the existing histogram generation method(s) in "flatfile-to-json.pl".

This suggestion is being made with the knowledge in mind that JBrowse development resources are limited. What do you think?
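
To make the "generate your own histograms" idea concrete, here is a rough sketch of binning GFF3 features into bedGraph lines. It is not how "flatfile-to-json.pl" computes its histograms; the function name, the bin size, and the sample records are assumptions made for illustration. It counts the features overlapping each fixed-width bin and converts GFF3's 1-based inclusive coordinates to bedGraph's 0-based half-open intervals.

```python
from collections import Counter

def gff3_to_bedgraph(gff3_lines, bin_size):
    """Count features overlapping each fixed-width bin; return bedGraph lines."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a feature line
        seqid, start, end = cols[0], int(cols[3]), int(cols[4])
        # GFF3 coordinates are 1-based inclusive; shift to 0-based for binning.
        first_bin = (start - 1) // bin_size
        last_bin = (end - 1) // bin_size
        for bin_index in range(first_bin, last_bin + 1):
            counts[(seqid, bin_index)] += 1
    # Note: the final bin may extend past the end of the sequence; clip it to
    # the chromosome length before handing the file to other tools.
    return [
        f"{seqid}\t{b * bin_size}\t{(b + 1) * bin_size}\t{n}"
        for (seqid, b), n in sorted(counts.items())
    ]

# Made-up sample records, not taken from the DEBUG GFF3 files.
sample = [
    "##gff-version 3",
    "Chr3\ttest\tgene\t1\t100\t.\t+\t.\tID=g1",
    "Chr3\ttest\tgene\t90\t250\t.\t-\t.\tID=g2",
]
bedgraph = gff3_to_bedgraph(sample, 100)
print("\n".join(bedgraph))
```

The resulting bedGraph could then be converted to a BigWig with a separate tool such as UCSC's bedGraphToBigWig, which is outside the scope of this sketch.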

selewis commented 9 years ago

+1

cmdcolin commented 9 years ago

I think I found a patch that makes the histogram section of the config file usable with tracks that were generated with flatfile-to-json. You can check it out on the master jbrowse branch.

The diff:

diff --git a/src/JBrowse/View/Track/CanvasFeatures.js b/src/JBrowse/View/Track/CanvasFeatures.js
index 66f7775..dcd8a70 100644
--- a/src/JBrowse/View/Track/CanvasFeatures.js
+++ b/src/JBrowse/View/Track/CanvasFeatures.js
@@ -362,7 +362,7 @@ return declare(
             basesPerBin: basesPerBin
         };

-        if( this.store.getRegionFeatureDensities ) {
+        if( !this.config.histograms.store&&this.store.getRegionFeatureDensities ) {

halfwayBraindead commented 9 years ago

Checked out and tested the "Master" branch, and the histogram storeClass works well now within NCList tracks! A definite plus.
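
For readers wanting to try the same setup, a track entry along the following lines is roughly what is being described. This is a hedged sketch, not a verified configuration: the "histograms" sub-keys follow the pattern the configuration guide describes for BAM tracks as far as I understand it, the file paths are hypothetical, and the exact keys should be checked against the JBrowse version in use.

```python
import json

# Hypothetical NCList feature track that takes its histogram from a BigWig.
track = {
    "label": "DEBUG_50",
    "type": "JBrowse/View/Track/CanvasFeatures",
    "storeClass": "JBrowse/Store/SeqFeature/NCList",
    "urlTemplate": "tracks/DEBUG_50/{refseq}/trackData.json",
    "histograms": {
        # Assumed keys: a separate store supplying the histogram data.
        "storeClass": "JBrowse/Store/SeqFeature/BigWig",
        "urlTemplate": "bigwig/DEBUG_50.bw",  # hypothetical path
    },
}
print(json.dumps(track, indent=2))
```

With the patch above, a configured histogram store takes precedence over the densities that the NCList store would otherwise report.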

Another significant issue with this approach cropped up, though:

[Image: histogram_scaling]

As can be seen, the scaling of BigWig histogram bars seems "off", and even when inserting a basic BigWig track (whole track dedicated to BigWig file) from the configuration guide, the same issue appears:

[Image: histogram_scaling2]

Turns out that the "autoscale" parameter resolves this scaling issue for BigWig tracks, though--the configuration guide claims that it's set to the value of "local" by default, but it appears not to be in the "Master" branch (default appears to be the "global" value):

[Image: histogram_scaling3]
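
The difference between the two "autoscale" values can be illustrated with a simplified model. This is not JBrowse's implementation; the function names and numbers are invented. The idea is that "global" scales bars against the extremes of the whole file, so a single large outlier elsewhere flattens everything in view, while "local" scales against the visible region only.

```python
def scale_bounds(all_scores, visible_scores, autoscale="local"):
    """Pick the (min, max) used for scaling under a simplified autoscale model."""
    source = visible_scores if autoscale == "local" else all_scores
    return min(source), max(source)

def bar_heights(visible_scores, bounds, track_height=100):
    """Map scores to pixel heights within the given bounds."""
    low, high = bounds
    span = (high - low) or 1  # avoid dividing by zero on flat data
    return [round((score - low) / span * track_height) for score in visible_scores]

all_scores = [0, 2, 3, 1, 500]  # one extreme value outside the view
visible = [0, 2, 3, 1]

global_heights = bar_heights(visible, scale_bounds(all_scores, visible, "global"))
local_heights = bar_heights(visible, scale_bounds(all_scores, visible, "local"))
print(global_heights)
print(local_heights)
```

Under the global bounds the visible bars are almost invisible, which resembles the "off" scaling in the screenshots above; under local bounds they fill the track height.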

Unfortunately, no counterpart method appears to exist for the histogram storeClass within CanvasFeatures tracks such as GFF3/BAM/NCList - would be great to port that sort of code over!

(And, also, to greatly expand the feature capability of the histogram storeClass within CanvasFeatures tracks: as Rob wrote previously, "all you can really change about how it looks is its color". Such a venture would likely tie-in closely with #624.)

cmdcolin commented 7 years ago

I think the original issue that this thread was about got fixed! The configuration of BigWigs and BigWig summaries on feature tracks is probably a separate issue.