Consistent chunkSizeLimit issue

keiranmraine commented 10 years ago

Hi,

I've found that I am consistently unable to view the alignments2 track for human chromosome 21. Although I'm aware that I can increase the chunkSizeLimit (and I have, to 15MB), it is still regularly exceeded, specifically on this chromosome.

I suspect that there is little that can be done; however, this seems to occur when there is virtually no data in the region. I've seen chunk sizes as high as 47MB.

[Screenshot, 2014-11-27 at 16:29:57]

d: Too many BAM features. BAM chunk size 27,477,591 bytes exceeds chunkSizeLimit of 15,000,000.
    message: "Too many BAM features. BAM chunk size 27,477,591 bytes exceeds chunkSizeLimit of 15,000,000."
    _defaultMessage: "Too much data to show."

Tested in 1.11.3.

I can share examples privately for testing purposes.

Regards, Keiran

cmdcolin commented 10 years ago

I would definitely like to check this out. I feel like I have seen this happen on fairly sparse BAM files too.

keiranmraine commented 10 years ago

Have you a dropbox I could drop a dataset into?

cmdcolin commented 10 years ago

I sent you a link by email

vivekkrish commented 10 years ago

Just an FYI, the iPlant Collaborative provides a free-to-use data store (http://www.iplantcollaborative.org/ci/data-store) with quite a lot of capacity (100GB to start with).

One of the nice things about this data store is that BAM (and corresponding BAI) files stored there can be easily streamed to any genome browser of your choice (JBrowse, IGB, IGV, etc.). The service is CORS compliant, so it will work with a genome browser running on any server (documentation here: https://pods.iplantcollaborative.org/wiki/display/DEmanual/Sending+Genome+Files+to+the+Genome+Browser).

We routinely use this for hosting and sharing files with collaborators.

keiranmraine commented 10 years ago

@vivekkrish, thanks for this info, but unless it's possible to use authenticated access I can't use it: the data exhibiting this issue is human, and ethics requirements prevent me from putting it on publicly accessible resources.

keiranmraine commented 10 years ago

@cmdcolin I've dropped the files in. You only need chromosome 21 from the reference, but I put all of it in (compressed) so that it would match the BAM header.

vivekkrish commented 10 years ago

@keiranmraine, the data is private to you and associated with your personal iPlant account. They have ACLs in place that let you send sharing invites for specific files/folders to other registered iPlant user accounts.

However, they do provide this nice capability to generate unique shareable URLs for individual files (BAM, GFF3, etc.) for streaming to Genome Browsers, which can be accessed only by the users with whom you have explicitly shared the link (similar to how it works in Google Drive and/or Dropbox).

cmdcolin commented 10 years ago

Hi Keiran, what I am seeing is that the "stats estimation" step downloads too much data for this file. The procedure for stats estimation in JBrowse seems to be: (1) select a position somewhere in the middle of the chromosome, (2) download a 100 base pair range of data, and (3) if there are not enough features, double the interval and retry. In this case, not enough data is sampled until the window reaches 3,276,800 base pairs, and at that point something exceeds chunkSizeLimit. So perhaps there should be a limit on the stats estimation.
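
As a rough check on the arithmetic (a sketch only, not JBrowse code; doublingsToReach is a made-up helper name): starting from a 100 bp window and doubling on every retry, the window reaches 3,276,800 bp after 15 doublings, since 100 * 2^15 = 3,276,800.

```javascript
// Hypothetical helper, not part of JBrowse: counts how many doublings
// take a sampling window from startLength up to at least targetLength.
function doublingsToReach(startLength, targetLength) {
    var length = startLength;
    var doublings = 0;
    while (length < targetLength) {
        length *= 2;   // same "double the interval and retry" step as above
        doublings++;
    }
    return { doublings: doublings, length: length };
}

var result = doublingsToReach(100, 3276800);
console.log(result.doublings + " doublings, final window " + result.length + " bp");
```

So on a sparse region the estimation step makes many requests before it hits the failing window, and each window is twice the size of the last.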

cmdcolin commented 10 years ago

Proposed patch to add some randomness and a retry limit...

diff --git a/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js b/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js
index 137e3ad..b797033 100644
--- a/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js
+++ b/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js
@@ -22,10 +22,21 @@ return declare( null, {
         var deferred = new Deferred();

         refseq = refseq || this.refSeq;
+        var retries = 0;
+        var sampleCenter = refseq.start*0.75 + refseq.end*0.25;

         var statsFromInterval = function( length, callback ) {
             var thisB = this;
-            var sampleCenter = refseq.start*0.75 + refseq.end*0.25;
+            var reset = false;
+            if( length>10000 ) {
+                length = 100;
+                retries++;
+                sampleCenter = Math.round(Math.random()*(refseq.end));
+            }
+            if( retries>10 ) {
+                callback.call( thisB, length,  null, "Failed to estimate stats" );
+                return;
+            }

cmdcolin commented 10 years ago

My patch may need to allow lengths greater than 10000, since this code is also called for GFFs, which might be less feature dense (the stats calculation waits until 300 features are found). Apparently the default limit in JBrowse is to stop only once the whole chromosome has been read, but for BAM this seems prohibitively large and leads to these chunkSizeLimit errors.

I guess a lesson here might be to make this configurable as well (many of these values are hard-coded).
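
A minimal sketch of what a configurable version could look like. This is an illustration under assumptions, not the JBrowse implementation: the option names (initialLength, minFeatures, maxLength, maxRetries) are hypothetical and are not existing JBrowse configuration keys, and the function samples an in-memory array of feature positions instead of a real BAM or GFF store. Unlike the patch above, the random restart is drawn from between refseq.start and refseq.end.

```javascript
// Count features whose position falls in the half-open window [start, end).
function countFeatures(positions, start, end) {
    return positions.filter(function (p) { return p >= start && p < end; }).length;
}

// Sketch only: all option names are hypothetical, not JBrowse config keys.
// `random` is injectable (defaults to Math.random) so the sketch can be tested.
function estimateStats(positions, refseq, opts, random) {
    random = random || Math.random;
    var center = refseq.start * 0.75 + refseq.end * 0.25;
    var length = opts.initialLength;
    var retries = 0;

    while (retries <= opts.maxRetries) {
        var start = Math.max(refseq.start, Math.round(center - length / 2));
        var end = Math.min(refseq.end, Math.round(center + length / 2));
        var count = countFeatures(positions, start, end);

        if (count >= opts.minFeatures) {
            return {
                featureDensity: count / (end - start),
                sampledLength: end - start,
                retries: retries
            };
        }

        length *= 2;
        if (length > opts.maxLength) {
            // Window grew too large: restart small at a random position.
            length = opts.initialLength;
            retries++;
            center = refseq.start + Math.round(random() * (refseq.end - refseq.start));
        }
    }
    return null; // caller should treat this as "failed to estimate stats"
}

// Example: per-store limits, e.g. a tighter maxLength for BAM than for GFF.
var bamLimits = { initialLength: 100, minFeatures: 300, maxLength: 10000, maxRetries: 10 };
var gffLimits = { initialLength: 100, minFeatures: 300, maxLength: 1000000, maxRetries: 10 };
console.log(estimateStats([], { start: 0, end: 48000000 }, bamLimits));
```

Returning null rather than growing the window indefinitely lets the caller decide what to show when no estimate is available.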

keiranmraine commented 10 years ago

@cmdcolin, thanks for looking into this. Is there a time frame for the 1.11.6 point release? I'm especially in need of the XS read colouring fix for our production system.

@vivekkrish, thanks for the info, that's very useful. I'll have to confirm with our PI that he is happy with us using this, but I expect it will be a very useful tool.

cmdcolin commented 9 years ago

I'm not really sure my patch is the right approach. I think something has to be done to improve this more fundamentally.

GMOD / jbrowse

Consistent chunkSizeLimit issue #540