deeptools / deepTools

Tools to process and analyze deep sequencing data.

Potential for out of order entry addition when writing bigWig files #213

Closed dpryan79 closed 8 years ago

dpryan79 commented 8 years ago

In the bigWig creation code, I've assumed that the bedGraph would be in the same sort order as the BAM file header. This isn't actually the case, so entries can end up being added out of order. The solutions are (1) write a different sort utility or (2) sort the chromosome list that goes into pyBigWig to match the file. Option 2 is certainly simpler.
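A minimal sketch of option 2, assuming the bedGraph names are ordered byte-wise (as `LC_ALL=C sort -k1,1` would order them). The function name and the example header are hypothetical, not taken from the deepTools code:

```python
def sort_header_like_c_locale(chrom_sizes):
    """Order (name, length) pairs the way `LC_ALL=C sort -k1,1` orders names."""
    # Comparing the encoded bytes mimics the byte-wise comparison of the C locale.
    return sorted(chrom_sizes, key=lambda pair: pair[0].encode("utf-8"))


# Hypothetical BAM header order, as (name, length) pairs.
bam_header = [("chr1", 1000), ("chr2", 900), ("chr10", 500), ("chrX", 400)]
sorted_header = sort_header_like_c_locale(bam_header)
print(sorted_header)
```

The reordered list is what would then be handed to the bigWig writer as its header, so that chromosomes are declared in the same order in which their entries arrive.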

steffenheyne commented 8 years ago

I just want to stress that the sort order (of bedGraph/bigWig) is important. I once had a lot of trouble syncing the outputs of different tools (deepTools and others) because they didn't care about the chromosome order. The additional difficulty was that I couldn't simply use Linux sort, as some other sort magic happened internally. So it would be nice to have either the sort order of the BAM index or a reproducible sort order for the bedGraph/bigWig.

How is the sort order of the current bedgraph output defined?

dpryan79 commented 8 years ago

At the moment the sort is reproducible with LC_ALL=C sort -k1,1 -k2n,2 some_file. I would generally prefer to keep the sort order of the BAM file, but that's rather more difficult to do (we'd need to write a file merger, though I guess that that wouldn't be too terrible to do).
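For anyone who needs to reproduce that order outside the shell, here is a rough Python equivalent of the sort command above. It assumes whitespace-separated bedGraph lines with no header or track lines, and the helper name is made up for illustration:

```python
def bedgraph_sort_key(line):
    """Key for one bedGraph line: chromosome name byte-wise, then numeric start."""
    fields = line.split()
    # Bytes comparison stands in for LC_ALL=C; int() stands in for the numeric key.
    return (fields[0].encode("utf-8"), int(fields[1]))


lines = [
    "chr2\t0\t50\t1.0",
    "chr10\t100\t200\t3.0",
    "chr1\t50\t100\t2.0",
    "chr10\t0\t100\t0.5",
    "chr1\t0\t50\t4.0",
]
sorted_lines = sorted(lines, key=bedgraph_sort_key)
for line in sorted_lines:
    print(line)
```

Note that chr10 sorts before chr2 here, which is exactly where this order differs from a typical BAM header.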

steffenheyne commented 8 years ago

Ah, and what actually happens in the bigWig with chromosomes that appear in the BAM index but have no values at all in the bedGraph? It would be nice if at least these chromosomes (with the right size) appeared in the bigWig index (does something like this exist?). I think I also once had problems with another tool that stripped out these chromosomes...

When using "--missingDataAsZero" and writing bigWig files, is it possible to have these chromosomes filled with zeros as well?

All this was important for me to get a proper genome-wide binning from bigWig files (i.e. always the same number of chromosomes in the same sort order)...

dpryan79 commented 8 years ago

All chromosomes are in the index, regardless of whether they have an entry (the size is provided by the BAM file). Anything that produces lines with 0 in a bedGraph file will do so in a bigWig file as well.
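If a downstream binning step needs explicit values on every chromosome, the entries could be padded before writing. This is only a sketch of that idea and not a description of what deepTools does; the function name and tuple layout `(chrom, start, end, value)` are assumptions:

```python
def pad_missing_chromosomes(entries, chrom_sizes):
    """Return entries in header order, adding a zero interval for empty chromosomes."""
    by_chrom = {}
    for entry in entries:
        by_chrom.setdefault(entry[0], []).append(entry)
    padded = []
    for name, length in chrom_sizes:
        # A chromosome without entries gets a single interval covering it with 0.0.
        # Entries on chromosomes absent from chrom_sizes are dropped.
        padded.extend(by_chrom.get(name, [(name, 0, length, 0.0)]))
    return padded


chrom_sizes = [("chr1", 1000), ("chr2", 900), ("chrM", 16)]
entries = [("chr1", 0, 50, 2.0), ("chr1", 50, 100, 1.0)]
padded_entries = pad_missing_chromosomes(entries, chrom_sizes)
print(padded_entries)
```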

fidelram commented 8 years ago

The sort order for bigWig files was imposed by UCSC's bedGraphToBigWig. It has to be LC_ALL=C sort -k1,1 -k2,2n. Since we are no longer using bedGraphToBigWig, we can probably use the same sort order as the BAM file.
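Switching to the BAM order could look roughly like the following, assuming the entries fit in memory and every chromosome name appears in the header. The names used here are illustrative only:

```python
def sort_in_header_order(entries, header_names):
    """Sort (chrom, start, end, value) tuples by BAM header rank, then by start."""
    rank = {name: i for i, name in enumerate(header_names)}
    # A chromosome missing from the header raises KeyError rather than being
    # silently placed somewhere arbitrary.
    return sorted(entries, key=lambda e: (rank[e[0]], e[1]))


header_names = ["chr1", "chr2", "chr10"]  # hypothetical BAM header order
entries = [
    ("chr10", 0, 100, 0.5),
    ("chr2", 0, 50, 1.0),
    ("chr1", 50, 100, 2.0),
    ("chr1", 0, 50, 4.0),
]
ordered = sort_in_header_order(entries, header_names)
print(ordered)
```

For inputs too large to hold in memory, the same key could drive a merge of pre-sorted per-chromosome files instead of a single in-memory sort.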