Question regarding BAM header in pipeline output

alexpreynolds commented 4 months ago

Overview: Over the last several months I have been developing a web-based tool for exploring and analysing fiber-seq reads that come out of the fibertools pipeline. The application uses several libraries to read and render the data, including @gmod/bam (bam-js). Before it can query for fibers, @gmod/bam must read the header of an indexed BAM file into memory.

Problem: Some of the headers I encounter in fiber-seq BAMs are roughly 50 to 100 MB in size. The browser hangs while downloading the full header and, depending on how many tracks are being rendered, it can crash altogether. Apart from the @HD and @SQ lines, none of the metadata in the header appears to be needed for rendering.

Question: For internal use, I have post-processed reads by reheadering the BAMs down to only what is needed for an indexed query. There may be other criteria for filtering reads (e.g., haplotype), but my question is this: to make it easier for the wider public to import their own pipeline result files into this tool, how feasible would it be to add a runtime flag to ft that writes BAM files containing only a minimal header sufficient for indexed queries?
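For illustration, the reheadering step described above could look something like the following. This is a minimal sketch, not the script actually used in the thread: the function name `minimal_header` is hypothetical, and it assumes the header has already been dumped to text (for example with samtools view -H) and that only the @HD and @SQ lines are needed.

```python
def minimal_header(header_text: str) -> str:
    """Return SAM header text reduced to its @HD and @SQ lines.

    Assumption: these two record types are all that an indexed query
    needs; @RG, @PG and @CO lines are dropped.
    """
    kept = [
        line
        for line in header_text.splitlines()
        # Match the record type followed by a tab so that other
        # record types can never be picked up by accident.
        if line.startswith(("@HD\t", "@SQ\t"))
    ]
    # SAM header text ends with a newline when it is non-empty.
    return "\n".join(kept) + "\n" if kept else ""
```

The filtered text could then be written to a file and applied with samtools reheader (for example, samtools reheader minimal.sam in.bam > out.bam, where the file names are placeholders), after which the new BAM would need to be indexed again. Note that dropping @RG lines leaves any RG tags on the reads pointing at read groups that are no longer declared, which some downstream tools may object to.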

mrvollger commented 4 months ago

I have certainly run into headers with ~1000 lines, but I have never seen anything approaching 50-100 MB.

Do you have an example? And do you know what upstream tools are adding these tags?

alexpreynolds commented 4 months ago

I'll see if I can get permission to share an example file, but if it is another upstream tool adding these tags (apologies, if so), I will likely just close this issue.

mrvollger commented 4 months ago

Just the header (samtools view -H), with no real data, would be fine if that helps.

I guess it could be ft but ft "should" only add one line to the header per command.
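To check which record types, and which programs, account for most of a large header, one option is to tally the header text by record type. The sketch below is only an illustration: `header_breakdown` is a hypothetical helper, it works on the text produced by samtools view -H, and it assumes that the PN (or, failing that, ID) field of each @PG line identifies the tool that wrote it.

```python
from collections import Counter


def header_breakdown(header_text: str) -> dict:
    """Count lines and bytes per SAM header record type, and @PG lines per program."""
    line_counts = Counter()
    byte_counts = Counter()
    programs = Counter()
    for line in header_text.splitlines():
        if not line.startswith("@"):
            continue  # not a header line
        fields = line.split("\t")
        record = fields[0]  # e.g. "@HD", "@SQ", "@RG", "@PG", "@CO"
        line_counts[record] += 1
        # +1 accounts for the newline removed by splitlines().
        byte_counts[record] += len(line.encode("utf-8")) + 1
        if record == "@PG":
            # Header fields have the form TAG:value.
            tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
            programs[tags.get("PN", tags.get("ID", "?"))] += 1
    return {
        "lines": dict(line_counts),
        "bytes": dict(byte_counts),
        "programs": dict(programs),
    }
```

If ft adds one @PG line per command, as expected, its entry under "programs" should stay small, and a single record type or program dominating the byte counts would point to the step that inflates the header.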

alexpreynolds commented 4 months ago

Here is one example, about 80 MB (uncompressed):

https://resources.altius.org/~areynolds/public/d2_stim_sequel.fire.HAP2.header.txt.gz

Digging into this more, this is almost certainly coming from somewhere else in the pipeline.

alexpreynolds commented 4 months ago

I apologize — this data is coming from several steps outside of ft. I'm going to close this up.

mrvollger commented 4 months ago

Good to hear @alexpreynolds. Hopefully it isn't too bad to modify the other outputs.

I am very curious about the website/tool you are making. Please let me know if you are ever in a position to share it!

Cheers, Mitchell

fiberseq / fibertools-rs, issue #55