OceanGenomics / mudskipper

A tool for projecting genomic alignments to transcriptomic coordinates
BSD 3-Clause "New" or "Revised" License
33 stars 7 forks source link

Shuffle #22

Closed sjbaker47 closed 2 years ago

sjbaker47 commented 2 years ago

There were quite a few changes to the codebase since I've worked on the shuffle-subcommand branch, and git rebase was being weird, so I just made this to replace #11. This new branch includes fixes for:

This takes about 10 minutes to read and shuffle a 1.5 GB BAM file. I have yet to implement streaming reads into the translation functions, but I can after we're absolutely sure this works properly. So far, I've both examined the output file manually, and I've run a position and non-position sorted BAM file with equivalent records through the shuffle command, and the output looks the same. I'm curious if there's any other way of testing that anyone can think of? Please let me know.

haghshenas commented 2 years ago

Thank you Shawn. This definitely improves on the previous logic for sorting. Shuffling seems to be working properly based on my tests too. I have a few comments for now: 1) Using a fixed name for the bucket folder looks a bit dangerous. I think it would be better to tie the folder name to the output BAM filename somehow to avoid cases that two simultaneous runs of mudskipper are being called within the same working directory. 2) The command line argument --max_mem_mb does not correctly translate to the real memory usage at all. I think this is mainly because you are looking at the number of bytes of the BAM file which is a compressed file, but when reading the records they are loaded as uncompressed records. 3) Related to to item 2, if a large memory is being requested by --max_mem_mb, then the whole BAM file fits in one bucket. In this case, there is no need to waste time on bucketifying step, right? 4) In my experience, rust-htslib doesn't take advantage of multiple threads very well when reading records. One idea could be to use multiple threads during sorting... for example using this: https://docs.rs/rayon/1.5.1/rayon/slice/trait.ParallelSliceMut.html#method.par_sort_unstable

sjbaker47 commented 2 years ago

As a heads up, I've added better temporary file logic (through a crate), and I've done some more performance optimization. Frankly, the biggest speedup has been to just compile it in release mode. Now, if we don't bucket a 1.5 GB file, it's done in about 2:20 minutes, and if we do bucket it, it's about 3:50 minutes. Setting threaded writing on the bam::Writer type doesn't seem to make a difference.

I'm not sure where else to optimize here. Sorting isn't much of a drag, since it doesn't seem much longer than just hashing the name and writing it to a bucket. I suspect most of the overhead is in reading BAM records through htslib, but I'm not sure of how to make that any faster. (I know samtools can dump my test file in ~45s, so there must be some overhead somewhere, even if it's not too significant).

Thanks for all your comments so far. Let me know if you all have anything else!

gmarcais commented 2 years ago

I think at this point we should merge this PR. We can talk with Rob to see if he has any suggestion regarding htslib.

rob-p commented 2 years ago

Hi @sjbaker47 @gmarcais — I'd like to merge this PR. The one thing that we should do first is the concordance testing we discussed. That is, what is the concordance between quantification with salmon with and without the shuffle feature. One way to do this would be to do one run converting with mudskipper from the unsorted STAR bam and then another where we ask STAR to sort them bam and then run through mudskipper+shuffle. How different are the two quantifications? Can we pin the differences down to observation order of alignments? If they look very similar, we can call this initial implementation of the feature complete and merge this PR.

sjbaker47 commented 2 years ago

Working on this now. I notice that we use samtools sort on the unsorted BAM output of STAR. This is despite that STAR can output both sorted and unsorted BAM files in a single invocation using the --outSAMtype BAM Unsorted SortedByCoordinate flag. The documentation for samtools and this biostars thread both seem to imply that it's doing the same position-sort that STAR would. I'm wondering if any of you (@rob-p @gmarcais or anyone else on this PR) are familiar with differences between the two, or if there's some reason we spend the extra time waiting for samtools to complete when we run the pipeline.

If none of you think there's a difference, I'll update the pipeline so that we don't run the extra samtools sort step, and I'll do the analysis using data output from either program.

rob-p commented 2 years ago

Any updates on this concordance testing? I think it's the only thing blocking approval and merge.