MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License
43 stars 19 forks source link

biobambam2 instead of picard-tools? #44

Open sameoldmike opened 2 years ago

sameoldmike commented 2 years ago

Is it feasible to switch to biobambam2 for the duplicate marking and BAM validation steps?

One of the main problems we have with large paleomix runs (e.g. 100 samples) is that it begins hanging and failing at these steps, especially the BAM validation steps. We have sort of hacked it to manually do the BAM validation steps using biobambam2. In my experience it is more reliable and MUCH faster -- 10-20X faster, and it never fails!

Just a suggestion for you Mikkel. Best wishes from Norway

MikkelSchubert commented 2 years ago

Hi Mike,

Thank you for the suggestion! I've actually been looking for alternatives to ValidateSamFile due to the poor performance, so I'll definitely take a look at biobambam2. However, at a glance it seems like it seems like bamvalidate performs a much more superficial validation (it only seems to check for straight up corrupt BAM files/records), so I'm not sure can be used is a full replacement for ValidateSamFile.

For duplicate marking I've switched to samtools markdup on 2.0-alpha version of paleomix, which was a pretty significant improvement, but note that this version is undergoing a lot of changes at the moment.

Cheers, Mikkel

sameoldmike commented 2 years ago

Regarding it being a superficial check, agreed that this is the case! So possibly not a great suggestion, but for practical reasons we often have to use it as an alternative to picard-tools.

By the way, on large HPC environments (e.g. computerome2 or similar), our solution for paleomix has been to run each sample in its own yaml file on its own compute node. Then we don't get all the failing BAM validation and markduplicates steps -- I still feel it's related to all the open files, but we can't seem to solve this issue when running >100 large genome samples in a single yaml on our smaller local cluster.

MikkelSchubert commented 2 years ago

In that case you could probably just disable validation outright, since the tools used in the pipeline are unlikely to generate straight up invalid files. The easiest way to do that is probably to add a return node (4 spaces indentation) at https://github.com/MikkelSchubert/paleomix/blob/0fafea3841683e1a1bcb83851392c28c1ce4ab70/paleomix/pipelines/bam/nodes.py#L34

The problem with ValidateSamFile is most likely that it creates/opens up to 8,000 files by default, one per sequence in your target genome, which it uses to track mate information (it's a really poor implementation!). And I believe that picard MarkDuplicates does something similar to track read pairs. So the problem is probably way too many open file handles.

I'll get back to you once I've figured out how to handle this. The solution might just be to implement some minimal checks myself (I already check for some things not handled by ValidateSamFile) and then just drop ValidateSamFile entirely. The tools used by paleomix are quite mature at this point, so there is less of a need to check everything all the time.

MikkelSchubert commented 2 years ago

After some consideration, I've decided to stop running ValidateSamFile as part of the pipeline going forward. The overhead is too large and it is very rare that problems are detected due to the maturity of the tools used in the rest of the pipeline.

I am not going to remove picard ValidateSamFile or swap out picard MarkDuplicates in the 1.3.x branch, since that is a significant change in methodology, but if it makes your lives easier then I can add an option to skip the validation checks while running the pipeline.

However, as I mentioned before, the 2.0-alpha release already uses samtools markdup, so you could try that version (the master branch). Just note that this branch is under active development, that the coverage/depths/coverage reports are currently disabled (you can still run the latter to by hand), and that I am planning on making more methodological changes in connection with the next major release of AdapterRemoval that I am working on (`--collapse-conservatively´ will become the standard[1] and the collapsed truncated will be written to the same file as full collapsed reads). The YAML file layout is also changing, though it should be pretty easy to convert existing YAML files to the new layout if you compare with the template that the pipeline generates.

[1] Motivated by the poor performance a merging algorithm similar to the current default in AdapterRemoval in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2579-2

sameoldmike commented 2 years ago

Thank you Mikkel!

sameoldmike commented 2 years ago

Just to be clear, have you already added the option to skip the BAM validation checks? Or is that going to be in a future release?

MikkelSchubert commented 2 years ago

I haven't added it yet.

I'll try to make a point release (v1.3.5) before the end of the week, that adds a "Validation" feature to the YAML file. Or maybe it'd be more useful with a command-line option? E.g. --validation off and --validation strict. What do you think?

sameoldmike commented 2 years ago

Command-line option sounds great if that makes sense for implementation. We would use it every time!

Thanks so much Mikkel

On Tue, Oct 12, 2021 at 8:21 PM Mikkel Schubert @.***> wrote:

I haven't added it yet.

I'll try to make a point release (v1.3.5) before the end of the week, that adds a "Validation" feature to the YAML file. Or maybe it'd be more useful with a command-line option? E.g. --validation off and --validation strict. What do you think?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikkelSchubert/paleomix/issues/44#issuecomment-941263023, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTSQC6JZ6JLBQN7H6BOVRLUGR4BZANCNFSM5FB2OK3Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Mike D. Martin Associate professor, Holomuseomics research group https://www.ntnu.edu/museum/holomuseomics Department of Natural History Norwegian University of Science and Technology (NTNU) Google scholar https://scholar.google.com/citations?user=WosqmUMAAAAJ&hl=en DarwinPlants https://darwin-plants.com/

MikkelSchubert commented 2 years ago

I couldn't think of anything else that needed to be fixed, so v1.3.5 is available now:

# Validate everything with ValidateSamFile; this is the default behavior
$ paleomix bam run --validation full project.yaml

# Validate only the final BAM file
$ paleomix bam run --validation partial project.yaml

# Validate nothing
$ paleomix bam run --validation off project.yaml
sameoldmike commented 2 years ago

Cool, thanks so much Mikkel. I've got it running, and it seems to be working beautifully.

Mike

On Tue, Oct 12, 2021 at 9:20 PM Mikkel Schubert @.***> wrote:

I couldn't think of anything else that needed to be fixed, so v1.3.5 is available now:

Validate everything with ValidateSamFile; this is the default behavior

$ paleomix bam run --validation full project.yaml

Validate only the final BAM file

$ paleomix bam run --validation partial project.yaml

Validate nothing

$ paleomix bam run --validation off project.yaml

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikkelSchubert/paleomix/issues/44#issuecomment-941334137, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTSQC6VYV2DDHYZYZDJGITUGSC57ANCNFSM5FB2OK3Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.