broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
963 stars 367 forks source link

Enable RevertSam ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false #300

Open ktibbett opened 8 years ago

ktibbett commented 8 years ago

Currently the ATTRIBUTE_TO_CLEAR tag is only addressed when alignment information is being removed, so it's not possible to remove attributes when you are not removing alignment information. I would like to be able to do this and propose modifying RevertSam so that the code that executes attribute-clearing is outside the branch that is conditionally executed when REMOVE_ALIGNMENT_INFORMATION=true. Objections?

tfenne commented 8 years ago

I think this sounds fine - my only suggestion would be that you create two sets of defaults for the ATTRIBUTE_TO_CLEAR argument, and switch between them based on whether or not REMOVE_ALIGNMENT_INFORMATION is true, since the default list are all alignment-related tags currently and removing those while not removing alignment information would likely make a BAM useless.

eitanbanks commented 8 years ago

Hey @tfenne this is important so I'd like to circle back on that last statement you made. Why would it render a BAM useless if we removed the alignment-related tags while not removing alignment information? I don't know of anyone doing tertiary analysis (associations, e.g. the ATGU crowd) who use any of those tags at all - but they definitely use the alignment information. The GATK (and likely most other equivalent tools) does not use them for variant calling either.

Mind you, I'm not arguing that the tags are useless. What I'm saying is that there are certainly many cases where we could want an aligned BAM without any of these tags. It makes a big difference because they actually have a significant impact on file size, so that adds up when one needs to think about long term storage of > 100K whole genomes. If we can strip out the tags after the point where they're needed, that could save us a large amount of money (which could be spent on sequencing extra samples).

Since you're a thought leader in this area, I wanted to get your +1 on those thoughts.

tfenne commented 8 years ago

I see where you're coming from. I certainly have use-cases where having BAMs with alignment information but without key alignment-related attributes would be troublesome. And I believe GenomeSTRiP and potentially other callers rely on some of those tags (e.g. SA). I'm not arguing that in all pipelines one should keep them, just that the defaults for RevertSam should probably be either you remove all the alignment information or none of it, but not one half of it.

eitanbanks commented 8 years ago

Okay, works for me. Thank you!

sooheelee commented 8 years ago

Can you @tfenne @eitanbanks help me understand the new proposed features?

(1) What happens to the following with the new changes to RevertSam?

We document a case of REMOVE_DUPLICATE_INFORMATION==true and REMOVE_ALIGNMENT_INFORMATION==false. In this instance, the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates. I assume here the flag values for duplicate reads are retained but all the alignment information is removed. How is this useful?

(2) How can users clear extraneous tags that interfere with GATK tools but retain alignment information, e.g. BWA MEM alignment information?

I read the tool document as when REMOVE_ALIGNMENT_INFORMATION=false, then ATTRIBUTE_TO_CLEAR is disabled. Am I misunderstanding this and can I still specify tags to clear at this stage?

REMOVE_ALIGNMENT_INFORMATION=true, then all alignment fields (columns POS, MAPQ, CIGAR, PNEXT), all tags except RG, and all flags are cleared to unaligned state.

eitanbanks commented 8 years ago

1) It may or may not be useful. But it's possible to create such a BAM if a user ever wanted it. 2) What we're saying is that with the proposed you will be able to set REMOVE_ALIGNMENT_INFORMATION=false but not have ATTRIBUTE_TO_CLEAR be disabled. And if REMOVE_ALIGNMENT_INFORMATION=true you don't necessarily have to remove all alignment tags. The user now has flexibility to choose.

sooheelee commented 8 years ago

I'm looking forward to these changes that allow for flexibility. I would use these new parameters today if they were available so I'm keen to hear when they are instituted.

sooheelee commented 8 years ago

@eitanbanks @tfenne Hey has anything been done with this, i.e. is there a pull request associated with this discussion? I'm asking because I'll be publishing a new article on RevertSam and want to make sure I'm up to date. Thanks.

eitanbanks commented 8 years ago

Yes, I believe @ktibbett merged this a while back, but will let her confirm

ktibbett commented 8 years ago

No, I did not. We didn't need it in the short term, so it's still low down on the priority list.

sooheelee commented 8 years ago

Ok thanks. I'll let sleeping dogs sleep then.