broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

create cigar-based read filter #588

Open akiezun opened 9 years ago

akiezun commented 9 years ago

Feature request from @vdauwera (from https://github.com/broadinstitute/hellbender/issues/429): "Feature request: add ability to recognize a cigar pattern (to e.g. select reads with insertions > 10 bases, or reads with soft-clips, etc)."

@vdauwera please write an example command line you'd like to be able to write (or a list of all patterns you want to be able to filter). Assign to me when done.
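
To make the two requested cases concrete, here is a minimal sketch in plain Python. It assumes CIGAR strings in standard SAM text form (e.g. `5S95M`). The function names are hypothetical and are not GATK or htsjdk API; this only illustrates the kind of predicate such a filter would evaluate per read.

```python
import re

# A valid CIGAR is one or more <length><operator> elements (SAM operators).
CIGAR_STRING = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5S95M' into (length, operator) pairs."""
    if cigar == "*":  # SAM convention for "CIGAR unavailable"
        return []
    if not CIGAR_STRING.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]


def has_insertion_longer_than(cigar, threshold=10):
    """True if any single insertion element is longer than `threshold` bases."""
    return any(op == "I" and n > threshold for n, op in parse_cigar(cigar))


def has_soft_clip(cigar):
    """True if the read has at least one soft-clipped segment."""
    return any(op == "S" for _, op in parse_cigar(cigar))
```

For example, `has_insertion_longer_than("50M11I39M")` is true while `has_insertion_longer_than("50M10I40M")` is not, since the request is for insertions strictly greater than 10 bases.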

vdauwera commented 9 years ago

Ticket in gsa-unstable: https://github.com/broadinstitute/gsa-unstable/issues/832

If it gets implemented there we'll be sure to fwd-port to Hellbender as well.

droazen commented 7 years ago

Re-assigning to @jonn-smith, as this might be a fun one.

jonn-smith commented 7 years ago

@vdauwera can you provide some more examples of what kinds of cases you'd like to have handled?

vdauwera commented 7 years ago

There were some good basic examples in the original ticket, e.g. selecting reads with insertions > 10 bases, or reads with soft-clips.

Those would be the basic must-haves.

Then the next step of nice-to-haves would be to be able to find specific patterns like "D followed by I" or specific numbers of operators like "exactly five D in a row" or "five D in total, not necessarily in consecutive order".

Do you need me to be more specific than that?
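
One possible reading of these nice-to-haves, sketched in plain Python: expand the CIGAR into one character per base so that ordinary regular expressions can express "D followed by I" or "exactly five D in a row". The expansion approach, the function names, and the interpretation of "in a row" as a run of deleted bases are all assumptions made for illustration, not something decided in this thread.

```python
import re

CIGAR_STRING = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def expand_cigar(cigar):
    """Expand '3M2D1I' to 'MMMDDI' (one operator character per base)."""
    if not CIGAR_STRING.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return "".join(op * int(n) for n, op in CIGAR_ELEMENT.findall(cigar))


def deletion_followed_by_insertion(cigar):
    """True if a deletion is immediately followed by an insertion."""
    return re.search(r"DI", expand_cigar(cigar)) is not None


def has_deletion_run_of_exactly(cigar, length=5):
    """True if some run of consecutive D is exactly `length` bases long."""
    # Lookarounds ensure the run is not part of a longer stretch of D.
    pattern = rf"(?<!D)D{{{length}}}(?!D)"
    return re.search(pattern, expand_cigar(cigar)) is not None


def total_deleted_bases(cigar):
    """Total number of deleted bases, consecutive or not."""
    return expand_cigar(cigar).count("D")
```

With this sketch, `10M5D10M` has a run of exactly five D, whereas `10M2D5M3D10M` does not but still has five D in total. Expanding long reads this way trades memory for simplicity, so a real implementation might prefer to work on the element list directly.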

jonn-smith commented 7 years ago

@vdauwera - I think that makes sense. We've been brainstorming ideas for how a user would actually input the filter strings and there seem to be a few options.

We can also implement some combination of these. What do you think?

vdauwera commented 7 years ago

I like the idea of the modified regexes, that seems like the best balance of usability and flexibility/power. I'd rather avoid having a slew of new special-cased arguments.