Open ChrisHIV opened 6 months ago
Looks like those options are specific to flow-based reads, so they will not be used at all unless you also set --FLOW_MODE true in your command(s).
I've tried this with the example alignments you've provided and while it did run successfully, I'd recommend extra caution if you ever decide to use these options with non-flow-based single-end reads, since I'd assume they were not intended for such use.
Thanks. Indeed, using --FLOW_MODE on that test data, --USE_END_IN_UNPAIRED_READS changes which reads are duplicates. However, --USE_UNPAIRED_CLIPPED_END still has no effect, and it's unclear to me why: the description suggests that it toggles whether clipped ends are included or excluded, which I think should affect where reads are deemed to start and end, and thus affect which are considered duplicates.
Maybe it would be helpful to add to the two help messages that --FLOW_MODE must be used? It would also be good to make the help clearer about which of true/false corresponds to inclusion/exclusion of clipped ends for --USE_UNPAIRED_CLIPPED_END.
I agree with adding clarification(s) to the documentation. I'll try getting more details on whether these options are safe to be used with non-flow single-end reads before closing the issue.
Perhaps there should be some error if those options are used without --FLOW_MODE turned on?
Absolutely, if these are truly only meant to be used with flow-based reads.
@meganshand @ilyasoifer tagging for opinions on this (and potentially similar issues with other options in this/other tools)
@kockan - thanks! We will discuss and propose how to best deal with this.
@meganshand and @kockan, @ChrisHIV - unless you see a clear use case for single end reads (that are not of a constant length in bases) we will update the help string as suggested by the issue to indicate that it should only be used for flow reads
@ilyasoifer Sounds reasonable to me. One small additional request: if these options are set in the command-line arguments without --FLOW_MODE, we should throw an exception (edit: decided an exception would be better than a warning after some initial thought).
https://github.com/broadinstitute/picard/pull/1976 was merged, which adds a FLOW_ prefix to these options, but to close out this ticket we also need a check that flow mode is active when these options are specified.
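For readers following along, the kind of check being requested might look roughly like the sketch below. It is not the code from either pull request: the class `FlowArgsCheck` and its methods are hypothetical, the option names are the pre-rename ones used in this issue, and it assumes both options default to false so that a true value means the user set them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Picard.
final class FlowArgsCheck {

    /**
     * Returns one message per flow-only option that is set while flow mode is off.
     * An empty list means the combination of arguments is acceptable.
     * Assumption: both options default to false, so "true" means "set by the user".
     */
    static List<String> validate(boolean flowMode,
                                 boolean useEndInUnpairedReads,
                                 boolean useUnpairedClippedEnd) {
        List<String> errors = new ArrayList<>();
        if (!flowMode) {
            if (useEndInUnpairedReads) {
                errors.add("USE_END_IN_UNPAIRED_READS requires FLOW_MODE=true");
            }
            if (useUnpairedClippedEnd) {
                errors.add("USE_UNPAIRED_CLIPPED_END requires FLOW_MODE=true");
            }
        }
        return errors;
    }

    /** Throws instead of warning, matching the preference stated in the thread. */
    static void requireFlowMode(boolean flowMode,
                                boolean useEndInUnpairedReads,
                                boolean useUnpairedClippedEnd) {
        List<String> errors = validate(flowMode, useEndInUnpairedReads, useUnpairedClippedEnd);
        if (!errors.isEmpty()) {
            throw new IllegalArgumentException(String.join("; ", errors));
        }
    }
}
```

Returning a list of messages keeps the check easy to test and lets the caller decide whether to throw or to report every problem at once.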
@dror27 - can you please address the last comment so we can close this?
Pull request created with requested change: https://github.com/broadinstitute/picard/pull/1980
Bug Report
Affected tool(s)
picard MarkDuplicates with the --USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options
Affected version(s)
Latest public release version [3.1.1]
Description
The --USE_END_IN_UNPAIRED_READS and --USE_UNPAIRED_CLIPPED_END options have no effect in minimal test data. From my understanding of the help messages for these options (reproduced at the bottom of this message), the former should toggle whether or not we consider unpaired reads to be duplicates if they have the same start position but different end positions, and the latter should toggle whether clipped ends of unpaired reads are included or excluded when determining duplicates (I do not understand whether inclusion/exclusion corresponds to true/false for this bool or vice versa).
Steps to reproduce
The attached reads_sam.txt has 8 reads mapped to an 8-bp reference genome, attached as reference_fasta.txt. (Both files have had their extensions changed to .txt to allow attachment.) The attached image shows the reads for convenience. These reads all have the central 6bp mapped, but they vary in whether there is an additional base at one end or the other and whether that base is mapped or clipped. After renaming reads_sam.txt to reads_sam.sam to clarify the format, run e.g.
Expected behavior
The four output sam files corresponding to the four combinations of these two binary flags should vary in which subset of reads are included after removing duplicates, because the reads vary in their potential to be considered duplicates based on the description of the flags.
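To make that expectation concrete, here is a small sketch of the reading of the help text described above. It is an illustration of that interpretation only, not Picard's implementation: `DuplicateKeySketch`, `Read`, and the parameter names `useEnd` and `includeClippedBases` are hypothetical, strand is ignored, and the parameter is named neutrally because which boolean value means "include clipped bases" is exactly what is unclear.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the interpretation in this report; not Picard code.
final class DuplicateKeySketch {

    /** Single-end read: 1-based inclusive aligned span plus soft-clipped base counts. */
    static final class Read {
        final int alignedStart;
        final int alignedEnd;
        final int leftClip;
        final int rightClip;

        Read(int alignedStart, int alignedEnd, int leftClip, int rightClip) {
            this.alignedStart = alignedStart;
            this.alignedEnd = alignedEnd;
            this.leftClip = leftClip;
            this.rightClip = rightClip;
        }
    }

    /** Grouping key: reads with equal keys would be treated as duplicates of each other. */
    static List<Integer> key(Read r, boolean useEnd, boolean includeClippedBases) {
        int start = includeClippedBases ? r.alignedStart - r.leftClip : r.alignedStart;
        int end = includeClippedBases ? r.alignedEnd + r.rightClip : r.alignedEnd;
        // When the end is not used, store a sentinel so only the start distinguishes reads.
        return Arrays.asList(start, useEnd ? end : -1);
    }

    static boolean sameGroup(Read a, Read b, boolean useEnd, boolean includeClippedBases) {
        return key(a, useEnd, includeClippedBases).equals(key(b, useEnd, includeClippedBases));
    }
}
```

Under this model, a read aligned at 2-7 and a read aligned at 2-7 with one clipped base on the right fall in the same group when clipped bases are excluded and in different groups when the end is used and clipped bases are included, which is why the four flag combinations would be expected to give different outputs.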
Actual behavior
The four output sam files are identical in their read content (containing reads 1 and 6).
Additional comments
I tried to follow the recommendation to first post on the forum, but clicking on the 'Sign in' tab takes me to this page where I cannot see any option to sign in or create a new profile.
And here is the help for those two options, for convenience: