tinyRNA provides an all-in-one solution for precision analysis of sRNA-seq data. At the core of tinyRNA is a highly flexible counting utility, tiny-count, that allows for hierarchical assignment of reads to features based on positional information, extent of feature overlap, 5’ nucleotide, length, and strandedness.
GNU General Public License v3.0
tiny-count: new edit pattern modes for the Mismatches selector #337
Rules that define a mismatch requirement can be extended to require a specific edit pattern. Two choices are available for this parameter:
ADAR: all mismatches must follow the A → I edit pattern, which is characteristic of the double-stranded RNA-specific adenosine deaminase (ADAR) enzyme family. Inosine is recognized as guanosine by reverse transcriptase and is therefore read as G when sequenced, so this pattern appears as A → G in sequencing data.
TUT: all mismatches must follow the N → U edit pattern at the 3' terminus which is characteristic of the Terminal Uridylyl Transferase (TUT) enzyme family. Mismatches must be consecutive from the 3' end. Reverse transcription prior to sequencing means this pattern is represented as N → T in sequencing data.
This option applies globally to all rules except those that lack a mismatch requirement. Rules without the requirement will continue to allow any number of mismatches following any edit pattern.
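As an illustration of the two patterns, here is a minimal sketch that checks them against an alignment's MD tag and read sequence. It is not the implementation in this PR: the function names, the `strand` parameter, and the handling of reverse-strand alignments (where A → G would appear as T → C, and a 3' terminal T as a leading A, in the SAM record) are assumptions made for the example. It also assumes an ungapped, unclipped alignment, and it treats an alignment with zero mismatches as passing, leaving the mismatch count to be checked separately.

```python
import re

# MD tag tokens: a run of matches, a deletion (^ACGT...), or one mismatched reference base
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")

def mismatches_from_md(md, seq):
    """Return (read_offset, ref_base, read_base) for each mismatch.

    Assumes an ungapped, unclipped alignment so that read offsets
    advance one-to-one with reference positions.
    """
    found = []
    pos = 0
    for run, deletion, ref_base in MD_TOKEN.findall(md):
        if run:
            pos += int(run)  # "0" is a valid run that separates adjacent mismatches
        elif deletion:
            raise ValueError("deletions are not handled in this sketch")
        else:
            found.append((pos, ref_base, seq[pos]))
            pos += 1
    return found

def is_adar(md, seq, strand="+"):
    """True if every mismatch is A -> G (T -> C on the reverse strand; an assumption)."""
    expected = ("A", "G") if strand == "+" else ("T", "C")
    return all((ref, read) == expected for _, ref, read in mismatches_from_md(md, seq))

def is_tut(md, seq, strand="+"):
    """True if every mismatch is N -> T and the mismatches run consecutively from the 3' end."""
    mm = mismatches_from_md(md, seq)
    offsets = [offset for offset, _, _ in mm]
    if strand == "+":
        terminal = list(range(len(seq) - len(mm), len(seq)))
        base = "T"
    else:
        # Assumption: reverse-strand reads are stored reverse-complemented,
        # so the 3' end is at the start of the stored sequence.
        terminal = list(range(len(mm)))
        base = "A"
    return offsets == terminal and all(read == base for _, _, read in mm)

# Reference AAAACCCC vs. read AGAACCCC: one A -> G mismatch at offset 1
print(is_adar("1A6", "AGAACCCC"))    # True
# Reference ACGTACGA vs. read ACGTACTT: two terminal N -> T mismatches
print(is_tut("6G0A0", "ACGTACTT"))   # True
print(is_adar("6G0A0", "ACGTACTT"))  # False
```

The MD tag supplies the reference base at each mismatch and the read sequence supplies the observed base, which is why both are needed to test a pattern.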
This PR contains the following additional changes:
Mismatches, for now, can no longer be calculated from the CIGAR string. All alignments must therefore report an NM tag, and a missing tag is treated as an error. The prior CIGAR method could not determine mismatches reliably because the M operator covers both sequence matches and mismatches.
Soft clipped bases are now honored and reflected in the sequence (and therefore 5' NT), start, and length. (Edit 6/10: after further discussion, there is more work to be done in reconciling inconsistencies across selectors when using this approach. This might be revisited in a future issue.)
Selectors that use the NumericalMatch class, including the Length and Mismatches selectors, are more forgiving when parsing and validating definitions. Leading and trailing whitespace is now permissible, as well as variable whitespace around commas and hyphens. Definitions composed entirely of whitespace will continue to fail validation.
Unmapped alignments are still skipped, but no longer by recursively calling _next_alignment() in the AlignmentIter class. Python's default maximum recursion depth is quite shallow (1000), so a SAM/BAM file with 1000+ consecutive unmapped alignments would have produced a RecursionError.
Alignments are rejected if they report incompatible CIGAR operators (soft clip, hard clip, skip, and pad). If one is found, the exception includes the offending record number and the file's basename.
Diagnostic alignment tables now include the MD string in the Mismatches column when a mismatch pattern is specified.
Minor optimization in AdarEditMatch to skip parsing the "0" character when it is used as a delimiter/flank of mismatch operations (i.e., when it doesn't represent a run of matches).
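Several of the changes above concern how alignments are read, and the following sketch shows how they might fit together. It is not the AlignmentIter code from this PR: `iter_alignments`, its parameters, the record numbering, and the error messages are hypothetical, and it reads plain SAM text lines so that the example needs no third-party library. It skips unmapped records with a loop rather than recursion, rejects soft clip, hard clip, skip, and pad operators, and requires an NM tag.

```python
import os
import re

INCOMPATIBLE_OPS = set("SHNP")  # soft clip, hard clip, skip, pad
CIGAR_OP = re.compile(r"\d+([MIDNSHP=X])")

def iter_alignments(lines, path):
    """Yield mapped alignments from SAM text lines (hypothetical helper).

    `path` is used only to report the file's basename in error messages.
    """
    basename = os.path.basename(path)
    record = 0
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        record += 1
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]

        # Unmapped (flag 0x4): move on to the next loop iteration.
        # A loop cannot exhaust the call stack, however many are in a row.
        if flag & 0x4:
            continue

        bad_ops = INCOMPATIBLE_OPS.intersection(CIGAR_OP.findall(cigar))
        if bad_ops:
            raise ValueError(
                f"{basename}: record {record} uses unsupported CIGAR "
                f"operator(s) {sorted(bad_ops)}"
            )

        # Optional fields have the form TAG:TYPE:VALUE
        nm = None
        for tag in fields[11:]:
            name, _type, value = tag.split(":", 2)
            if name == "NM":
                nm = int(value)
        if nm is None:
            raise ValueError(f"{basename}: record {record} is missing the NM tag")

        yield {"name": fields[0], "seq": fields[9], "cigar": cigar, "nm": nm}

# 1500 consecutive unmapped records followed by one mapped record
unmapped = "u\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
mapped = "r1\t0\tI\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\tMD:Z:1A2\n"
for aln in iter_alignments([unmapped] * 1500 + [mapped], "/data/sample.sam"):
    print(aln["name"], aln["nm"])  # r1 1
```

Because the unmapped check sits inside the loop, the number of consecutive unmapped records has no bearing on stack depth, which is the failure mode the recursive approach was exposed to.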