BioJulia / BioAlignments.jl

Sequence alignment tools
MIT License
60 stars 24 forks source link

Add support for 'meta' alignment operations (`OP_PAD` and `OP_HARD_CLIP`) #64

Closed MillironX closed 1 year ago

MillironX commented 2 years ago

Types of changes

This PR implements the following changes:

:clipboard: Additional detail

Adds support for the OP_PAD operation, and fixes the implementation of the OP_HARD_CLIP operation. I have dubbed these two operations 'meta' operations, as they contain information about the origin and context of the alignment, but have no bearing on the content (i.e. reference or query) of the alignment.

New features

  1. Added function ismetaop, which returns true if passed OP_PAD or OP_HARD_CLIP
  2. Creating an alignment with a OP_PAD operation no longer throws an error
  3. Added validity check for OP_HARD_CLIP or OP_SOFT_CLIP in the wrong place According to the SAM specification, these have to be present at the end of an alignment, with soft clips optionally buffering hard clips from the rest of the alignment

Changed functionality

  1. cigar(::Alignment) now uses the AlignmentAnchor.alnpos to print operation length
  2. OP_HARD_CLIP is no longer considered an insert operation, and isinsertop(OP_HARD_CLIP) now returns false
  3. The help text for OP_PAD indicates it's supported now
  4. Trying to create an alignment with a hard or soft clip in an improper place will throw an error (I know I said this above, but it's probably the most likely addition to break something, even though it does conform to spec)

Justification

BioAlignments should be able to handle all of the operations defined by the SAM specification. As it currently is, BioAlignments is unable even to parse the alignments in Section 1.1 "An example" of the specification. While the work on clips may seem out-of-scope for a change adding an operation support, the way clips work and pads work are very similar and needed to be considered together.

Example

Using query sequence "r002" from Section 1.1 of the SAM specification:

BioAlignments 2.0.0

julia> using BioAlignments

julia> Alignment("3S6M1P1I4M", 1, 9)
ERROR: The P CIGAR operation is not yet supported.
Stacktrace:
 [1] Alignment(cigar::String, seqpos::Int64, refpos::Int64)
...

This PR

julia> using BioAlignments

julia> Alignment("3S6M1P1I4M", 1, 9)
Alignment:
  aligned range:
    seq: 0-14
    ref: 8-18
  CIGAR string: 3S6M1P1I4M

My testing also indicates that this fixes #56,

BioAlignments 2.0.0

julia> using BioAlignments, BioSequences

julia> aligned_sequence = AlignedSequence(dna"AGAAGTTTATCTGTGTGAACTTCTTGGCTTAGTATCGTTGAGAAGAATCGAGAGATTAGTGCAGTTTAAA",
       Alignment("75H25M75H", 1, 1))
---------------------------------------------------------------------------·························---------------------------------------------------------------------------
AGAAGTTTATCTGTGTGAACTTCTTGGCTTAGTATCGTTGAGAAGAATCGAGAGATTAGTGCAGTTTAAAError showing value of type AlignedSequence{LongDNASeq}:
ERROR: BoundsError: attempt to access LongDNASeq at index [71]
Stacktrace:
...

This PR

julia> using BioAlignments, BioSequences

julia> aligned_sequence = AlignedSequence(dna"AGAAGTTTATCTGTGTGAACTTCTTGGCTTAGTATCGTTGAGAAGAATCGAGAGATTAGTGCAGTTTAAA",
              Alignment("75H25M75H", 1, 1))
·························
AGAAGTTTATCTGTGTGAACTTCTT

:ballot_box_with_check: Checklist

MillironX commented 2 years ago

@jakobnissen, @CiaranOMara, could I get a review from one of you?

codecov[bot] commented 2 years ago

Codecov Report

Base: 87.60% // Head: 88.20% // Increases project coverage by +0.59% :tada:

Coverage data is based on head (96f8ee7) compared to base (b92ccf3). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #64 +/- ## ========================================== + Coverage 87.60% 88.20% +0.59% ========================================== Files 16 16 Lines 1138 1178 +40 ========================================== + Hits 997 1039 +42 + Misses 141 139 -2 ``` | [Impacted Files](https://codecov.io/gh/BioJulia/BioAlignments.jl/pull/64?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia) | Coverage Δ | | |---|---|---| | [src/alignment.jl](https://codecov.io/gh/BioJulia/BioAlignments.jl/pull/64/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia#diff-c3JjL2FsaWdubWVudC5qbA==) | `84.66% <100.00%> (+4.19%)` | :arrow_up: | | [src/operations.jl](https://codecov.io/gh/BioJulia/BioAlignments.jl/pull/64/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia#diff-c3JjL29wZXJhdGlvbnMuamw=) | `88.23% <100.00%> (+0.73%)` | :arrow_up: | | [src/pairwise/alignment.jl](https://codecov.io/gh/BioJulia/BioAlignments.jl/pull/64/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia#diff-c3JjL3BhaXJ3aXNlL2FsaWdubWVudC5qbA==) | `96.99% <100.00%> (+0.41%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=BioJulia)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

jakobnissen commented 2 years ago

LGTM. Thanks for putting in the work. W.r.t #44 - we should roll back the change, then merge #72, then we can apply this. This debacle all suggests to me that moving forward, we need to be more careful about what is API and what is internals, but let's have that change later.

MillironX commented 2 years ago

I agree with @jakobnissen, we should postpone merging this until #44 and #72 are both in master, rather than reverting 349e204.

MillironX commented 1 year ago

Failed downstream is expected: this is a breaking change due to the inclusion of #44.