BioJulia / BioAlignments.jl

Sequence alignment tools
MIT License

Multiple alignment manipulation #51

Open BioTurboNick opened 3 years ago

BioTurboNick commented 3 years ago

We need the ability to manipulate a multiple sequence alignment (MSA), and this package seems like the right place to put it.

I started working on this but it may need more thought about how it will play nice with the pairwise alignment-oriented code.

kescobo commented 3 years ago

I haven't worked much with the AlignedSequence type, but it seems like there could be an AbstractAlignedSequence and a MultiAlignedSequences <: AbstractAlignedSequence. This might be another place where explicitly defining and documenting the expected API a la https://github.com/BioJulia/BioSequences.jl/issues/140 would be useful.

I wonder if an MSA could be represented by a vector or Tuple of AlignedSequence though.
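To make both ideas concrete, here is a rough sketch in plain Julia. It does not use or mirror the actual BioAlignments.jl types: `GappedRow`, `nrows`, `ncolumns`, and `column` are made-up names, rows are plain ASCII strings with `-` for gaps, and `AbstractAlignedSequence` / `MultiAlignedSequences` only borrow the names proposed above.

```julia
# Hypothetical names throughout; none of this is existing BioAlignments.jl API.
abstract type AbstractAlignedSequence end

# One gapped row of an alignment; '-' marks a gap. Assumes ASCII sequences.
struct GappedRow <: AbstractAlignedSequence
    name::String
    seq::String
end

# An MSA as a plain vector of rows, all forced to the same aligned length.
struct MultiAlignedSequences <: AbstractAlignedSequence
    rows::Vector{GappedRow}
    function MultiAlignedSequences(rows::Vector{GappedRow})
        isempty(rows) && throw(ArgumentError("an MSA needs at least one row"))
        len = length(rows[1].seq)
        all(r -> length(r.seq) == len, rows) ||
            throw(ArgumentError("all rows must have the same aligned length"))
        new(rows)
    end
end

nrows(msa::MultiAlignedSequences) = length(msa.rows)
ncolumns(msa::MultiAlignedSequences) = length(msa.rows[1].seq)

# Column i as a vector of characters, one per row.
column(msa::MultiAlignedSequences, i::Integer) = [r.seq[i] for r in msa.rows]
```

A vector of rows is enough for row and column access; the wrapper type mostly buys the equal-length check and a place to hang a documented API.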

BioTurboNick commented 3 years ago

Good ideas. It could be. I'm wondering though about the strong assumption in AlignedSequence that a sequence is aligned to a single known reference. That makes a lot of sense for aligning sequencing reads to a reference genome. Not as much if you're aligning orthologs.

Maybe AlignedSequence could be extended with a single-sequence constructor that assumes an implicit reference: one that matches at every position, so that all gaps are treated as deletions against it.
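As a rough illustration of what such a constructor might compute, the sketch below turns a gapped row into run-length-encoded operations against an implied reference that spans every column. `implied_ops` and `ungap` are hypothetical helpers, and the `'M'`/`'D'` labels are just stand-ins for match and deletion; this is not how BioAlignments.jl stores alignments.

```julia
# Hypothetical helpers; not part of BioAlignments.jl.
# Run-length encode a gapped row against an implied reference that spans
# every column: residues become matches ('M'), gaps become deletions ('D').
function implied_ops(row::AbstractString)
    ops = Tuple{Char,Int}[]
    for c in row
        op = c == '-' ? 'D' : 'M'
        if !isempty(ops) && ops[end][1] == op
            # Extend the current run.
            ops[end] = (op, ops[end][2] + 1)
        else
            # Start a new run.
            push!(ops, (op, 1))
        end
    end
    return ops
end

# The ungapped sequence that would be stored alongside the operations.
ungap(row::AbstractString) = filter(!=('-'), row)
```

For example, `implied_ops("AC--GT")` yields two matches, two deletions, and two matches, while `ungap("AC--GT")` gives `"ACGT"`.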

kescobo commented 3 years ago

Well, something has to be the reference, right? It could just be a consensus sequence that's never directly observed, but short of actually storing every sequence, you need something that edits are defined against.

Thinking about it some more, I wonder if you could do something like:

  1. When you first create an MSA, you either define a reference explicitly or have one generated as a consensus sequence.
  2. If the MSA is mutable, you can add more sequences, which are stored as edits against the existing reference.
  3. You can call consensus!(msa) (or something) to update the reference to the best consensus and recalculate the edits against it.
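The three steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: `SketchMSA`, `diffs`, `restore`, `consensus`, and `sequences` are invented names, rows are pre-aligned ASCII strings of equal length, edits are stored as position => character substitutions, and the consensus is a simple per-column majority vote with ties broken by the smallest character.

```julia
# Hypothetical sketch; none of these names exist in BioAlignments.jl.
# Rows are assumed to be pre-aligned gapped ASCII strings of equal length.

# Positions where `row` differs from `ref`, paired with the differing character.
diffs(ref::String, row::String) =
    Pair{Int,Char}[i => row[i] for i in eachindex(ref) if row[i] != ref[i]]

# Rebuild a row by applying its edits to the reference.
function restore(ref::String, edits::Vector{Pair{Int,Char}})
    chars = collect(ref)
    for (i, c) in edits
        chars[i] = c
    end
    return String(chars)
end

# Most common character per column; ties go to the smallest character.
function consensus(rows::Vector{String})
    ncols = length(rows[1])
    out = Vector{Char}(undef, ncols)
    for i in 1:ncols
        counts = Dict{Char,Int}()
        for r in rows
            counts[r[i]] = get(counts, r[i], 0) + 1
        end
        best = maximum(values(counts))
        out[i] = minimum(c for (c, n) in counts if n == best)
    end
    return String(out)
end

mutable struct SketchMSA
    reference::String
    edits::Vector{Vector{Pair{Int,Char}}}
end

# Step 1: use an explicit reference, or fall back to a consensus.
function SketchMSA(rows::Vector{String}; reference::String = consensus(rows))
    return SketchMSA(reference, Vector{Pair{Int,Char}}[diffs(reference, r) for r in rows])
end

# All rows, reconstructed from the reference plus each edit list.
sequences(msa::SketchMSA) = String[restore(msa.reference, e) for e in msa.edits]

# Step 2: add a sequence as edits against the current reference.
function Base.push!(msa::SketchMSA, row::String)
    length(row) == length(msa.reference) ||
        throw(ArgumentError("row must match the reference length"))
    push!(msa.edits, diffs(msa.reference, row))
    return msa
end

# Step 3: recompute the reference as a consensus and re-derive every edit list.
function consensus!(msa::SketchMSA)
    seqs = sequences(msa)
    msa.reference = consensus(seqs)
    msa.edits = Vector{Pair{Int,Char}}[diffs(msa.reference, s) for s in seqs]
    return msa
end
```

Note that `consensus!` has to reconstruct every row before it can swap the reference, so in this sketch the cost of step 3 grows with the whole alignment rather than with the number of edits.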