amnh / PCG

𝙋𝙝𝙮𝙡𝙤𝙜𝙚𝙣𝙚𝙩𝙞𝙘 𝘾𝙤𝙢𝙥𝙤𝙣𝙚𝙣𝙩 𝙂𝙧𝙖𝙥𝙝 ⸺ Haskell program and libraries for general phylogenetic graph search

Implement partitioned FASTA #135

Open recursion-ninja opened 5 years ago

recursion-ninja commented 5 years ago

Details:

A partitioned FASTA file includes one or more # characters in each sequence. The sequence of every taxon in the file must contain exactly the same number of # characters.

Each # breaks the dynamic character into two separate characters which will be aligned and optimized independently. Because the dynamic characters are all in the same file, they will default to being in the same block for network optimizations.

For example, the following FASTA file:

> Alpha
ACCT#GATT#CATTAG
> Bravo
CCT#GAT#CATAG
> Charlie
ACC#ATTT#CATTAG

Would return the following Map String [String]:

Map.fromList
  [ ("Alpha"  , [ "ACCT", "GATT", "CATTAG" ])
  , ("Bravo"  , [ "CCT" , "GAT" , "CATAG"  ])
  , ("Charlie", [ "ACC" , "ATTT", "CATTAG" ])
  ]

This represents each taxon as having 3 dynamic characters in a single block.
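The splitting step could be sketched roughly as follows. This is only an illustration: `splitPartitions` and `partitionFasta` are hypothetical names, and the input is assumed to be a plain list of (taxon name, raw sequence) pairs rather than whatever type the existing FASTA parser actually produces.

```haskell
import qualified Data.Map as Map

-- Split a raw sequence on every '#'.
-- Empty partitions are kept as empty strings, so "ACT##CAT" yields ["ACT", "", "CAT"].
splitPartitions :: String -> [String]
splitPartitions s =
  case break (== '#') s of
    (chunk, [])       -> [chunk]
    (chunk, _ : rest) -> chunk : splitPartitions rest

-- Hypothetical post-parsing step: one list of dynamic characters per taxon.
-- The (name, sequence) input shape is an assumption made for this sketch.
partitionFasta :: [(String, String)] -> Map.Map String [String]
partitionFasta taxa =
  Map.fromList [ (name, splitPartitions sq) | (name, sq) <- taxa ]
```

Because the split keeps empty strings, every taxon with the same number of # characters ends up with the same number of list elements, whether or not empty partitions are ultimately allowed.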

We should also decide if we will allow empty partitions in a FASTA file.

For example, would the following file be allowed?

> Alpha
ACT#ATT#CAT
> Bravo
ACT##CAT
> Charlie
ACT#ATT#
> Delta
#ATT#CAT

In the above we can see that Bravo, Charlie, and Delta all have empty partitions.
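If we decide to reject (or merely warn about) empty partitions, locating them is cheap once the sequences are split. A minimal sketch, assuming the sequences have already been split into per-taxon lists; `emptyPartitions` is a hypothetical name and the zero-based indexing is an arbitrary choice:

```haskell
-- For each taxon, list the zero-based indices of its empty partitions.
-- Taxa without any empty partition are omitted from the result.
emptyPartitions :: [(String, [String])] -> [(String, [Int])]
emptyPartitions taxa =
  [ (name, idxs)
  | (name, parts) <- taxa
  , let idxs = [ i | (i, p) <- zip [0 ..] parts, null p ]
  , not (null idxs)
  ]
```

For the file above this would flag Bravo at index 1, Charlie at index 2, and Delta at index 0, which is enough information to build either a parse error or a warning.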

Implementation:

The FASTA parser already exists in a very usable state and accepts # characters, though it does not currently interpret them in the special way described above.

We should perform a post-parsing pass over the FASTA data. If any sequence has one or more # chars present, we will enforce that all sequences have the same number of # chars, or else raise a parse error.

We should make the parse error as human readable as possible. For example, if all but one sequence had four # chars and the other sequence had a different number, the parse error should focus the user's attention on only the outlier sequence.
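One possible shape for this check is sketched below. It treats the most common # count as the expected one and reports only the sequences that deviate from it. All names (`separatorCount`, `majorityCount`, `checkPartitions`) and the message wording are hypothetical, and the (taxon name, raw sequence) input is an assumption; a real implementation would presumably produce the parser's own error type rather than a `String`.

```haskell
import Data.List (group, sort, sortOn)
import Data.Ord (Down (..))

-- Number of '#' separators in one raw sequence.
separatorCount :: String -> Int
separatorCount = length . filter (== '#')

-- Most frequent count; ties resolve to the smaller count because
-- sortOn is stable. Nothing for an empty input.
majorityCount :: [Int] -> Maybe Int
majorityCount ns =
  case sortOn (Down . length) (group (sort ns)) of
    ((n : _) : _) -> Just n
    _             -> Nothing

-- Succeeds when every sequence has the same number of separators;
-- otherwise returns one message line per outlier sequence.
checkPartitions :: [(String, String)] -> Either String ()
checkPartitions taxa =
  case majorityCount (map snd counts) of
    Nothing -> Right ()
    Just expected ->
      case [ (name, n) | (name, n) <- counts, n /= expected ] of
        []       -> Right ()
        outliers -> Left (unlines (map (describe expected) outliers))
  where
    counts = [ (name, separatorCount sq) | (name, sq) <- taxa ]
    describe expected (name, n) =
      "Taxon '" ++ name ++ "' has " ++ show n
        ++ " '#' separator(s), but the other sequences have " ++ show expected
```

Using the majority count means that a single malformed sequence is the only one named in the error, rather than every sequence that differs from the first one in the file. A file with no # characters at all passes trivially, since every count is zero.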