Feature request: filter out missing bases in ska align

rderelle commented 1 year ago

The command 'ska align --filter no-ambig-or-const' seems to output every position that has at least one missing base ('-'), resulting in large alignments made up mostly of constant positions. nb: the same behaviour is observed with the command 'ska align --filter no-const'.

Using the following commands on a dataset of 67 TB samples, I obtained an alignment of 1 Mb where 100-200 nucleotides were expected, mostly because the dataset contains 3 samples with very low coverage and a lot of missing data (see pictures below):

./ska_0.3.1 build --threads 1 --min-count 5 -f list_ska_GC.txt -o GC -k 31
./ska_0.3.1 nk --full-info GC.skf > GC.txt
./ska_0.3.1 align --filter no-ambig-or-const GC.skf > GC.fas

Thanks,
Romain

johnlees commented 1 year ago

This is the designed behaviour for these functions, so not a bug as such – the current answer would be to remove the low-quality samples.

But adding a filter which also ignores missing sites would be helpful here, so I can add that.
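For illustration only, here is a minimal Python sketch of what such a filter could do as a post-processing step on an alignment. It is not ska's implementation: the function name `drop_constant_ignoring_missing`, the dict-of-strings input, and the choice to treat only '-' as missing are assumptions made for this example.

```python
def drop_constant_ignoring_missing(seqs, missing="-"):
    """Keep only the columns whose observed (non-missing) bases differ.

    seqs: dict mapping sample name -> aligned sequence (all the same length).
    A column made of one base plus missing characters is treated as constant
    and dropped, as is a column that is entirely missing.
    """
    names = list(seqs)
    rows = [seqs[name].upper() for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    # zip(*rows) walks the alignment column by column.
    keep = [
        i
        for i, column in enumerate(zip(*rows))
        if len(set(column) - {missing}) > 1
    ]
    return {name: "".join(row[i] for i in keep) for name, row in zip(names, rows)}


# Toy alignment (hypothetical data): only the last column has observed variation.
aln = {"s1": "ACGTA", "s2": "AC-TC", "s3": "A-GTA"}
filtered = drop_constant_ignoring_missing(aln)
```

With this toy input, the two columns that contain a '-' are dropped along with the fully constant ones, so the low-coverage samples stay in the alignment without inflating it.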

rderelle commented 1 year ago

Ok, thanks. Then another filter option would do (it's a difficult decision to remove samples while analysing an outbreak).

nb: on a side note, one could argue that 'constant' should include positions where all observed nucleotides are identical and the rest is missing data, since there is no observed variation at these positions.

Thanks again!

rderelle commented 1 year ago

nb2: but missing data could be due to an indel. I see your point.
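To see how much of an alignment falls into each case discussed in this thread, a small Python sketch along these lines could tally the columns. The function names and the three category labels are hypothetical, chosen for this example, and only '-' is assumed to mark missing data. The 'constant_with_missing' class is kept separate because, as noted, a gap may reflect a real indel rather than low coverage.

```python
from collections import Counter


def classify_column(column, missing="-"):
    """Classify one alignment column.

    'variable'              -> at least two different observed bases
    'constant_with_missing' -> at most one observed base, plus missing data
                               (an all-missing column also lands here)
    'constant'              -> one base, no missing data
    """
    observed = set(column) - {missing}
    if len(observed) > 1:
        return "variable"
    if missing in column:
        return "constant_with_missing"
    return "constant"


def column_summary(rows, missing="-"):
    """Count how many columns of the alignment fall into each class."""
    return Counter(classify_column(column, missing) for column in zip(*rows))


# Toy alignment (hypothetical data), one string per sample.
summary = column_summary(["ACGTA", "AC-TC", "A-GTA"])
```

A large 'constant_with_missing' count relative to 'variable' would suggest that the alignment size is driven by missing data rather than by real variation.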

bacpop / ska.rust

Feature request: filter out missing bases in ska align #51