bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0

Feature request: filter out missing bases in ska align #51

Closed (rderelle closed 1 year ago)

rderelle commented 1 year ago

The command 'ska align --filter no-ambig-or-const' seems to keep every position that has at least one missing base ('-'), which results in large alignments of constant positions. nb: the same behaviour is observed with the command 'ska align --filter no-const'.

Using the following command lines on a dataset of 67 TB samples, I obtained an alignment of 1 Mb, where 100-200 nucleotides were expected. This is mostly because the dataset contains 3 samples with very low coverage and a lot of missing data (see pictures below):

./ska_0.3.1 build --threads 1 --min-count 5 -f list_ska_GC.txt -o GC -k 31
./ska_0.3.1 nk --full-info GC.skf > GC.txt
./ska_0.3.1 align --filter no-ambig-or-const GC.skf > GC.fas

[Screenshot 2023-07-20 at 10 23 58]
[Screenshot 2023-07-20 at 10 24 09]

Thanks,
Romain

johnlees commented 1 year ago

This is the designed behaviour for these functions, so it is not a bug as such – the current answer would be to remove the low-quality samples.

But adding a filter which also ignores missing sites would be helpful here, so I can add that.
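To make the difference concrete, here is a minimal Python sketch of the two ways a column can be judged "constant": strictly (a '-' counts as a distinct character) or with missing bases ignored. This is an illustration only, not ska's implementation; the function names `is_constant` and `drop_constant_columns` are hypothetical, and it assumes missing bases are written as '-' as in the alignment described above.

```python
from typing import Dict

MISSING = "-"  # assumption: missing bases appear as '-' in the alignment


def is_constant(column: str, ignore_missing: bool) -> bool:
    """Return True if every base in the column is the same.

    With ignore_missing=True, '-' is dropped before comparing, so a
    column such as 'AA-A' counts as constant.
    """
    bases = set(column.upper())
    if ignore_missing:
        bases.discard(MISSING)
    return len(bases) <= 1


def drop_constant_columns(
    alignment: Dict[str, str], ignore_missing: bool = True
) -> Dict[str, str]:
    """Keep only the variable columns of an alignment given as {name: sequence}."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) walks the alignment column by column.
    kept = [col for col in zip(*seqs) if not is_constant("".join(col), ignore_missing)]
    # Transpose the kept columns back into one sequence per sample.
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Toy alignment: column 2 is 'G-G' and column 4 is 'AA-'.
toy = {"s1": "ACGTA", "s2": "AC-TA", "s3": "ATGT-"}
strict = drop_constant_columns(toy, ignore_missing=False)
relaxed = drop_constant_columns(toy, ignore_missing=True)
```

In the toy example, the strict definition keeps the two columns that only differ by a '-', while the relaxed one keeps just the single column with a real C/T difference. A few low-coverage samples can inflate an alignment in the same way.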

rderelle commented 1 year ago

Ok, thanks. Then another filter option would do (it's a difficult decision to remove samples while analysing an outbreak).

nb: on a side note, one could argue that 'constant' should also cover positions where all observed nucleotides are identical and the rest is missing data, since no variation is observed at these positions.

Thanks again!

rderelle commented 1 year ago

nb2: but missing data could be due to an indel. I see your point.