I'm in the middle of making an IRMA module for adenoviruses. I came across your repo today and think it would be useful for that purpose, in particular for generating consensus sequences. The IRMA paper mentions a few filtration steps that seem like a natural fit (see "Methods" > "Datasets" > "Influenza alignment dataset", second paragraph). In particular, they mention:
Removing duplicate sequences
This should be the (second-)easiest of the bunch.
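To make the request concrete, here is a minimal sketch of what I have in mind. It is not based on this repo's code: the function name `remove_duplicates` and the `(name, sequence)` tuple layout are hypothetical, and I'm assuming duplicates should be compared with gaps stripped and case ignored.

```python
def remove_duplicates(records):
    """Keep the first occurrence of each distinct sequence.

    `records` is a list of (name, aligned_sequence) tuples (assumed layout).
    Sequences are compared with gaps removed and case ignored (an assumption).
    """
    seen = set()
    kept = []
    for name, seq in records:
        key = seq.replace("-", "").upper()
        if key in seen:
            continue  # an identical sequence was already kept
        seen.add(key)
        kept.append((name, seq))
    return kept
```

Keeping the first occurrence preserves the input order, which may matter if the first record is a reference.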
Removing sequences with greater than N ambiguous nucleotides
In the paper, the authors specified N=5, which may be a good default setting for Influenza A/B segments.
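A rough sketch of this filter, under my own assumptions: "ambiguous" means any IUPAC nucleotide code other than A, C, G, T or U, gaps are not counted, and `max_ambiguous` is a hypothetical parameter name (not an existing option in this repo).

```python
# IUPAC ambiguity codes for nucleotides (everything except A, C, G, T, U)
AMBIGUOUS = set("RYSWKMBDHVN")


def count_ambiguous(seq):
    """Count ambiguous nucleotide codes in a sequence; gaps are not counted."""
    return sum(1 for base in seq.upper() if base in AMBIGUOUS)


def remove_ambiguous(records, max_ambiguous=5):
    """Drop sequences with more than `max_ambiguous` ambiguous bases.

    `records` is a list of (name, sequence) tuples (assumed layout).
    """
    return [(name, seq) for name, seq in records
            if count_ambiguous(seq) <= max_ambiguous]
```

With the default of 5, a sequence containing exactly five ambiguous bases is kept and one containing six is removed, matching "greater than N".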
Removing sequences causing frame-shifts
I think this may be relatively difficult to calculate, compared to the others.
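One possible heuristic, sketched below purely as a starting point: compare each aligned sequence against an in-frame reference row and flag any insertion or deletion whose length is not a multiple of 3. This is my own simplification, not the method from the IRMA paper. It assumes a coding reference is available in the alignment, it ignores leading and trailing gaps in the query (treated as a partial sequence), and it does not net out compensating indels. The function name `has_frameshift` is hypothetical.

```python
def has_frameshift(ref, query):
    """Return True if `query` has an indel relative to `ref` whose length
    is not a multiple of 3. Both arguments are rows of the same alignment."""
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    # Restrict to the span covered by the query so that terminal gaps
    # (partial sequences) are not treated as deletions.
    covered = [i for i, base in enumerate(query) if base != "-"]
    if not covered:
        return False
    start, end = covered[0], covered[-1]

    run_kind, run_len = None, 0
    for r, q in zip(ref[start:end + 1], query[start:end + 1]):
        if r == "-" and q == "-":
            continue  # column is a gap in both rows; carries no information
        if r != "-" and q == "-":
            kind = "del"
        elif r == "-" and q != "-":
            kind = "ins"
        else:
            kind = None
        if kind != run_kind:
            # A run just ended: check the length of the finished indel.
            if run_kind is not None and run_len % 3 != 0:
                return True
            run_kind, run_len = kind, 0
        run_len += 1
    # Check an indel that runs up to the end of the covered span.
    return run_kind is not None and run_len % 3 != 0
```

A sequence would then be removed when `has_frameshift(reference, sequence)` is true. Choosing the reference row is the hard part, and I don't have a good answer for alignments without one.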
Removing short sequences
This functionality is already implemented (--remove_short), but it may be nice to have the ability to specify a percentage of the alignment as a cutoff.
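As an illustration of the percentage cutoff, here is a small sketch. It does not reflect how `--remove_short` is currently implemented; `remove_short_by_fraction` and `min_fraction` are hypothetical names, and I'm assuming sequence length means the number of non-gap characters, measured against the alignment width.

```python
def remove_short_by_fraction(records, min_fraction=0.5):
    """Drop sequences whose non-gap length is below `min_fraction`
    of the alignment width.

    `records` is a list of (name, aligned_sequence) tuples (assumed layout).
    """
    if not records:
        return []
    aln_len = len(records[0][1])
    if aln_len == 0:
        return list(records)  # nothing to measure against
    kept = []
    for name, seq in records:
        ungapped_len = len(seq.replace("-", ""))
        if ungapped_len / aln_len >= min_fraction:
            kept.append((name, seq))
    return kept
```

For example, with `min_fraction=0.5` and a 10-column alignment, a sequence with 5 non-gap bases is kept and one with 2 is removed.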
I'm sorry it's taken such a long time to reply! We will look at incorporating these features. All except the frameshift filter seem reasonably straightforward. I'll look into it and get back to you.