JLSteenwyk / ClipKIT

a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference
https://jlsteenwyk.com/ClipKIT/
MIT License

Question about clarifying defaults for --gap_characters #43

Closed smorzechowski closed 8 months ago

smorzechowski commented 8 months ago

Thanks for this wonderful tool! I wanted to clarify some information in the help file about the defaults for the --gap_characters argument.

Should the default for nucleotides include N as a gap character as the information under Sequence type suggests?

  Sequence type
      Specifies the type of sequences in the input file. Valid options
      include aa or nt for amino acids and nucleotides. This argument
      is case insensitive. This matters for what characters are
      considered gaps. For amino acids, -, ?, *, and X are considered
      gaps. For nucleotide sequences, the same characters are
      considered gaps as well as N.

I just wanted to check, since I believe the defaults listed under --gap_characters show the opposite.

  -gc, --gap_characters <string_of_gap_chars> specifies gap characters used in input file
                                              (default for aa: "-?*XxNn"
                                               default for nt: "-?*Xx")

Thanks very much!


smorzechowski commented 8 months ago

As an addendum, I found a tiny bug when trying to manually specify gap characters. I tried the following:

clipkit $alignment --gap_characters "-?*XxNn"

but got this error: clipkit: error: argument -gc/--gap_characters: expected one argument

When I tried without quotes: clipkit $alignment --gap_characters -?*XxNn

I got the same error message.

However, when I specified just Nn or NnXx*?- (e.g., clipkit $alignment -gc NnXx*?-), it worked fine without throwing any errors, and all the characters were used in the output file!

JLSteenwyk commented 8 months ago

Hi @smorzechowski,

Firstly, thank you so much for using ClipKIT and for writing to us about your issue. We really appreciate community members who help improve the overall experience and quality of ClipKIT.

Our apologies for the confusion regarding gap characters.

The help message was insufficiently clear - sorry about that. The two defaults were swapped: amino acid gaps are Xx-?* and nucleotide gaps are XxNn-?*.
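To make the corrected defaults concrete, here is a minimal sketch of a lookup keyed by sequence type. The names DEFAULT_GAP_CHARS and default_gap_chars are hypothetical and are not taken from the ClipKIT source. Only the character sets and the case-insensitive aa/nt values come from this thread and the help text.

```python
from typing import List

# Hypothetical table; the character sets are the defaults stated in this thread.
DEFAULT_GAP_CHARS = {
    "aa": "-?*Xx",    # amino acids
    "nt": "-?*XxNn",  # nucleotides: same as amino acids, plus N/n
}


def default_gap_chars(sequence_type: str) -> List[str]:
    """Return the default gap characters for 'aa' or 'nt' (case insensitive)."""
    key = sequence_type.lower()
    if key not in DEFAULT_GAP_CHARS:
        raise ValueError(f"unknown sequence type: {sequence_type!r}")
    return list(DEFAULT_GAP_CHARS[key])


if __name__ == "__main__":
    print(default_gap_chars("NT"))
    print(default_gap_chars("aa"))
```

The only difference between the two sets is N/n, which is treated as a gap for nucleotides but not for amino acids.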

Regarding the error when specifying gaps as -?*XxNn: the parser gets confused when the gap characters start with - because it reads the string as another option rather than as the value of --gap_characters. As a result, --gap_characters appears to have no value, which is why the error says one argument was expected. This clarification has been added to the documentation and help message.
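The behavior can be reproduced with a minimal sketch, assuming the command-line interface is built on Python's argparse (the format of the error message suggests this, but it is an assumption here). The parser below is a hypothetical stand-in, not ClipKIT's actual parser. It also shows that plain argparse accepts a value beginning with - when it is attached with =; whether the same form works with ClipKIT itself was not checked in this thread.

```python
import argparse
import contextlib
import io


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real parser; only the option names
    # -gc/--gap_characters are taken from the help text quoted above.
    parser = argparse.ArgumentParser(prog="clipkit-sketch")
    parser.add_argument("input")
    parser.add_argument("-gc", "--gap_characters", type=str, default=None)
    return parser


def try_parse(argv):
    """Return the parsed gap-character string, or None if argparse rejects argv."""
    parser = build_parser()
    captured = io.StringIO()
    try:
        # argparse prints its usage/error text to stderr and then exits.
        with contextlib.redirect_stderr(captured):
            return parser.parse_args(argv).gap_characters
    except SystemExit:
        return None


if __name__ == "__main__":
    # Value starts with "-": argparse sees an unknown option, so
    # --gap_characters is left without a value and parsing fails.
    print(try_parse(["aln.fa", "--gap_characters", "-?*XxNn"]))

    # Same characters, not starting with "-": accepted.
    print(try_parse(["aln.fa", "-gc", "NnXx*?-"]))

    # Attaching the value with "=" keeps it bound to the option.
    print(try_parse(["aln.fa", "--gap_characters=-?*XxNn"]))
```

Note that the argument lists above bypass the shell. On a real command line the characters ? and * are also glob characters, so quoting the value remains a good idea regardless of the order of the characters.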

You can download the latest ClipKIT release, version 2.2.4, using pip3 install clipkit -U.

Also, to check that the gap characters are being interpreted correctly, ClipKIT prints all user arguments:

-------------
| Arguments |
-------------
Input file: example_file.fa (format: fasta)
Output file: example_file.fa.clipkit (format: fasta)
Sequence type: Nucleotides
Gaps threshold: 1
Gap characters: ['?', '*', 'X', 'x', 'N', 'n']
Trimming mode: smart-gap
Create complementary output: False
Process as codons: False
Create log file: False

(see the line: Gap characters).

Thank you again for your message!

best,

Jacob