benhid / pyMSA

Scoring multiple sequence alignments with Python
MIT License

Simple CLI? #8

Open multimeric opened 3 years ago

multimeric commented 3 years ago

Hi, thanks for this wonderful library!

I'm wondering whether, since we have utilities like read_fasta_file_as_list_of_pairs and also run_all_scores, which runs a comprehensive evaluation, we could write a CLI that calls these on an input FASTA alignment (initially not supporting other alignment formats, for simplicity), and maybe make the scores configurable via flags. It seems that there was a benchmark.py that did this (it's alluded to in the PDF), but it must have been deleted.

This would offer a very useful and easy method of evaluating MSAs, which as far as I can tell is a gap in the ecosystem at the moment.
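A minimal sketch of what such a CLI could look like, using only the standard library. Everything named here is hypothetical: read_fasta_pairs stands in for read_fasta_file_as_list_of_pairs, the two scoring functions stand in for pyMSA's scores, and the --score flag is just one possible design. None of it is pyMSA's actual API.

```python
import argparse


def read_fasta_pairs(path):
    """Read a FASTA alignment into a list of (header, sequence) pairs."""
    pairs = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: store the previous one, if any.
                if header is not None:
                    pairs.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        pairs.append((header, "".join(chunks)))
    return pairs


def percentage_of_gaps(sequences):
    """Percentage of '-' characters over all aligned positions."""
    total = sum(len(s) for s in sequences)
    gaps = sum(s.count("-") for s in sequences)
    return 100.0 * gaps / total


def percentage_of_conserved_columns(sequences):
    """Percentage of columns holding one identical, non-gap symbol."""
    columns = list(zip(*sequences))
    conserved = sum(
        1 for col in columns if len(set(col)) == 1 and col[0] != "-"
    )
    return 100.0 * conserved / len(columns)


# Hypothetical registry mapping flag values to scoring functions.
SCORES = {
    "gaps": percentage_of_gaps,
    "conserved": percentage_of_conserved_columns,
}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Score a FASTA alignment.")
    parser.add_argument("fasta", help="path to an aligned FASTA file")
    parser.add_argument(
        "--score",
        action="append",
        choices=sorted(SCORES),
        help="score to compute (repeatable); defaults to all scores",
    )
    args = parser.parse_args(argv)

    sequences = [seq for _, seq in read_fasta_pairs(args.fasta)]
    # An alignment needs at least one sequence, all of equal length.
    if len({len(s) for s in sequences}) != 1:
        parser.error("input must contain aligned sequences of equal length")

    results = {}
    for name in args.score or sorted(SCORES):
        results[name] = SCORES[name](sequences)
        print(f"{name}\t{results[name]:.2f}")
    return results
```

Calling main(["alignment.fasta", "--score", "gaps"]) runs a single score; omitting --score runs all registered ones. To use it from a shell, call main() under an `if __name__ == "__main__":` guard or register it as a console entry point.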

a1ultima commented 3 years ago

@multimeric I have also noticed the missing benchmark.py, and our team's current needs support your closing statement. Without it, we would have to build our own CLI that reuses the utilities you mentioned.

Further to this, it would be nice to have a DNA-only variant of pyMSA's scoring stack, e.g. for sum-of-pairs, where a DNA substitution matrix could be passed as input.

We're currently scoring various MSA approaches to decide on the right one for our pipeline, so these features would really close a gap in the ecosystem.

a1ultima commented 3 years ago

@multimeric, I now have a local repo that uses read_fasta_file_as_list_of_pairs, importing it into the score_alignments.py script so that it accepts someFile.fasta as input rather than a hard-coded Python list. It's still not a full CLI, but let me know if this is something you would want.

a1ultima commented 3 years ago

@multimeric, I now have score_alignments.py working for purely DNA sequences, so the sum-of-pairs score can be calculated by providing a DNA substitution matrix file as an input argument, e.g. DNA85.txt:


# Match score: 1.766, mismatch score: -2.322 bits
# Expected score: -1.30, entropy: 1.15 bits

      A     T     G     C 
A   1.77 -2.32 -2.32 -2.32
T  -2.32  1.77 -2.32 -2.32
G  -2.32 -2.32  1.77 -2.32
C  -2.32 -2.32 -2.32  1.77

These matrices can be created by: https://bioinformaticshome.com/online_software/create_DNA_matrix/createDNAmatrix.html
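As a rough illustration of the idea, here is a self-contained sketch that parses a matrix in the text format shown above and computes a sum-of-pairs score over the alignment columns. It does not use pyMSA or score_alignments.py; the function names, the gap penalty of -8.0, and the choice to ignore gap-gap pairs are all assumptions made for the example, and pyMSA's own gap handling may differ.

```python
from itertools import combinations


def parse_substitution_matrix(text):
    """Parse a whitespace-separated matrix with '#' comments into a dict."""
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    symbols = rows[0]  # header row: column symbols, e.g. A T G C
    matrix = {}
    for row in rows[1:]:
        row_symbol, values = row[0], row[1:]
        for col_symbol, value in zip(symbols, values):
            matrix[(row_symbol, col_symbol)] = float(value)
    return matrix


def sum_of_pairs(sequences, matrix, gap_penalty=-8.0, gap="-"):
    """Sum substitution scores over every pair of symbols in every column."""
    total = 0.0
    for column in zip(*sequences):
        for a, b in combinations(column, 2):
            if a == gap and b == gap:
                continue  # assumption: gap-gap pairs contribute nothing
            if a == gap or b == gap:
                total += gap_penalty  # assumption: flat penalty per gap pair
            else:
                total += matrix[(a, b)]
    return total


# The DNA85 matrix from the comment above, inlined for the example.
DNA85 = """\
# Match score: 1.766, mismatch score: -2.322 bits
# Expected score: -1.30, entropy: 1.15 bits

      A     T     G     C
A   1.77 -2.32 -2.32 -2.32
T  -2.32  1.77 -2.32 -2.32
G  -2.32 -2.32  1.77 -2.32
C  -2.32 -2.32 -2.32  1.77
"""

dna_matrix = parse_substitution_matrix(DNA85)
example_score = sum_of_pairs(["ACGT", "ACGA", "AC-T"], dna_matrix)
print(example_score)
```

Keying the dictionary on (row, column) symbol pairs keeps lookups simple and also works for asymmetric matrices, should one ever be supplied.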

multimeric commented 3 years ago

Yeah I think I made a simple script by combining run_all_scores with read_fasta_file_as_list_of_pairs: https://github.com/benhid/pyMSA/blob/570d902bbc214a30f18b93adf735c46836611bf5/examples/runner.py#L5-L42