endixk / ezaai

EzAAI - High Throughput Prokaryotic AAI Calculator
http://leb.snu.ac.kr/ezaai
GNU General Public License v3.0

How about eukaryotic proteins #18

Open neptuneyt opened 5 months ago

neptuneyt commented 5 months ago

Dear developer,

I'm curious whether EzAAI is also suitable for AAI calculations between eukaryotic proteins. Please note that the input here would be proteins, not genomes.

Looking forward to your reply. Thanks a lot.

endixk commented 5 months ago

Dear @neptuneyt,

The prokaryotic constraint of the program applies only to the extract module, which uses Prodigal.

All other functionality, including AAI calculation, should work with eukaryotic proteomes.

Just run the convert module on your FASTA file of eukaryotic proteins to produce a database compatible with all subsequent steps.

Hope this helps!
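A minimal sketch of that conversion step, assuming the convert module accepts `-i` (input FASTA), `-s prot` (sequence type), and `-o` (output database). These flags are not stated in this thread, so confirm them against the tool's own help output. The function name `print_convert_cmds` and the `proteins` directory are hypothetical, and file names are assumed to contain no whitespace. The function only prints the commands so they can be inspected before being run.

```bash
# Print one convert command per protein FASTA file in a directory.
# Assumption: the -i / -s prot / -o flags match the installed EzAAI version.
print_convert_cmds() (
    in_dir="$1"; out_dir="$2"
    for faa in "$in_dir"/*.faa; do
        [ -e "$faa" ] || continue   # skip when the glob matches nothing
        # A.faa becomes <out_dir>/A.faa.db, the naming seen in `ls protein_db`
        printf 'EzAAI convert -i %s -s prot -o %s/%s.db\n' \
            "$faa" "$out_dir" "$(basename "$faa")"
    done
)

# Example (hypothetical paths): print_convert_cmds proteins protein_db
```

Once the printed commands look right, they can be piped to `sh` to execute them.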

neptuneyt commented 5 months ago

Thank you for such a prompt reply, I'll try it.

neptuneyt commented 5 months ago

Following your instructions, I got the results. However, I ran into another problem: how to get non-redundant AAI values between N proteomes. For example, with three proteomes A.faa, B.faa, and C.faa, only AB, AC, and BC need to be computed, but the following commands gave me 3*3=9 pairs, 6 of which are redundant (AB=BA, plus the self-comparisons).

```bash
$ ls protein_db
A.faa.db B.faa.db C.faa.db
$ EzAAI calculate -i protein_db -j protein_db -t 10 -o ezaai_q3_r3.tsv
```

With a small N this would not take much time, but I have thousands of proteomes, so the redundant comparisons would be quite time-consuming. I look forward to your reply if there is a good solution!

endixk commented 5 months ago

Thank you for pointing this out.

This is actually a result of my lazy implementation. The current code only implements comparison between two distinct sets of proteomes and therefore cannot detect redundancy, even when two identical sets are given as input.

I assume I can provide something like a -self flag that explicitly indicates that the set is being compared against itself.
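Until such an option exists, one possible workaround is to enumerate each unordered pair once and run one comparison per pair. The sketch below is not part of EzAAI: the function name `print_pair_cmds` and the output naming are hypothetical. It assumes `calculate` accepts a single `.db` file for `-i` and `-j` (this thread only shows directories) and that file names contain no whitespace. It prints the commands rather than running them.

```bash
# Print one calculate command per unordered pair of databases,
# skipping self-comparisons and mirrored pairs.
print_pair_cmds() (
    db_dir="$1"; out_dir="$2"
    set -- "$db_dir"/*.db          # positional parameters now hold the databases
    [ -e "$1" ] || exit 0          # nothing to do when the glob matches nothing
    while [ "$#" -gt 1 ]; do
        a="$1"; shift              # take the first database and pair it
        for b in "$@"; do          # only with the databases listed after it
            printf 'EzAAI calculate -i %s -j %s -o %s/%s_vs_%s.tsv\n' \
                "$a" "$b" "$out_dir" \
                "$(basename "$a" .db)" "$(basename "$b" .db)"
        done
    done
)

# Example: print_pair_cmds protein_db results
```

For N databases this prints N(N-1)/2 commands, so the three proteomes in the example above yield 3 comparisons instead of 9. The per-pair output files would still need to be concatenated afterwards, and launching the tool once per pair adds some startup overhead of its own.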