eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

annotate_cluster.pl excludes identical sequences #87

Closed carolynzy closed 2 years ago

carolynzy commented 2 years ago

Hi, I'm running annotate_cluster.pl on my clusters and noticed something strange. Take cluster 1228009 as an example: it contains 421 sequences, each 64 aa long and almost identical to the others. However, annotate_cluster.pl aligns only 251 of them. The manual says that short fragments can be left out when they do not align to the longest sequence, but I don't think that applies here.

log.annotate_cluster.1228009.txt
1228009_ubiquitin-like_prote.txt

I attached the fasta file as well as the log file. Would you please check this issue? Thank you!

P.S. I took a further look and it seems that, no matter how many sequences a cluster contains, at most 251 are aligned, even though the sequences are highly similar.

eead-csic-compbio commented 2 years ago

Hi @carolynzy, did you use option -c?

carolynzy commented 2 years ago

No. I didn't use -c.

eead-csic-compbio commented 2 years ago

Will have a look later in the day

eead-csic-compbio commented 2 years ago

Hi @carolynzy, there were two issues here:

1) The code was taking only the default number of hits reported by BLASTP; now it takes as many hits as there are sequences in the cluster. See https://github.com/eead-csic-compbio/get_homologues/commit/561894a4809730b078a5ab49e5d9656df5914bce

2) The sequences in your sample cluster have redundant names; the command below counts only 355 unique names among the 421 sequences:

perl -lne 'if(/^>(\S+)/){ print $1 }' 1228009_ubiquitin-like_prote.txt | sort -u |wc
355     355    4263

I have committed the changes. You will need to make the sequence names unique on your side to work around the second limitation.

Bruno
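For readers hitting the same duplicate-name problem, here is a minimal sketch of one way to find repeated identifiers in a cluster FASTA file and make them unique before running annotate_cluster.pl. It is not part of get_homologues: the function names (`read_fasta`, `duplicated_names`, `make_names_unique`) and the `_2`, `_3` suffix scheme are assumptions made for illustration. Like the one-liner above, it treats the first whitespace-delimited token of each header as the sequence name.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def _split_header(header):
    """Return (name, rest), where name is the first whitespace-delimited token."""
    parts = header.split(None, 1)
    name = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    return name, rest


def duplicated_names(records):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(_split_header(header)[0] for header, _ in records)
    return {name: n for name, n in counts.items() if n > 1}


def make_names_unique(records):
    """Rename repeated names with _2, _3, ... (hypothetical suffix scheme).

    The first occurrence keeps its name; later ones get the next free suffix.
    """
    used = set()
    counts = Counter()
    renamed = []
    for header, seq in records:
        name, rest = _split_header(header)
        new_name = name
        while new_name in used:
            counts[name] += 1
            new_name = f"{name}_{counts[name] + 1}"
        used.add(new_name)
        renamed.append((f"{new_name} {rest}".strip(), seq))
    return renamed


if __name__ == "__main__":
    # Tiny made-up example, not taken from the attached cluster file.
    demo = ">seq1 strain A\nMQIF\n>seq1 strain B\nMQIF\n>seq2\nMQIV\n"
    records = read_fasta(demo)
    print(duplicated_names(records))
    for header, seq in make_names_unique(records):
        print(f">{header}\n{seq}")
```

To apply this to a real file, read its contents with `open(path).read()`, pass the text to `read_fasta`, and write the renamed records back out in FASTA format. Whether a suffix is appropriate depends on how the names are used downstream, so treat the renaming rule as a placeholder.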

carolynzy commented 2 years ago

@eead-csic-compbio Thank you! I had already changed the names, but somehow I still uploaded the original version. Thank you very much!