eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis
Other
110 stars 26 forks source link

Coverage and identity in BLAST alignment #75

Closed ireneortega closed 3 years ago

ireneortega commented 3 years ago

I want to add coverage and identity filters to infer homology between genes in strains. I am working with same species genomes and I need high coverage and identity to consider a gen as present in each genome. However, when using high values for both parameters (97 % identity and 99 % coverage) I only get core genes, how about shell genes?

I want to get a presence/absence pangenome matrix using 97 % identity and 99 % coverage to identify unique genes in strains and gene clusters.

brunocontrerasmoreira commented 3 years ago

Hi @ireneortega , can you please share with us: i) what kind of input data you have (gbk files, fna files, etc) ii) which command did you use: get_homologues.pl or get_homologues-est.pl

By default only core genes are produced, you need to add -t 0 to get accessory clusters, of any size

Bruno

ireneortega commented 3 years ago

i) I am using gbk files annotated with Prokka ii) I used get_homologues.pl

I added the -t 0 option and now the pangenome is constructed as I want. Many thanks!

Just last doubt: I got an empty 1234_proteinannotation.faa file and its corresponding line in the pangenome_matrix_genes_t0.tab is empty/-/-, so I can not extract any info from what cluster. What could happened? Perhaps I should delete that cluster.

Thanks for your help!

brunocontrerasmoreira commented 3 years ago

That's a probably a gene without sequence in the gbk file, please confirm

ireneortega commented 3 years ago

I couldn't prove it. I count the number of times the text /translation= is repeated in the gbk file and it is equal to the counts printed by the tool when running ./get_homologues.pl in line 10. And in the gbk.faa file generated by the tool, there is no empty line. I don't know if there is another way to check it :(

brunocontrerasmoreira commented 3 years ago

The culprit is something called ~ proteinannotation , which might well be in the prokka output and perhaps get_homologues should not be parsing it? Let us know if you find out or please share the gbk file so I can test it here, Bruno