GDKO / AvP

Automatic evaluation of HGTs
GNU General Public License v3.0

AvP AI index different from Alienness #8

Closed lyy005 closed 1 year ago

lyy005 commented 1 year ago

Hi there, thank you for making this amazing tool to detect HGTs. I have a quick question about the AI (Alien Index) calculated by AvP. I'm looking for horizontally transferred genes in aphids. The potential donors of interest are bacteria, fungi, and viruses, so I set up groups.yaml as follows: the ingroup is Metazoa, and all hits to Arthropoda are excluded.

---
Ingroup:
  33208: Metazoa
EGP:
  6656: Arthropoda

The resulting AI from AvP was 460.51701859880916, indicating it's likely a horizontally transferred gene.

Then I ran Alienness on the same BLAST result with the same settings as in AvP, i.e. the ingroup is Metazoa and all hits to Arthropoda are excluded. However, according to the Alienness results, the same gene is not an HGT.

Here are the parameters I used for Alienness:

- Taxon group of interest: 33208-Metazoa
- Taxon group(s) to exclude: 6656-Arthropoda
- Taxon group(s) to classify: no taxa

Why are the AI values from Alienness and AvP different? Did I set up the AvP groups.yaml correctly?

Thank you for any suggestions!

YY

GDKO commented 1 year ago

Hi YY,

I am glad the tool is useful for your research. An AI that high indicates that there are no hits from ingroup sequences. Both tools use the same AI calculation, defined in Gladyshev et al., and you have specified the parameters correctly in both programs, so the AI values should be identical. However, Alienness classifies queries with AI > 15 and percentage identity > 70 as likely contamination, so your query may have been placed in that category rather than in likely_hgt.
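For readers following along, here is a minimal sketch of why an AI of roughly 460.517 implies that there were no ingroup hits. It assumes the usual form of the Gladyshev et al. index, AI = ln(best ingroup E-value + 1e-200) - ln(best donor E-value + 1e-200), with a missing hit treated as an E-value of 1. The function name `alien_index` is hypothetical, and this is not the actual code of AvP or Alienness.

```python
import math

# Small constant that keeps log() finite when an E-value is reported as 0.
E_FLOOR = 1e-200


def alien_index(best_ingroup_evalue, best_donor_evalue):
    """Alien Index sketch: ln(ingroup E + 1e-200) - ln(donor E + 1e-200).

    Pass None when a group has no hit at all; that case is treated as an
    E-value of 1 (an assumption stated in the text above).
    """
    ingroup = 1.0 if best_ingroup_evalue is None else best_ingroup_evalue
    donor = 1.0 if best_donor_evalue is None else best_donor_evalue
    return math.log(ingroup + E_FLOOR) - math.log(donor + E_FLOOR)


# No ingroup hit, and a best donor hit with E-value 0: the index reaches
# its maximum, 200 * ln(10), which is about 460.517.
ai_no_ingroup = alien_index(None, 0.0)
print(ai_no_ingroup)

# Equally good ingroup and donor hits give an index of 0.
ai_tied = alien_index(1e-50, 1e-50)
print(ai_tied)
```

Under these assumptions, the maximum is only reached when the ingroup side contributes nothing and the best donor hit has an E-value of 0, which is consistent with the value reported above.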

Can you check the output files of Alienness whether that is the case?

Cheers, Georgios

lyy005 commented 1 year ago

Dear Georgios, thank you for the quick response. I checked the Alienness output, but I didn't see the input sequence. I think Alienness omits a sequence if it is not classified into one of the three categories ("likely hgt", "possible hgt", or "likely contamination"). Attached are the DIAMOND outputs I used for AvP and Alienness, in case they are helpful.

The protein "XP_016660728.1" is an aphid protein that is not supposed to be a horizontally transferred gene. I searched it against the NR database using DIAMOND with the parameters below.

To make the input for Alienness, I used this command:
diamond blastp -d nr_v20220917.dmnd -q XP_016660728.1.fasta -o XP_016660728.1.out --evalue 1e-3 --threads 40 --mid-sensitive -k 500 --outfmt 6

Here is the input for Alienness: Alienness.XP_016660728.1.diamond.out.gz

To make the input for AvP, I used this command (the same as above except for the explicit column list ending in "staxids"):
diamond blastp -d nr_v20220917.dmnd -q XP_016660728.1.fasta -o XP_016660728.1.out --evalue 1e-3 --threads 40 --mid-sensitive -k 500 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids

Here is the input for AvP: AvP.XP_016660728.1.diamond.out.gz

For the AI calculation in AvP, I used this command:
calculate_ai.py -i AvP.XP_016660728.1.diamond.out -x groups.yaml

Thank you!

YY

GDKO commented 1 year ago

Hi YY,

I have now updated the script to ignore hits that have no taxid information because the corresponding records were removed from NCBI.
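As an illustration of that kind of filter (not the actual change made in AvP), the sketch below reads DIAMOND tabular output with the 13 columns used in the command above and skips any hit whose staxids field is empty. The function name `hits_with_taxid` and the two sample rows are hypothetical.

```python
import csv
import io

# Column order taken from the --outfmt 6 field list quoted in this thread.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "staxids",
]


def hits_with_taxid(handle):
    """Yield each hit as a dict, skipping rows without taxid information."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(COLUMNS):
            continue  # malformed or truncated row
        hit = dict(zip(COLUMNS, row))
        if not hit["staxids"].strip():
            continue  # no taxid, e.g. the record was removed from NCBI
        yield hit


# Hypothetical sample data: the second hit has an empty staxids field.
sample = io.StringIO(
    "q1\tsubjA\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\t7029\n"
    "q1\tsubjB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t180\t\n"
)
kept = list(hits_with_taxid(sample))
print([hit["sseqid"] for hit in kept])  # prints ['subjA']
```

Dropping such rows before the best ingroup and donor hits are chosen means a hit of unknown origin cannot be assigned to either group.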

Please download the new version.

Cheers, Georgios