GDKO / AvP

Automatic evaluation of HGTs
GNU General Public License v3.0

Clarification on HGT call #16

Closed bshrestha0 closed 6 months ago

bshrestha0 commented 7 months ago

Hi,

I am interested in identifying HGT events in Dinophyceae, specifically non-eukaryotic gene transfers in Dinophyceae.

So I set up groups.yaml as follows:

    Ingroup:
      2759: Eukaryota
    EGP:
      2864: Dinophyceae

While looking at the putative HGTs identified by AvP, I wasn't sure how AvP tagged a protein as HGT. I have included a screenshot of the tree generated by AvP (image: NR gp4504 fa).

As you can see in the tree, the query protein sequence "50_DN47782_c0_g1_i1" is in a clade with other Dinophyceae members, and the sister taxon to that clade is another eukaryote (YP_009033831), a streptophyte alga. All other members are bacterial species. Since a homolog of that gene is already present in another eukaryote, how can it be an HGT specific to Dinophyceae? I am a little confused here.

Best regards, Bikash

GDKO commented 6 months ago

Hi @bshrestha0,

The tree is rooted at midpoint before running the algorithm. Check the same tree in the nexus folder to see the rooted tree.

bshrestha0 commented 6 months ago

I see, the HGT call based on the rooted tree makes sense. I was looking at the wrong tree file. Thanks for pointing it out!

It looks like blastp returned multiple eukaryotic hits, including Dinophyceae, Streptophyta, Chlorophyta, and Haptophyta (multiple hits), but during the AvP prepare step only the Dinophyceae members and one Streptophyta sequence were selected. For example, here is the partial output of diamond blastp (Screenshot 2023-12-17 at 7 29 30 PM) and the output of calculate_ai.py (Screenshot 2023-12-17 at 7 31 31 PM).

Both Haptophyta and Chlorophyta were excluded from the processed fasta file; perhaps they failed to meet the criteria or were removed during clustering. I was wondering if there is a way to include these sequences in the fasta group, maybe by changing some parameters. I am using the default parameters in the config file (Screenshot 2023-12-17 at 7 44 21 PM).

Thanks!

GDKO commented 6 months ago

You should check the cutoffextend parameter. With n=20, it keeps the 20 hits in the blast output that follow the first ingroup hit. In your example, it will keep the next 20 hits after YP_009033831.1.

bshrestha0 commented 6 months ago

I have 16 eukaryote hits out of 302 total hits, and the rest are all bacterial. With n=20, shouldn't all 16 eukaryotes be included in the fasta file? Or by "next 20 hits", do you mean the 20 subsequent hits right after the first eukaryotic hit, regardless of ingroup or outgroup?

GDKO commented 6 months ago

Sorry for the confusion, the algorithm will keep the 20 subsequent hits right after the first ingroup hit, regardless of ingroup or outgroup.
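To make the rule concrete, here is a minimal Python sketch of the selection as described in this thread. It is not AvP's actual code: the function name `keep_hits`, the `is_ingroup` callback, and the toy hit IDs are all hypothetical, and it assumes that every hit ranked before the first ingroup hit is kept and that nothing is trimmed when no ingroup hit exists. Any other filters AvP applies (for example clustering) are ignored.

```python
from typing import Callable, List


def keep_hits(
    hits: List[str],
    is_ingroup: Callable[[str], bool],
    cutoff_extend: int = 20,
) -> List[str]:
    """Keep hits up to the first ingroup hit plus the next `cutoff_extend` hits.

    `hits` must be in blast rank order (best hit first). This is an
    illustrative reading of the cutoffextend rule, not AvP's implementation.
    """
    for i, hit in enumerate(hits):
        if is_ingroup(hit):
            # Slice end: first ingroup hit (index i) plus cutoff_extend more,
            # counted regardless of whether they are ingroup or outgroup.
            return hits[: i + 1 + cutoff_extend]
    # Assumption: with no ingroup hit there is nothing to extend from.
    return list(hits)


# Toy ranking: 5 bacterial hits, one eukaryote, 25 more bacterial hits,
# then a second eukaryote that ranks too low to fall inside the window.
hits = (
    [f"bact_{i}" for i in range(5)]
    + ["euk_A"]
    + [f"bact_{i}" for i in range(5, 30)]
    + ["euk_B"]
)


def is_euk(hit_id: str) -> bool:
    return hit_id.startswith("euk_")


kept_default = keep_hits(hits, is_euk, cutoff_extend=20)
kept_wider = keep_hits(hits, is_euk, cutoff_extend=30)
```

In this toy ranking, `euk_B` sits 26 positions after the first ingroup hit, so it is dropped with `cutoff_extend=20` and retained with `cutoff_extend=30`. That mirrors the situation above: lower-ranked eukaryotic hits are excluded when more than n bacterial hits separate them from the first ingroup hit.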

bshrestha0 commented 6 months ago

No worries, I will tweak the cutoffextend parameter then and see how it goes.

Thank you for helping me out!