GDKO / AvP

Automatic evaluation of HGTs
GNU General Public License v3.0
18 stars 2 forks source link

Error parsing GFF3 file for AvP hgt local score analysis #13

Open aldendirks opened 8 months ago

aldendirks commented 8 months ago

Your software has worked really well for me! I'm now trying to run the HGT local score script but am getting errors parsing the GFF3 file.

Traceback (most recent call last):
  File "/nfs/turbo/lsa-tyjames/mycology/Alden/software/AvP/aux_scripts/hgt_local_score.py", line 169, in <module>
    main()
  File "/nfs/turbo/lsa-tyjames/mycology/Alden/software/AvP/aux_scripts/hgt_local_score.py", line 86, in main
    gene_scaf = gene_location[gene]
                ~~~~~~~~~~~~~^^^^^^
KeyError: 'Hydnotrya_cerebriformis_MICH67763_scaffold_3601_FUN_011721'

I wasn't sure if the GFF3 file should be the full annotations (gene, mRNA, exon, and CDS) or just the lines that are annotated as "gene" or just those annotated as "mRNA". I've tried all the different combos and it gives the same error. When I grep that ID it is found as a match in the file. My GFF3 has 9 columns in the same format specified so I'm not sure why it can't find the right line.

aldendirks commented 8 months ago

It ended up working with just the mRNA lines of the GFF3 file. There was a -T1 appended to the IDs from Funannotate so I had to remove those then it worked. Thanks!

bheimbu commented 3 months ago

Hi,

get the same error:

$ ../../aux_scripts/hgt_local_score.py  -f NANCO@999999009@3_mRNA.gff -a 2000bp_og_core.out_ai.out_NANCO -t fasttree_general_results_NANCO.txt -m 0
Traceback (most recent call last):
  File "/scratch1/users/bheimbu/avp/oribatida/hgt_local_score/../../aux_scripts/hgt_local_score.py", line 169, in <module>
    main()
  File "/scratch1/users/bheimbu/avp/oribatida/hgt_local_score/../../aux_scripts/hgt_local_score.py", line 86, in main
    gene_scaf = gene_location[gene]
                ~~~~~~~~~~~~~^^^^^^
KeyError: 'scaffold7841_size5308'

I have no clue how to fix this, names in all files are the same like scaffold7841_size5308.

My gff file looks like this...

scaffold100001_size446  MetaEuk mRNA    133 432 134 -   .   ID=mRNA1;Parent=gene1
scaffold100004_size446  MetaEuk mRNA    6   446 140 +   .   ID=mRNA2;Parent=gene2
scaffold100006_size446  MetaEuk mRNA    191 442 74  -   .   ID=mRNA3;Parent=gene3
scaffold10000_size4538  MetaEuk mRNA    2369    3908    412 +   .   ID=mRNA4;Parent=gene4
scaffold100010_size446  MetaEuk mRNA    97  267 63  -   .   ID=mRNA5;Parent=gene5
scaffold100017_size446  MetaEuk mRNA    4   162 67  -   .   ID=mRNA6;Parent=gene6
scaffold100018_size446  MetaEuk mRNA    282 431 59  -   .   ID=mRNA7;Parent=gene7
scaffold10001_size4538  MetaEuk mRNA    3920    4036    54  -   .   ID=mRNA8;Parent=gene8
scaffold100023_size446  MetaEuk mRNA    2   160 78  +   .   ID=mRNA9;Parent=gene9
scaffold100029_size446  MetaEuk mRNA    3   386 94  -   .   ID=mRNA10;Parent=gene10
scaffold10002_size4538  MetaEuk mRNA    618 1145    143 +   .   ID=mRNA11;Parent=gene11
scaffold10002_size4538  MetaEuk mRNA    2501    3688    456 -   .   ID=mRNA12;Parent=gene12

fasttree_general_results.txt like this...

NO  2000bp_og_core/mafftgroups/gp1.fa   /scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp1.fa.fasttree   scaffold7428_size4981
HGT 2000bp_og_core/mafftgroups/gp17.fa  /scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp17.fa.fasttree  scaffold7841_size5308
COMPLEX 2000bp_og_core/mafftgroups/gp33.fa  /scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp33.fa.fasttree  scaffold5080_size6865

and *out_ai.out file like this...

query name  donor   ingroup AI  HGTindex    query hits number   AHS outg_pct
scaffold7428_size4981   Uniref90|UniRef90_UPI00236585EA:16:65.6:0.0:930 Uniref90|UniRef90_A0AA38IBE4:2:73.1:0.0:1045    0.0 -115.0  500 214760.3805765278   93
scaffold15819_size3298  Uniref90|UniRef90_A0A918QQU7:11:65.7:0.0:937    Uniref90|UniRef90_A0AA38IBE4:2:72.6:0.0:1039    0.0 -102.0  500 216133.42592125913  93
scaffold7841_size5308   Uniref90|UniRef90_A0A919J7Y8:1:66.0:9.64e-45:154    Uniref90|UniRef90_A0A914DH25:302:56.6:2.39e-33:129  26.236393373249513  25.0    500 18459.662785971963  99
scaffold7428_size4981   Uniref90|UniRef90_UPI00236585EA:16:65.6:0.0:930 Uniref90|UniRef90_A0AA38IBE4:2:73.1:0.0:1045    0.0 -115.0  500 214718.58133127296  93
scaffold12196_size3834  Uniref90|UniRef90_A0A932I9T7:207:40.1:5.21e-100:317 Uniref90|UniRef90_UPI000719BA09:4:49.0:3.26e-123:380    -53.42830979924969  -63.0   500 -30065.837801365476 2
scaffold9970_size4547   ::::    Uniref90|UniRef90_UPI0008114012:4:43.3:1.44e-138:420    -317.3920997195904  -420.0  500 -17870.776102456446 0

Cheers Bastian

GDKO commented 3 months ago

Hi @bheimbu,

Your scaffold names are also your gene names. So there is some issue there with how you generated your input.

Cheers, Georgios

bheimbu commented 3 months ago

Hi @GDKO,

please can you check my attach files, if there are correct? However, I still get the same error (KeyError: 'A7LXT0_NANCO_999999009_3_scaffold7841_size5308' ), so obviously not.

2000bp_og_core.out_ai.out_NANCO.txt fasttree_general_results_NANCO.txt NANCO@999999009@3_mRNA.gff.txt

I didn't do the annotation, got the annotations from someone else.

Cheers Bastian

GDKO commented 3 months ago

Hi @bheimbu,

Your gene names in 2000bp_og_core.out_ai.out_NANCO.txt and fasttree_general_results_NANCO.txt are the same. However, your gene names in the gff file are not the same as the ones you used for running AvP. In the gff file they are named mRNA1, mRNA2 ... etc.

Cheers, Georgios

bheimbu commented 3 months ago

Hi @GDKO

that is, I should rerun the annotation or how to proceed?

To give you a bit more background info: I've used fDOG to search for orthologs of bacterial and fungal origin in oribatid genomes. The found orthologs were then used to search for HGTs with avp. That's why my names are so differently between gff and other files.

Cheers Bastian

GDKO commented 2 months ago

Hi @bheimbu,

The hgt_local_score was designed to check the locality of the genes detected as potential HGTs from AvP in the genome assembly. The idea is that if a gene is an HGT then it will be in a region surrounded by non-HGT genes. There can be a case of course that a bigger region was inserted into the genome. So this score can identify regions which are either contamination in the genome assembly or HGT-rich regions.

I think you have misunderstood the purpose of this score. To run this program you need the genes and the gff of one assembly. For example, you have data from one oribatid genome. You run the AvP pipeline to identify potential HGT candidates from your gene set and then you run hgt_local_score to see in which regions in your genome those genes fall.

Cheers, Georgios

bheimbu commented 2 months ago

Hi @GDKO

I got the idea right, trust me. But I only want to know whether certain genes (found by fDOG) represent true HGTs or not. It's actually just a matter of how my data (i.e. gff files and output from fDOG) assigns names, right? So what I have to do is to manipulate my data to get it in the right format.

My question would be: Do you have an example data set to see how the files should look like?

Cheers Bastian

GDKO commented 2 months ago

See here https://github.com/GDKO/AvP/wiki/tutorial