Open aldendirks opened 8 months ago
It ended up working with just the mRNA lines of the GFF3 file. There was a -T1 appended to the IDs from Funannotate so I had to remove those then it worked. Thanks!
Hi,
get the same error:
$ ../../aux_scripts/hgt_local_score.py -f NANCO@999999009@3_mRNA.gff -a 2000bp_og_core.out_ai.out_NANCO -t fasttree_general_results_NANCO.txt -m 0
Traceback (most recent call last):
File "/scratch1/users/bheimbu/avp/oribatida/hgt_local_score/../../aux_scripts/hgt_local_score.py", line 169, in <module>
main()
File "/scratch1/users/bheimbu/avp/oribatida/hgt_local_score/../../aux_scripts/hgt_local_score.py", line 86, in main
gene_scaf = gene_location[gene]
~~~~~~~~~~~~~^^^^^^
KeyError: 'scaffold7841_size5308'
I have no clue how to fix this, names in all files are the same like scaffold7841_size5308
.
My gff file looks like this...
scaffold100001_size446 MetaEuk mRNA 133 432 134 - . ID=mRNA1;Parent=gene1
scaffold100004_size446 MetaEuk mRNA 6 446 140 + . ID=mRNA2;Parent=gene2
scaffold100006_size446 MetaEuk mRNA 191 442 74 - . ID=mRNA3;Parent=gene3
scaffold10000_size4538 MetaEuk mRNA 2369 3908 412 + . ID=mRNA4;Parent=gene4
scaffold100010_size446 MetaEuk mRNA 97 267 63 - . ID=mRNA5;Parent=gene5
scaffold100017_size446 MetaEuk mRNA 4 162 67 - . ID=mRNA6;Parent=gene6
scaffold100018_size446 MetaEuk mRNA 282 431 59 - . ID=mRNA7;Parent=gene7
scaffold10001_size4538 MetaEuk mRNA 3920 4036 54 - . ID=mRNA8;Parent=gene8
scaffold100023_size446 MetaEuk mRNA 2 160 78 + . ID=mRNA9;Parent=gene9
scaffold100029_size446 MetaEuk mRNA 3 386 94 - . ID=mRNA10;Parent=gene10
scaffold10002_size4538 MetaEuk mRNA 618 1145 143 + . ID=mRNA11;Parent=gene11
scaffold10002_size4538 MetaEuk mRNA 2501 3688 456 - . ID=mRNA12;Parent=gene12
fasttree_general_results.txt
like this...
NO 2000bp_og_core/mafftgroups/gp1.fa /scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp1.fa.fasttree scaffold7428_size4981
HGT 2000bp_og_core/mafftgroups/gp17.fa /scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp17.fa.fasttree scaffold7841_size5308
COMPLEX 2000bp_og_core/mafftgroups/gp33.fa /scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp33.fa.fasttree scaffold5080_size6865
and *out_ai.out
file like this...
query name donor ingroup AI HGTindex query hits number AHS outg_pct
scaffold7428_size4981 Uniref90|UniRef90_UPI00236585EA:16:65.6:0.0:930 Uniref90|UniRef90_A0AA38IBE4:2:73.1:0.0:1045 0.0 -115.0 500 214760.3805765278 93
scaffold15819_size3298 Uniref90|UniRef90_A0A918QQU7:11:65.7:0.0:937 Uniref90|UniRef90_A0AA38IBE4:2:72.6:0.0:1039 0.0 -102.0 500 216133.42592125913 93
scaffold7841_size5308 Uniref90|UniRef90_A0A919J7Y8:1:66.0:9.64e-45:154 Uniref90|UniRef90_A0A914DH25:302:56.6:2.39e-33:129 26.236393373249513 25.0 500 18459.662785971963 99
scaffold7428_size4981 Uniref90|UniRef90_UPI00236585EA:16:65.6:0.0:930 Uniref90|UniRef90_A0AA38IBE4:2:73.1:0.0:1045 0.0 -115.0 500 214718.58133127296 93
scaffold12196_size3834 Uniref90|UniRef90_A0A932I9T7:207:40.1:5.21e-100:317 Uniref90|UniRef90_UPI000719BA09:4:49.0:3.26e-123:380 -53.42830979924969 -63.0 500 -30065.837801365476 2
scaffold9970_size4547 :::: Uniref90|UniRef90_UPI0008114012:4:43.3:1.44e-138:420 -317.3920997195904 -420.0 500 -17870.776102456446 0
Cheers Bastian
Hi @bheimbu,
Your scaffold names are also your gene names. So there is some issue there with how you generated your input.
Cheers, Georgios
Hi @GDKO,
please can you check my attach files, if there are correct? However, I still get the same error (KeyError: 'A7LXT0_NANCO_999999009_3_scaffold7841_size5308'
), so obviously not.
2000bp_og_core.out_ai.out_NANCO.txt fasttree_general_results_NANCO.txt NANCO@999999009@3_mRNA.gff.txt
I didn't do the annotation, got the annotations from someone else.
Cheers Bastian
Hi @bheimbu,
Your gene names in 2000bp_og_core.out_ai.out_NANCO.txt
and fasttree_general_results_NANCO.txt
are the same. However, your gene names in the gff file are not the same as the ones you used for running AvP. In the gff file they are named mRNA1, mRNA2 ... etc
.
Cheers, Georgios
Hi @GDKO
that is, I should rerun the annotation or how to proceed?
To give you a bit more background info: I've used fDOG to search for orthologs of bacterial and fungal origin in oribatid genomes. The found orthologs were then used to search for HGTs with avp. That's why my names are so differently between gff and other files.
Cheers Bastian
Hi @bheimbu,
The hgt_local_score was designed to check the locality of the genes detected as potential HGTs from AvP in the genome assembly. The idea is that if a gene is an HGT then it will be in a region surrounded by non-HGT genes. There can be a case of course that a bigger region was inserted into the genome. So this score can identify regions which are either contamination in the genome assembly or HGT-rich regions.
I think you have misunderstood the purpose of this score. To run this program you need the genes and the gff of one assembly. For example, you have data from one oribatid genome. You run the AvP pipeline to identify potential HGT candidates from your gene set and then you run hgt_local_score to see in which regions in your genome those genes fall.
Cheers, Georgios
Hi @GDKO
I got the idea right, trust me. But I only want to know whether certain genes (found by fDOG) represent true HGTs or not. It's actually just a matter of how my data (i.e. gff files and output from fDOG) assigns names, right? So what I have to do is to manipulate my data to get it in the right format.
My question would be: Do you have an example data set to see how the files should look like?
Cheers Bastian
Your software has worked really well for me! I'm now trying to run the HGT local score script but am getting errors parsing the GFF3 file.
I wasn't sure if the GFF3 file should be the full annotations (gene, mRNA, exon, and CDS) or just the lines that are annotated as "gene" or just those annotated as "mRNA". I've tried all the different combos and it gives the same error. When I grep that ID it is found as a match in the file. My GFF3 has 9 columns in the same format specified so I'm not sure why it can't find the right line.