dib-lab / elvers

(formerly eelpond) an automated RNA-Seq workflow system
https://dib-lab.github.io/elvers/
Other
28 stars 3 forks source link

how should we generate the gene to transcript map? #79

Open ctb opened 5 years ago

ctb commented 5 years ago

@ljcohen has some code, maybe... :)

johnsolk commented 5 years ago
import pandas as pd
from dammit.fileio.gff3 import GFF3Parser
gff_file = "trinity.nema.fasta.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']]
new_file = names.dropna(axis=0,how='all')
new_file.head()
new_file.to_csv("nema_gene_name_id.csv")

https://github.com/dib-lab/dammit/issues/131

bluegenes commented 5 years ago

Now that dammit has our --no-rename, flag implemented, might be worth using that to stop dammit from renaming trinity transcripts, so that we maintain trinity "gene" and "transcript" designations.

Code above still required if we want to instead build gene-transcript relationships based on annotations.

ctb commented 5 years ago

(is new version of dammit up on bioconda?)

bluegenes commented 5 years ago

Not sure - I'm actually installing dammit from master https://github.com/dib-lab/eelpond/blob/master/rules/dammit/dammit-env.yaml

(so the feature is implemented). Camille fixed a downstream issue with my implementation in ~ dec, so I think it should be fully working now.

johnsolk commented 5 years ago

talking with @mandyfrazier wanting to revive discussion about incorporating a script to make the tx2gene file from dammit gff3 output. This requires making decisions about which annotation to pick per contig, e.g. sorting by e-value, length of annotation, choosing custom database annotations. Word cloud? (as suggested by @ctb)

Example below, one transcript which should be annotated COX5B. How to programmatically choose that?

Transcript_100014   HMMER   protein_hmm_match   76  273 3.300000e-17    .   .   ID=homology:234985;Name=COX5B;Target=COX5B 19 86 +;Note=Cytochrome c oxidase subunit Vb;accuracy=0.86;env_coords=28 285;Dbxref="Pfam:PF01215.15"
Transcript_100014   LAST    translated_nucleotide_match 92  367 3.200000e-60    -   .   ID=homology:390878;Name=ENSXMAP00000000371;Target=ENSXMAP00000000371 0 90 +;database=OrthoDB
Transcript_100014   LAST    translated_nucleotide_match 92  367 5.700000e-36    -   .   ID=homology:511790;Name=sp|P10606|COX5B_HUMAN;Target=sp|P10606|COX5B_HUMAN 0 93 +;database=sprot
Transcript_100014   shmlast.LAST    conditional_reciprocal_best_LAST    183 224 2.000000e-19    +   .   ID=homology:748371;Name=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 86 128 +;database=kfish2rae5g.pub.aa
Transcript_100014   shmlast.LAST    conditional_reciprocal_best_LAST    232 297 2.100000e-36    +   .   ID=homology:748366;Name=Funhe2EKm029180t6 aalen=116,66%,complete; clen=530; offs=138-488; oid=Funhe2Exx11m009875t6,Funhe2E6bm045748t2; organism=Fundulus_heteroclitus; type=protein; isoform=6; Name=Uncharacterized protein;;Target=Funhe2EKm029180t6 aalen=116,66%,complete; clen=530; offs=138-488; oid=Funhe2Exx11m009875t6,Funhe2E6bm045748t2; organism=Fundulus_heteroclitus; type=protein; isoform=6; Name=Uncharacterized protein; 45 111 +;database=kfish2rae5g.pub.aa
Transcript_100014   shmlast.LAST    conditional_reciprocal_best_LAST    24  118 7.500000e-56    +   .   ID=homology:748378;Name=Funhe2EKm029180t4 aalen=102,99%,partial; clen=309; offs=2-307; oid=Funhe2Exx11m009875t4,Fungr1EG3m027840t2; organism=Fundulus_heteroclitus; type=protein; isoform=4; Name=Cytochrome c oxidase subunit 5B, mitochondrial (73%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t4 aalen=102,99%,partial; clen=309; offs=2-307; oid=Funhe2Exx11m009875t4,Fungr1EG3m027840t2; organism=Fundulus_heteroclitus; type=protein; isoform=4; Name=Cytochrome c oxidase subunit 5B, mitochondrial (73%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 3 98 +;database=kfish2rae5g.pub.aa
Transcript_100014   shmlast.LAST    conditional_reciprocal_best_LAST    24  122 1.000000e-60    +   .   ID=homology:748362;Name=Funhe2EKm029180t3 aalen=131,99%,partial; clen=395; offs=3-395; oid=Funhe2Exx11m009875t3,Fungr1EG3m027840t1; organism=Fundulus_heteroclitus; type=protein; isoform=3; Name=Cytochrome c oxidase subunit 5B, mitochondrial (98%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t3 aalen=131,99%,partial; clen=395; offs=3-395; oid=Funhe2Exx11m009875t3,Fungr1EG3m027840t1; organism=Fundulus_heteroclitus; type=protein; isoform=3; Name=Cytochrome c oxidase subunit 5B, mitochondrial (98%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 0 99 +;database=kfish2rae5g.pub.aa
Transcript_100014   shmlast.LAST    conditional_reciprocal_best_LAST    31  122 2.300000e-56    +   .   ID=homology:1228514;Name=ENSFHEP00000007063.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811366.1:97972:103002:-1 gene:ENSFHEG00000000204.1 transcript:ENSFHET00000004439.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:cytochrome c oxidase subunit 5B, mitochondrial [Source:NCBI gene;Acc:105931138];Target=ENSFHEP00000007063.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811366.1:97972:103002:-1 gene:ENSFHEG00000000204.1 transcript:ENSFHET00000004439.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:cytochrome c oxidase subunit 5B, mitochondrial [Source:NCBI gene;Acc:105931138] 0 92 +;database=Fundulus_heteroclitus.Fundulus_heteroclitus-3.0.2.pep.all.fa
Transcript_100014   shmlast.LAST    conditional_reciprocal_best_LAST    31  122 7.500000e-56    +   .   ID=homology:748370;Name=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,;;Target=Funhe2EKm029180t1 aalen=128,75%,complete; clen=513; offs=34-420; oid=Funhe2Exx11m009875t2,Funhe2E6bm045748t1; organism=Fundulus_heteroclitus; type=protein; isoform=1; Name=Cytochrome c oxidase subunit 5B, mitochondrial (100%P); genegroup=FISH11G_G1808.s3; Dbxref=CDD:238464,TrEMBL:UniRef50_P10606,TrEMBL:COX5B_HUMAN,; 0 92 +;database=kfish2rae5g.pub.aa
Transcript_100014   transdecoder    CDS 632 928 .   -   0   ID=cds.Gene.220552::Transcript_100014::g.220552::m.220552;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014   transdecoder    exon    1   1019    .   -   .   ID=Gene.220552::Transcript_100014::g.220552::m.220552.exon1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014   transdecoder    five_prime_UTR  929 1019    .   -   .   ID=Gene.220552::Transcript_100014::g.220552::m.220552.utr5p1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
Transcript_100014   transdecoder    gene    1   1019    .   -   .   ID=Gene.220552::Transcript_100014::g.220552;Name=ORF%20type%3Acomplete%20len%3A99%20%28-%29
Transcript_100014   transdecoder    mRNA    1   1019    .   -   .   ID=Gene.220552::Transcript_100014::g.220552::m.220552;Parent=Gene.220552::Transcript_100014::g.220552;Name=ORF%20type%3Acomplete%20len%3A99%20%28-%29
Transcript_100014   transdecoder    three_prime_UTR 1   631 .   -   .   ID=Gene.220552::Transcript_100014::g.220552::m.220552.utr3p1;Parent=Gene.220552::Transcript_100014::g.220552::m.220552
johnsolk commented 5 years ago

Note that some custom amino acid fasta files might have semicolons. These need to be fixed before using in dammit because of this issue with dammit.