cole-trapnell-lab / cufflinks

Issue loading reference annotation #98

Closed NBaileyNCL closed 6 years ago

NBaileyNCL commented 6 years ago

Hi all,

I wondered if you could give me some help with an issue I'm having using Cufflinks to assemble my aligned RNAseq data. I ran cufflinks with the command:

cufflinks -g TgalRNAseqAnalysis2/SequenceData/TgalA1GenomeAnnotation2.gtf TgalRNAseqAnalysis2/STARalignment/TgalSTARalignmentAligned.out.sam

and received an error when trying to load the reference annotation file:

Loading reference annotation. Segmentation fault (core dumped)

I tried two different methods of converting the annotations from my EMBL genome file, seqret and the Biopython GFF parser, but both gave the same error. The files are too big to send, but here is the head of the two GFF files I've tried:

from seqret:

##gff-version 2.0

##date 2017-12-08

##Type DNA NODE_10016

NODE_10016 EMBL source 1 345 . + . Sequence "NODE_10016.1" ; organism "Trichomonas gallinae" ; mol_type "genomic DNA"

##gff-version 2.0

##date 2017-12-08

##Type DNA NODE_10017

NODE_10017 EMBL source 1 343 . + . Sequence "NODE_10017.1" ; organism "Trichomonas gallinae" ; mol_type "genomic DNA"

##gff-version 2.0

##date 2017-12-08

##Type DNA NODE_1001

NODE_1001 EMBL source 1 616 . + . Sequence "NODE_1001.1" ; organism "Trichomonas gallinae" ; mol_type "genomic DNA"

##gff-version 2.0

##date 2017-12-08

##Type DNA NODE_10026

NODE_10026 EMBL source 1 2160 . + . Sequence "NODE_10026.1" ; organism "Trichomonas gallinae" ; mol_type "genomic DNA" NODE_10026 EMBL CDS 1 1351 . - . Sequence "NODE_10026.2" ; FeatFlags "0x1" ; locus_tag "TGA_000005000.1" ; product "hypothetical protein, conserved" ; gene "TGA_000005000" ; translation "MLPVLFTKAISMQFTSPKYFVKQTPLFEYPITNDEGKKFNLTKDGQGFFVGYRVHYDNGTLSPVQLMDKTTVNYEDIQIAFSSKKSGSQVNIQFRVTPRGYFPRKVDLGVFYVPNFNDKDNGPIVPVDEKADYNRGYIISSKDSYNYTLFLRNVGLYPNVDTIYIKDMGKASSTDPKFYPFFTNEINTRTSSKTVIAFSWLNQELSFDTPNIFEFTLAAGVVTNTPPRLFDLSNVNPGGHQPNEKITFQFKAVDFDQSDKITIKCRLRSSSARSEYTNFENSTTTTPQERNVILTISDYQIGPTGALGDCHAEDASGSSSNQIRITITTNQLAKLKITSDIKPKYYKDEKIHVKGTINNDDDGVVLYYRINNGKIENPQISEYTATSFDFKVKIDPRIPTRKNYTLYIWAEDEYGLLSEPVAIPFYLRAPENPQLLSVFSSDYAYERGK" ; transl_table 1 ; note "ortholog: TGA_000086100.1 TGA_000086100.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000117100.1 TGA_000117100.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000142300.1 TGA_000142300.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000308300.1 TGA_000308300.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000476800.1 TGA_000476800.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000516800.1 TGA_000516800.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000516900.1 TGA_000516900.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000747600.1 TGA_000747600.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_000781900.1 TGA_000781900.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001080200.1 TGA_001080200.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001121200.1 TGA_001121200.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001175500.1 TGA_001175500.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001300400.1 TGA_001300400.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001300500.1 TGA_001300500.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001474100.1 TGA_001474100.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001474200.1 TGA_001474200.1;program=OrthoMCL;rank=0" ; note "ortholog: 
TGA_001474300.1 TGA_001474300.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001484400.1 TGA_001484400.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001731300.1 TGA_001731300.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001791200.1 TGA_001791200.1;program=OrthoMCL;rank=0" ; note "ortholog: TGA_001885600.1 TGA_001885600.1;program=OrthoMCL;rank=0" ; note "colour: 10"

##gff-version 2.0

##date 2017-12-08

##Type DNA NODE_10027

from Biopython:

##gff-version 3

##sequence-region NODE_10016.1 1 345

NODE_10016.1 annotation remark 1 345 . . . accessions=NODE_10016;data_file_division=UNC;keywords=;molecule_type=genomic DNA;organism=Trichomonas gallinae;references=authors: Authors%0Atitle: Title%3B%0Ajournal: Unpublished.%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;topology=linear NODE_10016.1 feature source 1 345 . + . mol_type=genomic DNA;organism=Trichomonas gallinae

##sequence-region NODE_10017.1 1 343

NODE_10017.1 annotation remark 1 343 . . . accessions=NODE_10017;data_file_division=UNC;keywords=;molecule_type=genomic DNA;organism=Trichomonas gallinae;references=authors: Authors%0Atitle: Title%3B%0Ajournal: Unpublished.%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;topology=linear NODE_10017.1 feature source 1 343 . + . mol_type=genomic DNA;organism=Trichomonas gallinae

##sequence-region NODE_1001.1 1 616

NODE_1001.1 annotation remark 1 616 . . . accessions=NODE_1001;data_file_division=UNC;keywords=;molecule_type=genomic DNA;organism=Trichomonas gallinae;references=authors: Authors%0Atitle: Title%3B%0Ajournal: Unpublished.%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;topology=linear NODE_1001.1 feature source 1 616 . + . mol_type=genomic DNA;organism=Trichomonas gallinae

##sequence-region NODE_10026.1 1 2160

NODE_10026.1 annotation remark 1 2160 . . . accessions=NODE_10026;data_file_division=UNC;keywords=;molecule_type=genomic DNA;organism=Trichomonas gallinae;references=authors: Authors%0Atitle: Title%3B%0Ajournal: Unpublished.%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;topology=linear NODE_10026.1 feature source 1 2160 . + . mol_type=genomic DNA;organism=Trichomonas gallinae NODE_10026.1 feature CDS 1 1351 . - 0 colour=10;gene_id=TGA_000005000;transcript_id=TGA_000005000.1;ortholog=TGA_000086100.1 TGA_000086100.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000117100.1 TGA_000117100.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000142300.1 TGA_000142300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000308300.1 TGA_000308300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000476800.1 TGA_000476800.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000516800.1 TGA_000516800.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000516900.1 TGA_000516900.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000747600.1 TGA_000747600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000781900.1 TGA_000781900.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001080200.1 TGA_001080200.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001121200.1 TGA_001121200.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001175500.1 TGA_001175500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001300400.1 TGA_001300400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001300500.1 TGA_001300500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001474100.1 TGA_001474100.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001474200.1 TGA_001474200.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001474300.1 TGA_001474300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001484400.1 TGA_001484400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001731300.1 TGA_001731300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001791200.1 TGA_001791200.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001885600.1 TGA_001885600.1%3Bprogram%3DOrthoMCL%3Brank%3D0;product=hypothetical protein%2C 
conserved;transl_table=1;translation=MLPVLFTKAISMQFTSPKYFVKQTPLFEYPITNDEGKKFNLTKDGQGFFVGYRVHYDNGTLSPVQLMDKTTVNYEDIQIAFSSKKSGSQVNIQFRVTPRGYFPRKVDLGVFYVPNFNDKDNGPIVPVDEKADYNRGYIISSKDSYNYTLFLRNVGLYPNVDTIYIKDMGKASSTDPKFYPFFTNEINTRTSSKTVIAFSWLNQELSFDTPNIFEFTLAAGVVTNTPPRLFDLSNVNPGGHQPNEKITFQFKAVDFDQSDKITIKCRLRSSSARSEYTNFENSTTTTPQERNVILTISDYQIGPTGALGDCHAEDASGSSSNQIRITITTNQLAKLKITSDIKPKYYKDEKIHVKGTINNDDDGVVLYYRINNGKIENPQISEYTATSFDFKVKIDPRIPTRKNYTLYIWAEDEYGLLSEPVAIPFYLRAPENPQLLSVFSSDYAYERGK

##sequence-region NODE_10027.1 1 20991

NODE_10027.1 annotation remark 1 20991 . . . accessions=NODE_10027;data_file_division=UNC;keywords=;molecule_type=genomic DNA;organism=Trichomonas gallinae;references=authors: Authors%0Atitle: Title%3B%0Ajournal: Unpublished.%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;topology=linear NODE_10027.1 feature source 1 20991 . + . mol_type=genomic DNA;organism=Trichomonas gallinae NODE_10027.1 feature CDS 431 682 . + 0 colour=10;gene_id=TGA_000005100;transcript_id=TGA_000005100.1;ortholog=TGA_000068700.1 TGA_000068700.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000977300.1 TGA_000977300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001015900.1 TGA_001015900.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001140500.1 TGA_001140500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001140600.1 TGA_001140600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001258700.1 TGA_001258700.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_090750.1 TVAG_090750.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_090760.1 TVAG_090760.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_090770.1 TVAG_090770.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_281320.1 TVAG_281320.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_281330.1 TVAG_281330.1%3Bprogram%3DOrthoMCL%3Brank%3D0;product=hypothetical protein%2C conserved;transl_table=1;translation=MLIPEFEATLSTTRRVFTEKNIDHPFLDDGFSADAIAGTVIACISVVAIIAVCIWLFGFGEIMRCKKNKSEPGSGAKKEQDEM NODE_10027.1 feature CDS 1186 5865 . 
- 0 colour=10;gene_id=TGA_000005200;transcript_id=TGA_000005200.1;ortholog=TGA_000073900.1 TGA_000073900.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000180000.1 TGA_000180000.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000185600.1 TGA_000185600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000186400.1 TGA_000186400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000243600.1 TGA_000243600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000405600.1 TGA_000405600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000416500.1 TGA_000416500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000455800.1 TGA_000455800.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000500400.1 TGA_000500400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000521700.1 TGA_000521700.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000549400.1 TGA_000549400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000549500.1 TGA_000549500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000582100.1 TGA_000582100.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000599300.1 TGA_000599300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000665400.1 TGA_000665400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000672600.1 TGA_000672600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000822100.1 TGA_000822100.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000842300.1 TGA_000842300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000847800.1 TGA_000847800.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000870300.1 TGA_000870300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_000874800.1 TGA_000874800.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001005400.1 TGA_001005400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001148200.1 TGA_001148200.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001336100.1 TGA_001336100.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001635100.1 TGA_001635100.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001818200.1 TGA_001818200.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001908800.1 TGA_001908800.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001939200.1 TGA_001939200.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001982800.1 TGA_001982800.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_002203600.1 
TGA_002203600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_021840.1 TVAG_021840.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_036810.1 TVAG_036810.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_052920.1 TVAG_052920.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_052960.1 TVAG_052960.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_057580.1 TVAG_057580.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_057590.1 TVAG_057590.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_062940.1 TVAG_062940.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_072530.1 TVAG_072530.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_072540.1 TVAG_072540.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_082220.1 TVAG_082220.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_082230.1 TVAG_082230.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_092700.1 TVAG_092700.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_095300.1 TVAG_095300.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_098510.1 TVAG_098510.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_098520.1 TVAG_098520.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_098530.1 TVAG_098530.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_127600.1 TVAG_127600.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_127610.1 TVAG_127610.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_128850.1 TVAG_128850.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_138020.1 TVAG_138020.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_138040.1 TVAG_138040.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_145640.1 TVAG_145640.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_145650.1 TVAG_145650.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_155240.1 TVAG_155240.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_186390.1 TVAG_186390.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_194370.1 TVAG_194370.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_194440.1 TVAG_194440.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_219670.1 TVAG_219670.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_219680.1 TVAG_219680.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_228390.1 TVAG_228390.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_247080.1 TVAG_247080.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_281080.1 TVAG_281080.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_298210.1 
TVAG_298210.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_298220.1 TVAG_298220.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_298390.1 TVAG_298390.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_298400.1 TVAG_298400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_302910.1 TVAG_302910.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_303820.1 TVAG_303820.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_357550.1 TVAG_357550.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_373640.1 TVAG_373640.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_383500.1 TVAG_383500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_383510.1 TVAG_383510.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_406810.1 TVAG_406810.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_411720.1 TVAG_411720.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_451970.1 TVAG_451970.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_451980.1 TVAG_451980.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_453250.1 TVAG_453250.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_456550.1 TVAG_456550.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_459490.1 TVAG_459490.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_467090.1 TVAG_467090.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_490080.1 TVAG_490080.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_505880.1 TVAG_505880.1%3Bprogram%3DOrthoMCL%3Brank%3D0;product=adenylate cyclase type III%2C putative,adenylyl cyclase type V%2C putative,adenylate cyclase type VI%2C putative,retinal guanylate cyclase%2C putative,adenylate cyclase%2C putative,guanylate cyclase%2C putative,adenylate cyclase type IX%2C putative,adenylate and guanylate cyclases%2C putative,guanylate cyclase beta 1 subunit%2C putative,soluble guanylate cyclase gcy%2C putative,adenylate cyclase type%2C putative,atrial natriuretic peptide receptor%2C putative,adenylate cyclase%2C type VII%2C 
putative;transl_table=1;translation=MTTSSKGSSHNFTYSVENSGSKYGGLITEAPHKRYFRKLQNLLNYMDSMLPPLIEMHIVISILRIVQLFGLVTCTNIRRLFNSDSTIFKFFNYLSIVWNVIPIEYRDQSSFYIALIYFLLNFAFFMTINISSRIYQNNGTVPRPLMYLIHFYVSIVGYLIHVPCLEIAFETFGLVVTNNTTSFSLPLNIVGLILSVISFLFYYFFIRQVYSFTFVFRPTSLLCVHGMQQMGTFMVPAVTAGLVGFCSFLPKILQIIDLIVSAFLAASSYFTIYRTFSLISSFHSTAILTFSTYAPICFILYAILLGLDQKIPPAGIFVWIGGLVVDFVICHFITQRRIKKALEFLDRALDDPSVLDAEQNDSKLLVNQLIGFQFAHPVALEFTIFKVLTDREPYSAKYWCTYGKFVAIYPELMELHNIIVKNIVQKKIHGIAAKQCIMTSNLIFQQRESNLTPDLKKKLAKVARTVASAKRKLRHIWDQVIQGNLAEMDGAIANAYKGVEECESEFNLLLMQLPNNRFLYRSYATYLFEVGNDFEKYKEISDQTRLLSRGIRVMPDHTQDLGLLAFPNLPQILATATGLGTKVASEVSESFVFAETDLDDETITKKVEDNRQIMEMINTLKIPSLKTTIIIMGIITLVTCILGIICNSIRPIQNKHLYDILEYSLAASFIRHHAGLATSLSHLFILTKLGLMPNPNYTMKILGNEQDIQGMVKYSLSQIISHTRTLQDFRGKYMGNYYMDYTRMLLYDPTINMTFFYPDMRRQDHTMTSTYEGLQKMIIRLTEIADLNPNNITNFTLMTPSMRDPFVNIFDIVNQLAFATMNITGFMEERIENMDNIMKWTMIGGSLFFVLLFVVVLLIIWKKIAKDKITVYKTLLNLPKNVVSQVSESLRLVKDGSNTATDHSTNLMSTAKFNLDLERNRQEENLLKVLNAASDESAHQSELSIMSTILFFYFVYCIVCCVMLSLMYVSQGNQFNKDAQHSDNLMTISAFMSILTMQLNDIAGIFNLGVITDDLNLAKKFAFVDILRDNISDYYTALRFGYGKGAPYPDFTYLTNQSLVAAGCNKTKIPDNYSVAFACMTPDSIIRYAYTFVSSLIDNSVVSKVRNKGTFPVDPVRLPIANHLIQFRVIDDFLYPGTQIIGQLVKKGEKDAIKTYMAINILCIILIIFCFIAIVRITVLIERKLKFTLKLLLFCQAQLVLQNSFIVSVLNGNFGNGSEDTLTRDNDFYDTLVEEMPDAIVVTHGQQYTVTKMNKAANNLFKEDLMGQVFSNFIQSSAFKTETTQKLQKLFDAQTKSANCSVQYKNGDPDKEFLSISKIATLNNETTVFTIRNTTQMYLYNKLILEERAKSDKLLSSILPASLVPRVQAGESNISFSVPSATVCFIDIVEFTPWCGSNSAQVVTGTLNMIFSEFDTIVASLPFMERVKCIGDCYMCAGGVFAESPNVLQFARSTVDFGLQAIEAIGRVNEAANQNLKVRVGIHIGGPIVAGVIGIGKPTFEIFGPAISYAQQMEHNGVPMKVHISRAVYELIYGGNYKIEERGEIQIKQGKVVTYLVSL NODE_10027.1 feature CDS 6895 10506 . 
- 0 colour=10;gene_id=TGA_000005300;transcript_id=TGA_000005300.1;ortholog=TGA_000279400.1 TGA_000279400.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001129900.1 TGA_001129900.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_001950500.1 TGA_001950500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TGA_002170500.1 TGA_002170500.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_071830.1 TVAG_071830.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_146420.1 TVAG_146420.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_203850.1 TVAG_203850.1%3Bprogram%3DOrthoMCL%3Brank%3D0,TVAG_224590.1 TVAG_224590.1%3Bprogram%3DOrthoMCL%3Brank%3D0;product=hypothetical protein%2C conserved;transl_table=1;translation=MGLFTVAEYITMVFFRPSMFLMTSVLDVSTGATDLWIYCARSMTFASYIFLASYNTEIIELIIYLVYPIFVIYVILKRIAYGVNLAAFGSLLNDGPTFAAPFILLSHLHPCKYFPPILVLVIMLTVYYLILLILKLYLKKNATKIVLQKSTAPWYLPVTNALLLRYAAKTDGNIEEFENYMLIKLDQDPEAMIEIIRFLAIFKSQRNQLLMMLSVWKQPSLYYNYQFYLFKKIMSAGNELAPDQTIEIVDRLHRNYIVMMSMFWAAREKKNNFEAFICATKAATMHCEMYSQIKYTSFFYHSDPYIYNAYSEFMLIGLAEPMKSLIYRRYAATLKDNPGSITDPFFRRIAQYYPISSERFSSEFSSSNHSSSHTKSSTNISGKYTFSQLRFHIDNNTIYREESKDTSYIAMFVEKSKHFKTHGIVISFILTILYAVYYTNFTNIDQRKIADGVNNIKGIVSDTIRLYKRLTAGIYLKEIIDKSNISFDEECINSIKDIYVDIIGLSYIVKDEYNLYTSTLSFLSDYVSRSTCKELRNSSKLVMETLDSIKINYNYFHKNVSEVLKVDFNKNHFNLFLLYVVIAFIGSMVTFGVTFNLLRTALNDIPAEAVRFLASKERLSFLLLKKSLESYDLFKILFHEQPKEKPTSIKREKIPKIINKASTKSDYVPPLKSQHFSLKDAKLAISFLGTDEKYLKVSSFGLMNRFLGPTSPIPKNLNDHGDWSESASNSGDDYVLKENVEGMDIVSNTVDSVKKEFNYFGLVIFCLFLTPWLPAITIIHISTFVIDFQQKNNNIYLLKLKDDIDQFFTLPRELYINITVKNISERIPDKYTQKLSHYSKMMIPYIEEANKINSYSLTGWFAWSILAYELWILCLIAMYHMEKDLNLGFDSLFHFPRGYLDDLNKPKDAAPEEKLPSNILELVVSSNSRYINHISPNCEDLISISDIDIIGKKYDEVFEKEDENSNIRLFKINPHKTKKFVESSFECGRVERIALLDEMSGATYNNPINNILQKHIPQQMAKLFCNNNLRTYRPGEIFLIVASYDITEYNSRSDQIFLSAHNLIQFYSTVNIIKCDGSQIYFICVADDRNIPILFIRDFVASALPPKSSLLVKHASSPLLSIVIERTNFSADVEITSEPFISIDENKMKTITTSLFAAKPKTLFAIDDCLKISEYGVREQLKSKVNGVSVEFETIEKIVDELD

I suspected a memory issue, but I have tried this on multiple machines, including a 12-core, 50 GB computer and a powerful bioinformatics server at my facility, with the same result.

Any help would be greatly appreciated.

Nick

gpertea commented 6 years ago

What is the source of that annotation file? You only show the output of the seqret and Biopython attempts to "convert" the file, but the problem seems to be the non-standard format of the original file TgalA1GenomeAnnotation2.gtf, which you did not show. For example, what is the output of this command:

 fgrep TGA_000005100.1 TgalRNAseqAnalysis2/SequenceData/TgalA1GenomeAnnotation2.gtf 

The proper resolution would be to go back to the source and ask for the annotation to be produced in a proper format (GTF or GFF3). Barring that, some light scripting can probably salvage the necessary annotation data from that file (and discard all the sequence, translation, ortholog data, etc.). I could come up with some quick-and-dirty Perl code to do just that if you show me the original format, as I asked above.
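As a rough illustration of that kind of light scripting, here is a minimal sketch in Python (rather than Perl). It assumes the input is tab-separated GFF3 like the Biopython output quoted above, with gene_id and transcript_id present in the attribute column. The function names and the set of feature types to keep are hypothetical choices, not anything taken from Cufflinks or gffread.

```python
import sys

# Assumption: only these feature types are worth keeping; adjust as needed.
KEEP_TYPES = {"CDS", "exon"}


def parse_gff3_attributes(field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def slim_gff3_line(line):
    """Return a GTF-style line keeping only gene_id and transcript_id.

    Returns None for comments, directives, malformed lines, unwanted
    feature types, and records that lack either identifier.
    """
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] not in KEEP_TYPES:
        return None
    attrs = parse_gff3_attributes(cols[8])
    gene_id = attrs.get("gene_id")
    transcript_id = attrs.get("transcript_id")
    if not gene_id or not transcript_id:
        return None
    # Drop translation, ortholog, product, etc.; keep just the two IDs.
    cols[8] = 'gene_id "{}"; transcript_id "{}";'.format(gene_id, transcript_id)
    return "\t".join(cols)


def main(in_path, out_path):
    """Filter in_path line by line and write the slimmed records to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            slim = slim_gff3_line(line)
            if slim is not None:
                dst.write(slim + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a name of your choosing, it would be run with an input GFF3 path and an output path as its two arguments. Whether the resulting records are sufficient for a given tool still needs checking against that tool's documentation.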

NBaileyNCL commented 6 years ago

Thanks for the offer, gpertea. You're right: the issue was with the annotation file format. The source was an EMBL file (unpublished; I'm not sure whether it's in a database), but converting the annotations from GFF to GTF using gffread, and the SAM alignment to a sorted BAM file using samtools, solved the issue.
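For anyone landing here with the same error, the fix described above might look roughly like the sketch below. The exact commands used were not posted, so the options are assumptions: `gffread -T -o` to write GTF, and `samtools sort -O bam -o` to write a coordinate-sorted BAM. All file paths are supplied by the caller, and the helper names are hypothetical.

```python
import subprocess
import sys


def gffread_to_gtf_cmd(gff_path, gtf_path):
    """Build the gffread call: -T requests GTF output, -o names the output file."""
    return ["gffread", gff_path, "-T", "-o", gtf_path]


def samtools_sort_cmd(sam_path, bam_path):
    """Build the samtools call that coordinate-sorts a SAM file into a BAM file."""
    return ["samtools", "sort", "-O", "bam", "-o", bam_path, sam_path]


def cufflinks_cmd(gtf_path, bam_path):
    """Build the Cufflinks call with the converted annotation as the -g guide."""
    return ["cufflinks", "-g", gtf_path, bam_path]


def run_pipeline(gff_path, sam_path, gtf_path, bam_path):
    """Run the three steps in order, stopping at the first failure."""
    steps = [
        gffread_to_gtf_cmd(gff_path, gtf_path),
        samtools_sort_cmd(sam_path, bam_path),
        cufflinks_cmd(gtf_path, bam_path),
    ]
    for cmd in steps:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__" and len(sys.argv) == 5:
    run_pipeline(*sys.argv[1:5])
```

The command builders are kept separate from `run_pipeline` so the arguments can be inspected or adjusted before anything is executed.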