alexdobin / STAR

RNA-seq aligner
MIT License

early termination 'std::out_of_range'/vector::_M_range_check when processing annotations GTF #1324

Closed cmandreani closed 3 years ago

cmandreani commented 3 years ago

Hi Alex,

I'm using STAR for mapping bacterial genomes.

I'm getting an error when generating the index with the following command:

STAR --runThreadN 30 --runMode genomeGenerate --genomeDir path_to_genome_dir --genomeFastaFiles path_to_fasta.fna --sjdbGTFfile path_to_GFF.gff --sjdbOverhang 149 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS --alignIntronMax 1 --genomeSAindexNbases 12

Genome length: 51,097,461 bp. Read length: 150 bp.

The error is the following:

terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
Aborted (core dumped)

After reading other issues (#84 and #548), I realised that the ninth column of my GFF does not contain "transcript_id" or "gene_id", and that the regions of interest are labelled "CDS" rather than "exon" in the third column. I specified both, following the instructions in the manual (4-May-2021). I am not sure whether, for GFF3, --sjdbGTFtagExonParentTranscript should be set to "Parent" or to "ID"; I tried both and neither changed the outcome. I also checked that the chromosome names are the same in the sequence and annotation files; they are, since I merged and converted the upstream antiSMASH output (GBK; n=1911) into GFF3 and FASTA formats in the same operation. Any ideas what might be terminating the process?

Here is an excerpt of the GFF:


ISL001_ctg1 | GenBank | cand_cluster | 1 | 21032 | . | + | 1 | ID=furan.cand_cluster;Alias=furan;Name=furan;candidate_cluster_number=1;contig_edge=False;detection_rules=(mmyO or AvrD);kind=single;product=furan;protoclusters=1;tool=antismash
ISL001_ctg1 | GenBank | DNA | 1 | 21032 | . | . | 1 | ID=ISL001_ctg1;Alias=ISL001_ctg1;Name=ISL001_ctg1;Note=contig1.,\n##antiSMASH-Data-START##\nVersion :: 5.1.2\nRun date :: 2020-08-04 00:32:29\nNOTE: This is a single cluster extracted from a larger record!\nOrig. start :: 102528\nOrig. end :: 123560\n##antiSMASH-Data-END##;comment1=\n##antiSMASH-Data-START##\nVersion :: 5.1.2\nRun date :: 2020-08-04 00:32:29\nNOTE: This is a single cluster extracted from a larger record!\nOrig. start :: 102528\nOrig. end :: 123560\n##antiSMASH-Data-END##;date=01-JAN-1980
ISL001_ctg1 | GenBank | protocluster | 1 | 21032 | . | + | 1 | ID=furan;Name=furan;aStool=rule-based-clusters;contig_edge=False;core_location=[112528:113560](-);cutoff=20000;detection_rule=(mmyO or AvrD);neighbourhood=10000;product=furan;protocluster_number=1;tool=antismash
ISL001_ctg1 | GenBank | region | 1 | 21032 | . | + | 1 | ID=furan.region;Alias=furan;Name=furan;candidate_cluster_numbers=1;contig_edge=False;product=furan;region_number=1;rules=(mmyO or AvrD);tool=antismash
ISL001_ctg1 | GenBank | CDS | 886 | 1209 | . | - | 1 | ID=ISL001_ctg1_116;Name=ISL001_ctg1_116;transl_table=11;translation=length.107
ISL001_ctg1 | GenBank | CDS | 1641 | 2252 | . | + | 1 | ID=ISL001_ctg1_117;Name=ISL001_ctg1_117;gene_functions=regulatory (smcogs) SMCOG1016:LuxR family DNA-binding response regulator (Score: 121.3%3B E-value: 5.4e-37);gene_kind=regulatory;transl_table=11;translation=length.203
ISL001_ctg1 | GenBank | CDS | 2506 | 3093 | . | + | 1 | ID=ISL001_ctg1_118;Name=ISL001_ctg1_118;gene_functions=regulatory (smcogs) SMCOG1032:RNA polymerase%2C sigma-24 subunit%2C ECF subfamily (Score: 99.6%3B E-value: 2.7e-30);gene_kind=regulatory;transl_table=11;translation=length.195
ISL001_ctg1 | GenBank | CDS | 3224 | 4726 | . | + | 1 | ID=ISL001_ctg1_119;Name=ISL001_ctg1_119;transl_table=11;translation=length.500
ISL001_ctg1 | GenBank | CDS | 4840 | 5964 | . | + | 1 | ID=ISL001_ctg1_120;Name=ISL001_ctg1_120;transl_table=11;translation=length.374
ISL001_ctg1 | GenBank | CDS | 6324 | 7301 | . | - | 1 | ID=ISL001_ctg1_121;nRPS_PKS=Domain: PKS_ER (23-318). E-value: 4.5e-51. Score: 165.5. Matches aSDomain: nrpspksdomains_ctg1_121_PKS_ER.1,type: other;Name=ISL001_ctg1_121;gene_functions=biosynthetic-additional (smcogs) SMCOG1028:crotonyl-CoA reductase / alcohol dehydrogenase (Score: 286.5%3B E-value: 4.4e-87);gene_kind=biosynthetic-additional;transl_table=11;translation=length.325
ISL001_ctg1 | GenBank | aSDomain | 6348 | 7232 | 165.5 | - | 1 | ID=aSDomain:PKS_ER;aSF=ER configuration inconclusive;Name=ctg1_121;aSDomain=PKS_ER;aSTool=nrps_pks_domains;database=nrpspksdomains.hmm;detection=hmmscan;domain_id=nrpspksdomains_ctg1_121_PKS_ER.1;evalue=4.50E-51;label=ctg1_121_PKS_ER.1;protein_end=318;protein_start=23;tool=antismash;translation=LKLIETDRPVPGPTEILVRVHAAGVNPTDWKTRARGVYVNGVRPPFRLGFDVSGVVEAVGAGVTVFAPGDEVFGMPRFPHPAGAYAEYVTGPARHFTLRPAGQDHIHTAALPLAALTAWQALVDTADIRPGQRVLVHAAAGGVGHLAVQIAKARGAYVIGTARTAKHDFLRGLGADELVDYTQQEFAEVIRDVDVVLDPVGGDCSIRSLRTLRPGGVLISLIPPDETFPAEQARAAGVRAVFMLVEPDQAGLREIAALVDSGQLRAEIAAAVPLEEAAKAHELGETGRTAGKIVLS
ISL001_ctg1 | GenBank | aSModule | 6348 | 7232 | . | + | 1 | ID=GenBank:aSModule:ISL001_ctg1:6348:7232;domains=nrpspksdomains_ctg1_121_PKS_ER.1;incomplete=_no_value;locus_tags=ctg1_121;tool=antismash;type=unknown
ISL001_ctg1 | GenBank | CDS_motif | 6654 | 6683 | -2.0 | - | 1 | ID=ctg1_121.CDS_motif;Alias=ctg1_121;Name=ctg1_121;aSTool=nrps_pks_domains;database=abmotifs;detection=hmmscan;domain_id=nrpspksmotif_ctg1_121_0003;evalue=4.80E+01;label=PKSI-ER_m2;protein_end=216;protein_start=206;tool=antismash;translation=length.11

Cheers, Constanza

cmandreani commented 3 years ago

I solved it :) --sjdbGTFtagExonParentTranscript had to be set to "Name" instead of "ID".
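
In case it helps anyone with a similar file: in the GFF excerpt above, none of the CDS rows carries a "Parent" attribute, only keys such as "ID" and "Name". Below is a minimal Python sketch for listing which column-9 attribute keys are present on every row of a given feature type, as a quick sanity check before choosing a value for --sjdbGTFtagExonParentTranscript. The function names are my own (hypothetical), it assumes a tab-delimited GFF3 with literal semicolons inside values escaped as %3B, and it says nothing about how STAR parses the file internally.

```python
from collections import Counter


def attribute_key_counts(gff_lines, feature_type="CDS"):
    """Return (row count, Counter of column-9 keys) for rows of feature_type."""
    counts = Counter()
    n_rows = 0
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only look at well-formed 9-column rows of the requested type.
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        n_rows += 1
        # Collect each key once per row, so counts are "rows with this key".
        keys = {
            field.split("=", 1)[0].strip()
            for field in cols[8].split(";")
            if "=" in field
        }
        counts.update(keys)
    return n_rows, counts


def keys_on_every_row(gff_lines, feature_type="CDS"):
    """Sorted attribute keys that appear on every row of feature_type."""
    n_rows, counts = attribute_key_counts(gff_lines, feature_type)
    if n_rows == 0:
        return []
    return sorted(key for key, c in counts.items() if c == n_rows)
```

To use it on a file, open the GFF and pass the file handle as `gff_lines`, with `feature_type` set to whatever is given to --sjdbGTFfeatureExon. If a candidate tag such as "Parent" is missing from the returned list, that is a hint to try one of the keys that is present.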