alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 502 forks source link

std::out_of_range / vector::_M_range_check during genomeGenerate #84

Closed chandni177 closed 5 years ago

chandni177 commented 8 years ago

Hi there,

I've been using STAR for a while now, but I have some new GTF files from a collaborator which give this out of range error during the indexing step. Upon reading previous posts about similar errors, I made sure the chromosome names were the same, the gtf file sorted. I still get the same error. Not sure what could be going wrong.

I can send the error message or a part of the gtf file or my command - whichever helps to debug the issue.

Thanks! Chandni

STAR --runThreadN 32 --runMode genomeGenerate --sjdbOverhang 49 --genomeDir GenomeIndex_Try/ --genomeFastaFiles 'GRCm38.p4.genome.fa' --sjdbGTFfile 'testGTF.sorted.gtf'

alexdobin commented 8 years ago

Hi Chandi,

one of the likely reasons for this problem is the lack of "transcript_id" attribute in some of the exonic lines. If this is not the case, please send me a small part of the GTF that still causes the problem.

Cheers Alex

chandni177 commented 8 years ago

Yes, that's true. But I was assuming that just leads to a warning, not an error.

Here is what the gtf looks like -

chr1 RepeatMasker exon 100000467 100001474 . - . gene_id "LTR/ERVK|RLTR13D5|1|100000467|100001474"; chr1 RepeatMasker exon 100001476 100001763 . - . gene_id "LINE/L1|Lx9|1|100001476|100001763";

alexdobin commented 8 years ago

Hi Chandni,

STAR requires "transcript_id" in each "exon" line of the GTF file. Otherwise it would not know which transcript the exon belongs to, and won't be able to make junctions. I will replace this hard requirement with a warning in the next release. For now, please filter out these lines from the GTF file. Or, add the "transcript_id" to each of these lines, with the value equal to the gene_id - make sure that the exons that belong to the same transcript=gene do not overlap.

Cheers Alex

chandni177 commented 8 years ago

I see, I misunderstood then. Alright, I added the transcript ids to be the same as the gene ids, and it works fine now.

Thanks for helping out! Chandni

alexdobin commented 8 years ago

Hi Chandni,

not sure if this is still open, since I cannot find this message on GitHub?

Cheers Alex

From: Chandni Desai [mailto:notifications@github.com] Sent: Friday, November 13, 2015 4:10 PM To: alexdobin/STAR Cc: Dobin, Alexander Subject: Re: [STAR] std::out_of_range / vector::_M_range_check during genomeGenerate (#84)

Hi Alex,

I fixed the gtf and it worked fine, but I now have 2 GTF files that need to be a part of the Indexing step. Here is the command I use, and I get the same std::out_of_range error once again.

STAR --runThreadN 48 --runMode genomeGenerate --genomeDir GenomeIndex/ --genomeFastaFiles 'GRCm38.p4.genome.fa' --sjdbGTFfile <(cat GRCm38.81.chrAdded.gtf GRCm38.78_senseOnlyFiltered_transcriptID_added.gtf)> --sjdbOverhang 49

Also - this is how the 2 files look -

GRCm38.81.chrAdded.gtf

chr1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; chr1 havana transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; havana_transcript "OTTMUST00000127109"; havana_transcript_version "1"; tag "basic"; transcript_support_level "NA";

GRCm38.78_senseOnlyFiltered_transcriptID_added.gtf

chr1 RepeatMasker exon 3000001 3002128 . - . gene_id "LINE/L1|L1MdFanc_I|1|3000001|3002128"; transcript_id "LINE/L1|L1MdFanc_I|1|3000001|3002128"; chr1 RepeatMasker exon 3003148 3004054 . - . gene_id "LINE/L1|L1MdFanc_I|1|3003148|3004054"; transcript_id "LINE/L1|L1MdFanc_I|1|3003148|3004054";

Thanks for the help once again! Chandni

— Reply to this email directly or view it on GitHubhttps://github.com/alexdobin/STAR/issues/84#issuecomment-156557495.

chandni177 commented 8 years ago

Hi Alex,

I just concatenated the 2 files and it worked fine, so I deleted it from Github.

Thanks for reaching out though!

Chandni

On Wed, Nov 18, 2015 at 4:06 PM, alexdobin notifications@github.com wrote:

Hi Chandni,

not sure if this is still open, since I cannot find this message on GitHub?

Cheers Alex

From: Chandni Desai [mailto:notifications@github.com] Sent: Friday, November 13, 2015 4:10 PM To: alexdobin/STAR Cc: Dobin, Alexander Subject: Re: [STAR] std::out_of_range / vector::_M_range_check during genomeGenerate (#84)

Hi Alex,

I fixed the gtf and it worked fine, but I now have 2 GTF files that need to be a part of the Indexing step. Here is the command I use, and I get the same std::out_of_range error once again.

STAR --runThreadN 48 --runMode genomeGenerate --genomeDir GenomeIndex/ --genomeFastaFiles 'GRCm38.p4.genome.fa' --sjdbGTFfile <(cat GRCm38.81.chrAdded.gtf GRCm38.78_senseOnlyFiltered_transcriptID_added.gtf)> --sjdbOverhang 49

Also - this is how the 2 files look -

GRCm38.81.chrAdded.gtf

chr1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; chr1 havana transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; havana_transcript "OTTMUST00000127109"; havana_transcript_version "1"; tag "basic"; transcript_support_level "NA";

GRCm38.78_senseOnlyFiltered_transcriptID_added.gtf

chr1 RepeatMasker exon 3000001 3002128 . - . gene_id "LINE/L1|L1MdFanc_I|1|3000001|3002128"; transcript_id "LINE/L1|L1MdFanc_I|1|3000001|3002128"; chr1 RepeatMasker exon 3003148 3004054 . - . gene_id "LINE/L1|L1MdFanc_I|1|3003148|3004054"; transcript_id "LINE/L1|L1MdFanc_I|1|3003148|3004054";

Thanks for the help once again! Chandni

— Reply to this email directly or view it on GitHub< https://github.com/alexdobin/STAR/issues/84#issuecomment-156557495>.

— Reply to this email directly or view it on GitHub https://github.com/alexdobin/STAR/issues/84#issuecomment-157879894.

Roghaiyeh commented 6 years ago

Hello, I encounter the same terminated message when I try to make the inex. I tried with different species and in all there is a problem. I believe that it is because of the GTF file. because when I delet sjdbGTFfile $gtf --sjdbOverhang 99 it start to continue to make index. But ofcorse not gtf. is there some body to help me to figure out this problem. Thanks, Rugi

!/bin/bash Submission script for HMEM SBATCH --job-name=STAR SBA TCH --time=05:00:00 hh:mm:ss SBATCH --ntasks=1 SBATCH --mem-per-cpu=61440 60GB SBATCH --partition=Low --mail-type=ALL

module load STAR user=rsafari reference=/home/users/r/s/rsafari/human/Homo_sapiens.GRCh38.dna_s m.toplevel.fa.gz gtf=/home/users/r/s/rsafari/human/Homo_sapiens.GRCh38.93.gtf.gz indexOutput=/home/users/r/s/rsafari/human/

STAR --runThreadN 1 --runMode genomeGenerate --genomeDir $indexOu tput --genomeFastaFiles $reference --sjdbGTFfile $gtf --sjdbOverhang 99 Sep 10 12:01:36 ..... Started STAR run Sep 10 12:01:36 ... Starting to generate Genome files Sep 10 12:01:38 ... Starting GTF processing terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check

alexdobin commented 6 years ago

Hi @Roghaiyeh

you need to gunzip both the FASTA and GTF file - this is what causes the problem. Also, please use "primary assembly" FASTA, e.g. ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Cheers Alex

Roghaiyeh commented 6 years ago

Thank you Alex.

Cheers rugi

joshuagi commented 4 years ago

Hi Alex,

I am trying to construct a synthetic transcriptome from the gencode .gtf annotation & I'm having issues creating the STAR index. I'm using another tool to create the annotation, so I don't know if you'll be able to help me.. although based on the conversation above I am beginning to suspect that STAR needs to retrieve information in my .gtf that isn't there.

To create this annotation I am using the CreateSyntheticTranscripts() function of the easyRNAseq package. https://rdrr.io/bioc/easyRNASeq/src/R/easyRNASeq-synthetic-transcripts.R https://bioconductor.org/packages/release/bioc/vignettes/easyRNASeq/inst/doc/simpleRNASeq.html

I keep getting an error at the following step:

..... processing annotations GTF terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

As far as I know, the altered .gtf file I am using is in the correct format. To verify this, I have been using the import() function of the R library "rtracklayer", and it reads in correctly. I noticed that the collapsed annotation does not have a 'transcript_id' column, so I created one from the 'gene_id' column. However this did not solve the problem since all my exons have NA for the gene_id variable.

The gtf I've created has the following format. Do you have any suggestions on how I can make it compatible with STAR? The example below is for one gene locus.

chr1 easyRNASeq gene 11869 14409 . + . ID "ENSG00000223972.5"; gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level "2"; havana_gene "OTTHUMG00000000961.2"; chr1 easyRNASeq transcript 11869 14409 . + . ID "ENSG00000223972.5.0"; Parent "ENSG00000223972.5"; chr1 easyRNASeq exon 11869 12227 . + . ID "ENSG00000223972.5.exon.1"; Parent "ENSG00000223972.5.0"; Name "ENSG00000223972.5.exon.1"; chr1 easyRNASeq exon 12613 12721 . + . ID "ENSG00000223972.5.exon.2"; Parent "ENSG00000223972.5.0"; Name "ENSG00000223972.5.exon.2"; chr1 easyRNASeq exon 12975 13052 . + . ID "ENSG00000223972.5.exon.3"; Parent "ENSG00000223972.5.0"; Name "ENSG00000223972.5.exon.3"; chr1 easyRNASeq exon 13221 14409 . + . ID "ENSG00000223972.5.exon.4"; Parent "ENSG00000223972.5.0"; Name "ENSG00000223972.5.exon.4";

Cheers,

Josh

joshuagi commented 4 years ago

Hi,

I managed to figure it out after giving it some more thought.

The index successfully completed after I assigned entries to a newly created "transcript_id" variable. I did it by parsing out the gene name from the "ID" column:

in R: synth$'transcript_id' <- str_split(synth$ID, pattern = '[.]', simplify = TRUE)[,1]

Josh

alexdobin commented 4 years ago

Hi Josh

you figured it out - by default STAR requires transcript_id and gene_id attributes, which are standard for GTF. In principle, you could change that by specifying --sjdbGTFtagExonParentTranscript ID but it's probably better to have a fully conforming GTF as other tools may rely on it as well.

Cheers Alex

cmandreani commented 3 years ago

Hi Alex,

I'm willing to use STAR for bacterial genomes. I wanted to ask if this is strongly unadvised or if there is a way to manage the main challenges of mapping reads to prokaryotes. (I know there are specific tools for this purpose, but I'm a beginner in bioinformatics and it would significantly simplify the execution of the methodology as STAR is the only tool I've used and it has a very complete manual.)

I obtained the input data from antismash (GBK) and converted it to GFF3 and FASTA formats. In the third column of the GFF, naturally, there is no Exon feature. However, CDSs correspond to the genes annotated for each sm-BGC (biosynthetic gene cluster that code for specialized metabolites) which is the only feature of interest in my research. I tried to specify this when generating the index with:

STAR --runThreadN 30 --runMode genomeGenerate --genomeDir path_to_genome_directory --genomeFastaFiles path_to_fasta.fa --sjdbGTFfile path_to_GFF.gff --sjdbOverhang 149 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS

But it still appeared the following error message: Fatal INPUT FILE error, no exon lines in the GTF file (...)

Reading the messages above, I also realised that the ninth column of the GFF does not contain "transcript_id" or "gene_id". How can I overcome this? Is it enough to replace "ID" for "transcript_id"? Here a line of how it looks like:

ISL001_ctg1 GenBank CDS 8645 8890 . - 1 ID=ctg1_124; nRPS_PKS=Domain: PP-binding (3-72). E-value: 5.9e-16. Score: 50.6. Matches aSDomain: nrpspksdomains_ctg1_124_PP-binding.1, type: other; Name=ctg1_124; gene_functions=biosynthetic-additional (rule-based-clusters) PP-binding; gene_kind=biosynthetic-additional; sec_met_domain=PP-binding (E-value: 3.3e-15%2C bitscore: 50.6%2C seeds: 164%2C tool: rule-based-clusters); transl_table=11; translation=length.81

More troubling, the ID refers to the name of the gene but not to the genome where it came from: there might be i.e. ISL001_ctg1_124 and ISL002_ctg1_124 and both ID's will be the same. Does the first column account for this, without needing any more details in the ninth to properly map?

Cheers, Constanza

alexdobin commented 3 years ago

Hi Constanza,

the GTF file is only needed for 1). Annotated splice junctions 2). Quantifying gene expression with --quantMode GeneCounts

If you are not using either, you can safely omit the GTF file.

Cheers Alex

cmandreani commented 3 years ago

Hi Alex,

I am indeed using --quantMode GeneCounts. I will open another issue with my current problem. Thanks