alexdobin / STAR

RNA-seq aligner
MIT License
Premade STAR index download page is unavailable. #867

Closed mranjan1 closed 4 years ago

mranjan1 commented 4 years ago

I'm constantly getting a "Gateway time out" error when I try to access http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/

Is there anyone else having the same problem?

Is there any other online repository where I can download pre-built STAR indices from?

alexdobin commented 4 years ago

Hi Manish,

We have been having network problems at the lab for a few days, and our IT team is working to resolve them. For now there is nowhere else to download the files from. I think it's best to generate the indexes yourself, as I have not updated the pre-generated genomes for a long time.

Cheers Alex

mranjan1 commented 4 years ago

Thank you Alex. I had a 'minimum hardware requirement' issue since I am unable to access my HPCC - but I built the index on AWS for now.

Best, Manish

jolespin commented 2 years ago

  1. Is this genome/index still the preferred pre-built human STAR index?

  2. If you were to build this from the most recent version on NCBI (GCA_000001405.28_GRCh38.p13), would you just use the following files:

With this command?

STAR --runThreadN 4 --runMode genomeGenerate --genomeSAindexNbases 12 --genomeDir ./ --genomeFastaFiles ${GENOME} --sjdbOverhang 99 --sjdbGTFfile ${GTF} --limitGenomeGenerateRAM 15000000000 --genomeSAsparseD 3 --limitIObufferSize 50000000 --limitSjdbInsertNsj 383200

  3. Is the no_alt_analysis_set preferred over the primary assembly?

This UCSC thread mentions:

The no_alt_analysis_set is the one most likely to be relevant for most aligners. It removes alternate alleles. Most aligners cannot yet use alternate alleles.

Edit: I got this error trying to reproduce the index command in [2]:

EXITING because of FATAL input ERROR: --limitIObufferSize requires 2 numbers since 2.7.9a.
SOLUTION: specify 2 numbers in --limitIObufferSize : size of input and output buffers in bytes.

Jan 16 01:59:57 ...... FATAL ERROR, exiting

I'm running this version:

STAR --version
2.7.10a
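
The error above says that since 2.7.9a `--limitIObufferSize` takes two numbers, the sizes of the input and output buffers in bytes. A minimal sketch of the command from [2] adjusted accordingly, assuming the original 50000000 is simply reused for both buffers (that choice is an assumption, not advice from this thread). The function name and the file names are placeholders, and the script only prints the command so it can be reviewed before running:

```shell
# Sketch: print the genomeGenerate command from [2] with --limitIObufferSize
# given two values (input and output buffer sizes in bytes), as required
# since STAR 2.7.9a. Reusing 50000000 for both buffers is an assumption.
# star_index_cmd is a hypothetical helper name, not part of STAR.
star_index_cmd() {
  genome=$1   # path to the genome FASTA (placeholder)
  gtf=$2      # path to the annotation GTF (placeholder)
  echo "STAR --runThreadN 4 --runMode genomeGenerate" \
       "--genomeSAindexNbases 12 --genomeDir ./" \
       "--genomeFastaFiles $genome --sjdbOverhang 99 --sjdbGTFfile $gtf" \
       "--limitGenomeGenerateRAM 15000000000 --genomeSAsparseD 3" \
       "--limitIObufferSize 50000000 50000000 --limitSjdbInsertNsj 383200"
}

# Print the command for review before running it.
star_index_cmd genome.fa annotation.gtf
```
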
alexdobin commented 2 years ago

Hi Josh,

The pre-built indexes are not supported at the moment. It's best to build an index with the current STAR version and current annotations.

no_alt_analysis_set is indeed the right FASTA to use. I recommend using the "PRImary" FASTA and GTF from GENCODE:

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/GRCm39.primary_assembly.genome.fa.gz
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M28/gencode.vM28.primary_assembly.annotation.gtf.gz

Cheers Alex

jolespin commented 2 years ago

Thank you for the links. I'll find the human versions and get those running today.

Do you recommend any critical parameters to adjust besides --sjdbOverhang (read length minus 1)?

Edit: I'm using 151 bp long reads and this is the command I ended up using (current GENCODE version as of this post).

wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/latest_release/GRCh38.primary_assembly.genome.fa.gz
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/latest_release/gencode.v39.primary_assembly.annotation.gtf.gz

gzip -d *.gz

GENOME=GRCh38.primary_assembly.genome.fa
GTF=gencode.v39.primary_assembly.annotation.gtf

STAR --runThreadN 24 --runMode genomeGenerate --genomeSAindexNbases 12 --genomeDir . --genomeFastaFiles ${GENOME} --sjdbOverhang 150 --sjdbGTFfile ${GTF}
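
The `--sjdbOverhang 150` in the command above follows the "read length minus 1" rule for 151 bp reads. A tiny sketch of that arithmetic, in case the read length changes between runs (the helper name is hypothetical):

```shell
# Sketch: derive the --sjdbOverhang value as read length minus 1.
# sjdb_overhang is a hypothetical helper name, not part of STAR.
sjdb_overhang() {
  echo $(( $1 - 1 ))
}

# For 151 bp reads this prints 150.
sjdb_overhang 151
```
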
alexdobin commented 2 years ago

Hi Josh,

Your command looks good. There are no critical parameters, but here are some you may want to consider (from ENCODE):

--outFilterType                  BySJout    //reduces the number of "spurious" junctions
--outFilterMultimapNmax          20         //max number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped
--alignSJoverhangMin             8          //min overhang for unannotated junctions
--alignSJDBoverhangMin           1          //min overhang for annotated junctions
--outFilterMismatchNmax          999        //max number of mismatches per pair (absolute)
--outFilterMismatchNoverLmax     0.06       //max number of mismatches per pair relative to read length: for 2x100b, max number of mismatches is 0.06*200=12 for the paired read
--alignIntronMin                 20         //min intron
--alignIntronMax                 1000000    //max intron
--alignMatesGapMax               1000000    //max genomic distance between pairs

Cheers Alex
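
A rough sketch of how the ENCODE options listed above might be passed to a paired-end mapping run. The index directory, FASTQ file names, thread count, and the helper name are placeholders and not from this thread; the script only prints the command for review:

```shell
# Sketch: print a STAR mapping command that uses the ENCODE options above.
# The index directory and FASTQ names are placeholders; --readFilesCommand zcat
# assumes gzip-compressed FASTQ input.
# star_map_cmd is a hypothetical helper name, not part of STAR.
star_map_cmd() {
  index_dir=$1   # directory produced by --runMode genomeGenerate
  r1=$2          # mate 1 FASTQ
  r2=$3          # mate 2 FASTQ
  echo "STAR --runThreadN 8 --genomeDir $index_dir" \
       "--readFilesIn $r1 $r2 --readFilesCommand zcat" \
       "--outFilterType BySJout --outFilterMultimapNmax 20" \
       "--alignSJoverhangMin 8 --alignSJDBoverhangMin 1" \
       "--outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.06" \
       "--alignIntronMin 20 --alignIntronMax 1000000" \
       "--alignMatesGapMax 1000000"
}

# Print the command for review before running it.
star_map_cmd ./star_index reads_1.fastq.gz reads_2.fastq.gz
```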

annamariabugaj commented 2 years ago

I would like to download the pre-built human genome index, but I am not sure how to do this or what each of the files is. Could someone please explain to me how to download it from this website? https://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/Human/GRCh38_Ensembl99_sparseD3_sjdbOverhang99/

jolespin commented 2 years ago

IIRC you need most of that directory (or all of it). The index is a directory that holds the genome data STAR needs, so when you run STAR you provide the path to the directory you've downloaded. That directory is the genome index you use as a reference.
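
As a rough illustration of treating the index as a directory: the sketch below checks that a directory looks like a STAR index before it is passed to `--genomeDir`. It assumes an index contains at least files named `Genome`, `SA`, and `SAindex`; the helper name and the `./star_index` path are hypothetical.

```shell
# Sketch: check that a directory looks like a STAR genome index before using
# it as --genomeDir. Assumes the index holds files named Genome, SA and SAindex.
# looks_like_star_index is a hypothetical helper name, not part of STAR.
looks_like_star_index() {
  dir=$1
  for f in Genome SA SAindex; do
    [ -f "$dir/$f" ] || { echo "missing $dir/$f" >&2; return 1; }
  done
  return 0
}

# ./star_index is a placeholder path.
if looks_like_star_index ./star_index; then
  echo "index OK; pass it as: --genomeDir ./star_index"
fi
```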

annamariabugaj commented 2 years ago

Thank you! I am a bit confused with the download - should I use wget and the whole path?

alexdobin commented 2 years ago

Hi @BubuAalbu

At present I am not making premade indexes available. Please generate the index from the proper FASTA and GTF files.