broadinstitute / gtex-pipeline

GTEx & TOPMed data production and analysis pipelines
BSD 3-Clause "New" or "Revised" License

Building the indexes Issue (drive mapping issue maybe 【docker run "-v" ...】) #100

Closed yingsun-ucsd closed 5 months ago

yingsun-ucsd commented 5 months ago

I am building the indexes by following this, but got an error.

$ docker run --rm -v /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references:/data -t broadinstitute/gtex_rnaseq:V10     /bin/bash -c "STAR \
>         --runMode genomeGenerate \
>         --genomeDir /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references/star_index_oh75 \
>         --genomeFastaFiles /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
>         --sjdbGTFfile /nfs/lab/GTEx/GENCODE/gencode.v39.GRCh38.annotation.gtf \
>         --sjdbOverhang 75 \
>         --runThreadN 4"
    **STAR --runMode genomeGenerate --genomeDir /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references/star_index_oh75 --genomeFastaFiles /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta --sjdbGTFfile /nfs/lab/GTEx/GENCODE/gencode.v39.GRCh38.annotation.gtf --sjdbOverhang 75 --runThreadN 4**
    STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 26 22:16:44 ..... started STAR run
Jun 26 22:16:44 ... starting to generate Genome files

EXITING because of INPUT ERROR: could not open genomeFastaFile: /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta

Jun 26 22:16:44 ...... FATAL ERROR, exiting

However, the error did not make sense to me, because the file exists and is readable on the server:

$ head /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

When I ran the STAR command printed by the docker run

STAR --runMode genomeGenerate --genomeDir /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references/star_index_oh75 --genomeFastaFiles /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta --sjdbGTFfile /nfs/lab/GTEx/GENCODE/gencode.v39.GRCh38.annotation.gtf --sjdbOverhang 75 --runThreadN 4
Jun 26 15:10:34 ..... started STAR run
Jun 26 15:10:34 ... starting to generate Genome files
Jun 26 15:12:19 ... starting to sort Suffix Array. This may take a long time...
Jun 26 15:12:37 ... sorting Suffix Array chunks and saving them to disk...
...

directly on the server, it worked.

I am very new to docker and don't understand why the "docker run" did not work. Any help will be highly appreciated.

yingsun-ucsd commented 5 months ago

I get the same kind of error when building the RSEM index:

$ docker run --rm -v /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references:/data -t broadinstitute/gtex_rnaseq:V10 \
>     /bin/bash -c "rsem-prepare-reference \
>         /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
>         /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references/rsem_reference/rsem_reference \
>         --gtf /nfs/lab/GTEx/GENCODE/gencode.v39.GRCh38.annotation.gtf \
>         --num-threads 4"
rsem-extract-reference-transcripts /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references/rsem_reference/rsem_reference 0 /nfs/lab/GTEx/GENCODE/gencode.v39.GRCh38.annotation.gtf None 0 /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta
Cannot open /nfs/lab/GTEx/GENCODE/gencode.v39.GRCh38.annotation.gtf! It may not exist.
"rsem-extract-reference-transcripts /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references/rsem_reference/rsem_reference 0 /nfs/lab/GTEx/GENCODE/gencode.v39.GRCh38.annotation.gtf None 0 /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta" failed! Plase check if you provide correct parameters/options for the pipeline!

francois-a commented 5 months ago

You need to use the paths as they exist inside the docker environment, not the host paths. You're mapping the input to /data, so /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta should be /data/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta, etc.
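
To make the mapping concrete, here is a minimal sketch of how a single `-v HOST_DIR:/data` bind mount translates paths. The helper name `to_container_path` is hypothetical and only for illustration; it is not part of Docker or of this pipeline. It also shows that a file outside the mounted directory (such as the original FASTA under /nfs/lab/GTEx/references) has no path inside the container at all, so it would have to be copied into the mounted directory or mounted separately.

```shell
# Translate a host path into the path the container sees, given one
# bind mount "-v HOST_ROOT:CONTAINER_ROOT".
# to_container_path is a hypothetical helper, for illustration only.
to_container_path() {
    local host_root="${1%/}" container_root="${2%/}" path="$3"
    case "$path" in
        "$host_root")   printf '%s\n' "$container_root" ;;
        "$host_root"/*) printf '%s\n' "$container_root${path#"$host_root"}" ;;
        *)              printf 'not visible in container: %s\n' "$path" >&2
                        return 1 ;;
    esac
}

mounted=/nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references

# Under the mounted directory: maps to a path below /data.
to_container_path "$mounted" /data "$mounted/star_index_oh75"

# Outside the mounted directory: reported as not visible.
to_container_path "$mounted" /data \
    /nfs/lab/GTEx/references/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta || true
```

Every path passed to STAR or RSEM inside the container, inputs and outputs alike, has to be one that this kind of translation can produce.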

yingsun-ucsd commented 5 months ago

Thank you so much for your help, @francois-a!

$ pwd
/nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references
$ ls
Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta
gencode.v39.GRCh38.annotation.gtf
Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta.fai
rsem_reference
star_index_oh75

In this case, I mapped "/nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references" to "/data" and then ran the following, but I still got errors. Am I misunderstanding something here? Thanks.

$ docker run --rm -v /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium/references:/data -t broadinstitute/gtex_rnaseq:V10 \
>     /bin/bash -c "STAR \
>         --runMode genomeGenerate \
>         --genomeDir /data/star_index_oh75 \
>         --genomeFastaFiles /data/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
>         --sjdbGTFfile /data/gencode.v39.GRCh38.annotation.gtf \
>         --sjdbOverhang 75 \
>         --runThreadN 4"
    STAR --runMode genomeGenerate --genomeDir /data/star_index_oh75 --genomeFastaFiles /data/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta --sjdbGTFfile /data/gencode.v39.GRCh38.annotation.gtf --sjdbOverhang 75 --runThreadN 4
    STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 27 16:57:07 ..... started STAR run
!!!!! WARNING: Could not move Log.out file from ./Log.out into /data/star_index_oh75/Log.out. Will keep ./Log.out

Jun 27 16:57:07 ... starting to generate Genome files

EXITING because of INPUT ERROR: could not open genomeFastaFile: /data/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta

Jun 27 16:57:07 ...... FATAL ERROR, exiting

yingsun-ucsd commented 5 months ago

It looks like I have a general folder-mapping issue. For example:

$ pwd
/nfs/lab/ysun/Pankbase/GSE79469/fastq
$ ls
star_index_oh75     SRR1299319_1.fastq.gz     SRR1299319_2.fastq.gz
$ docker run --rm -v /nfs/lab/ysun/Pankbase/GSE79469/fastq:/data -t broadinstitute/gtex_rnaseq:V10 \
>     /bin/bash -c "/src/run_STAR.py \
>         /data/star_index_oh75 \
>         /data/SRR1299319_1.fastq.gz \
>         /data/SRR1299319_2.fastq.gz \
>         SRR1299319 \
>         --threads 4 \
>         --output_dir /tmp/star_out && mv /tmp/star_out /data/star_out"
    STAR --runMode alignReads --runThreadN 4 --genomeDir /data/star_index_oh75 --twopassMode Basic --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterType BySJout --outFilterScoreMinOverLread 0.33 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 --readFilesIn /data/SRR1299319_1.fastq.gz /data/SRR1299319_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix /tmp/star_out/SRR1299319. --outSAMstrandField intronMotif --outFilterIntronMotifs None --alignSoftClipAtReferenceEnds Yes --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM Unsorted --outSAMunmapped Within --genomeLoad NoSharedMemory --quantTranscriptomeSAMoutput BanSingleEnd_BanIndels_ExtendSoftclip --winAnchorMultimapNmax 50 --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType Junctions WithinBAM SoftClip --chimMainSegmentMultNmax 1 --chimOutJunctionFormat 0 --outSAMattributes NH HI AS nM NM ch --outSAMattrRGline ID:rg1 SM:sm1
    STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 27 18:41:34 ..... started STAR run
Jun 27 18:41:34 ..... loading genome

EXITING because of FATAL ERROR: could not open genome file /data/star_index_oh75//genomeParameters.txt
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permsissions

Jun 27 18:41:34 ...... FATAL ERROR, exiting
Traceback (most recent call last):
  File "/src/run_STAR.py", line 124, in <module>
    subprocess.check_call(cmd, shell=True, executable='/bin/bash')
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'STAR --runMode alignReads --runThreadN 4 --genomeDir /data/star_index_oh75 --twopassMode Basic --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterType BySJout --outFilterScoreMinOverLread 0.33 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 --readFilesIn /data/SRR1299319_1.fastq.gz /data/SRR1299319_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix /tmp/star_out/SRR1299319. --outSAMstrandField intronMotif --outFilterIntronMotifs None --alignSoftClipAtReferenceEnds Yes --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM Unsorted --outSAMunmapped Within --genomeLoad NoSharedMemory --quantTranscriptomeSAMoutput BanSingleEnd_BanIndels_ExtendSoftclip --winAnchorMultimapNmax 50 --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType Junctions WithinBAM SoftClip --chimMainSegmentMultNmax 1 --chimOutJunctionFormat 0 --outSAMattributes NH HI AS nM NM ch --outSAMattrRGline ID:rg1 SM:sm1' returned non-zero exit status 105.

I am new to docker and really need some help to understand what's going on here. Thanks!
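
Since the paths above are already container paths, one thing worth checking is whether the process can actually read and write the mounted directory. The sketch below rests on an assumption that nothing in the logs confirms directly: that the container process runs as a different user than the host shell (often root), and that the NFS export restricts that user. `check_access` is a hypothetical helper name.

```shell
# Report who the current process runs as and whether it can read and
# write a given path. check_access is a hypothetical helper name.
check_access() {
    local path="$1"
    printf 'uid=%s gid=%s path=%s\n' "$(id -u)" "$(id -g)" "$path"
    if [ -r "$path" ]; then echo "readable: yes"; else echo "readable: no"; fi
    if [ -w "$path" ]; then echo "writable: yes"; else echo "writable: no"; fi
}

# On the host, against the current directory (or a path given as $1):
check_access "${1:-.}"

# Inside the container, the same checks can be run against the mount, e.g.:
#   docker run --rm -v "$PWD":/data broadinstitute/gtex_rnaseq:V10 \
#       /bin/bash -c 'id; ls -ld /data; test -r /data/star_index_oh75 && echo readable'
```

If the uid printed inside the container differs from the host uid and the directory shows up as unreadable there, the problem is permissions rather than the path mapping.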

yingsun-ucsd commented 5 months ago

docker run -it --rm --user XXXX:XXXX -v /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium:/data --workdir /data -t broadinstitute/gtex_rnaseq:V10 \

Fixed it.
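
For anyone landing here later, a sketch of what the working invocation changes, under some assumptions: `--user` makes the container process run with the given uid:gid instead of the image default (the XXXX:XXXX above is left as posted; using the caller's own ids via `id -u` and `id -g` is my assumption), and `--workdir /data` makes relative outputs such as STAR's ./Log.out land inside the mounted directory. Why the default user could not read the NFS files is not stated in this thread; a restrictive NFS export is a plausible but unconfirmed cause. `run_in_container` is a hypothetical wrapper, and the `ls` command is only a placeholder for the real STAR or RSEM call.

```shell
# Run a command in the pipeline image as the calling host user.
# run_in_container is a hypothetical wrapper; with DRY_RUN set it only
# prints the docker command instead of starting a container.
run_in_container() {
    local host_dir="$1"
    shift
    ${DRY_RUN:+echo} docker run --rm \
        --user "$(id -u):$(id -g)" \
        -v "$host_dir":/data \
        --workdir /data \
        broadinstitute/gtex_rnaseq:V10 "$@"
}

# Dry run: show the command without starting a container.
# The parent directory is mounted, so the references sit under /data/references.
DRY_RUN=1 run_in_container /nfs/lab/ysun/RNA-seqPipeline4GTExConsortium \
    ls -l /data/references
```

Note that with the parent directory mounted, paths inside the container gain the extra component, for example /data/references/star_index_oh75.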