DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Output not getting produced by dsub #248

Closed arao0912 closed 2 years ago

arao0912 commented 2 years ago

I am an undergrad who is new to dsub and cloud bioinformatics. I am using STAR to map transcripts to the genome. The logs show that the mapping finished without errors, but there is no output in the bucket I specified. I have attached the dsub script and the relevant logs. Please let me know if I should add any other parameters.

   dsub \
    --provider xxx \
    --project xxx \
    --location xxx \
    --zones xxx \
    --preemptible \
    --min-ram xxx \
    --min-cores xxx \
    --logging gs://star-mapping-bucket/logging/ \
    --input-recursive GENOME_INDEX=gs://star-mapping-bucket/genome_index \
    --input R1=gs://star-mapping-bucket/input/RT1_R_1.fastq-002.gz \
    --input R2=gs://star-mapping-bucket/input/RT1_R_2.fastq-001.gz \
    --output-recursive OUTPUT=gs://star-mapping-bucket/output/RT1 \
    --image registry.gitlab.com/hylkedonker/rna-seq \
    --script step22.sh 

Step22 script

STAR \
    --runThreadN 24 \
    --runMode alignReads \
    --readFilesCommand gunzip -c \
    --genomeDir ${GENOME_INDEX} \
    --readFilesIn ${R1} ${R2} \
    --outFileNamePrefix ${OUTPUT}

Logs:

Oct 28 20:41:20 ..... finished mapping
Oct 28 20:41:23 ..... finished successfully
2022-10-28 20:41:25 INFO: Delocalizing OUTPUT
2022-10-28 20:41:25 INFO: gsutil  -mq rsync -r /mnt/data/output/gs/star-mapping-bucket/output/RT1/ gs://star-mapping-bucket/output/RT1/
carbocation commented 2 years ago

I would suggest doing some logging in your script to understand where the error is. For example, I would print out the ${OUTPUT} variable to make sure it looks directory-like. After STAR runs, I would see whether it actually put the files into the ${OUTPUT} folder.

Since the argument is called --outFileNamePrefix, one guess is that STAR is not placing the files into your ${OUTPUT} folder; it's just putting them in the working directory. If that's true, you might give the files a prefix (e.g., --outFileNamePrefix=PRE) and then after STAR runs, copy all PRE* files to ${OUTPUT}:

# Check that this folder path looks as expected
echo "${OUTPUT}"

# Does GENOME_INDEX look like you expect?
echo "${GENOME_INDEX}"

# Are its contents populated like you expect?
/bin/ls "${GENOME_INDEX}"

STAR \
    --runThreadN 24 \
    --runMode alignReads \
    --readFilesCommand gunzip -c \
    --genomeDir ${GENOME_INDEX} \
    --readFilesIn ${R1} ${R2} \
    --outFileNamePrefix PRE

# Did it create PRE* files?
/bin/ls PRE*

# This step assumes there are files in the current working directory with the prefix PRE:
cp PRE* "${OUTPUT}/"

This is just my guess, but the point is that with some logging I suspect you'll be able to solve this.
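One related thing worth ruling out: STAR names every output file literally as `<prefix><suffix>`, so if `${OUTPUT}` is a directory path without a trailing slash, the files land as *siblings* of that directory rather than inside it, and the recursive delocalization of the (empty) directory then copies nothing. The sketch below is purely illustrative and uses a local temp directory as a stand-in for the dsub-mounted output path; the `RT1` name and file names are just examples:

```shell
# Local stand-in (NOT the real dsub mount) showing the failure mode when
# the prefix lacks a trailing slash.
workdir=$(mktemp -d)
cd "$workdir"
mkdir RT1                        # stands in for the mounted ${OUTPUT} directory

# With a prefix of "RT1" (no slash), STAR would write "<prefix><suffix>":
: > RT1Log.final.out             # a sibling of RT1/, not a file inside it

ls -d RT1*                       # shows both RT1 and RT1Log.final.out
ls -A RT1                        # prints nothing: the directory itself is empty
```

If that matches what you see, appending a trailing slash to the prefix (or copying the files into `${OUTPUT}` afterwards, as above) should fix it.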

wnojopra commented 2 years ago

Hi @arao0912 !

I do recommend following carbocation's suggestions in the above comment. It's likely the best way to figure out the specifics of your problem. I would emphasize the point that you should confirm the output files exist in the location you expect.

Some other things that may help:

1. General docs for dsub I/O are here.
2. I happen to have an example of using STAR's `outFileNamePrefix` here. It's a bash script wrapped inside a WDL file. It looks like I found success by explicitly creating an output_dir, and then specifying that as the `outFileNamePrefix`.
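A minimal sketch of that approach, assuming the problem is that the prefix needs to be a directory path ending in `/` (the temp directory, `RT1` name, and file names below are local stand-ins, not the real dsub mount, and the `:` lines simulate files STAR itself would create):

```shell
# Create the output directory first, then pass it WITH a trailing slash
# as the prefix, so STAR's "<prefix><suffix>" naming lands files inside it.
OUTPUT=$(mktemp -d)/RT1          # stand-in for the dsub-mounted output path
mkdir -p "${OUTPUT}"

prefix="${OUTPUT}/"              # the trailing slash is the key detail

# Simulate the files STAR would write under this prefix:
: > "${prefix}Log.final.out"
: > "${prefix}Aligned.out.sam"

ls "${OUTPUT}"                   # both files are now inside the directory
```

With the files inside `${OUTPUT}`, the recursive delocalization step shown in the logs would have something to rsync to the bucket.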