DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Output not getting produced by dsub #248

Closed arao0912 closed 2 years ago

arao0912 commented 2 years ago

I am an undergrad who is new to dsub and cloud bioinformatics. I am using STAR to map transcripts to the genome. The logs show that the mapping finished without errors, but there is no output in the bucket I specified. I have attached the dsub script and the relevant logs. Please let me know if I should add any other parameters.

   dsub \
    --provider xxx \
    --project xxx \
    --location xxx \
    --zones xxx \
    --preemptible \
    --min-ram xxx \
    --min-cores xxx \
    --logging gs://star-mapping-bucket/logging/ \
    --input-recursive GENOME_INDEX=gs://star-mapping-bucket/genome_index \
    --input R1=gs://star-mapping-bucket/input/RT1_R_1.fastq-002.gz \
    --input R2=gs://star-mapping-bucket/input/RT1_R_2.fastq-001.gz \
    --output-recursive OUTPUT=gs://star-mapping-bucket/output/RT1 \
    --image registry.gitlab.com/hylkedonker/rna-seq \
    --script step22.sh 

Step22 script

STAR \
    --runThreadN 24 \
    --runMode alignReads \
    --readFilesCommand gunzip -c \
    --genomeDir ${GENOME_INDEX} \
    --readFilesIn ${R1} ${R2} \
    --outFileNamePrefix ${OUTPUT}

Logs:

Oct 28 20:41:20 ..... finished mapping
Oct 28 20:41:23 ..... finished successfully
2022-10-28 20:41:25 INFO: Delocalizing OUTPUT
2022-10-28 20:41:25 INFO: gsutil  -mq rsync -r /mnt/data/output/gs/star-mapping-bucket/output/RT1/ gs://star-mapping-bucket/output/RT1/
carbocation commented 2 years ago

I would suggest doing some logging in your script to understand where the error is. For example, I would print out the ${OUTPUT} variable to make sure it looks directory-like. After STAR runs, I would see whether it actually put the files into the ${OUTPUT} folder.

Since the argument is called --outFileNamePrefix, one guess is that STAR is not placing the files into your ${OUTPUT} folder; it's just putting them in the working directory. If that's true, you might give the files a prefix (e.g., --outFileNamePrefix=PRE) and then after STAR runs, copy all PRE* files to ${OUTPUT}:

# Check that this folder path looks as expected
echo "${OUTPUT}"

# Does GENOME_INDEX look like you expect?
echo "${GENOME_INDEX}"

# Are its contents populated like you expect?
/bin/ls "${GENOME_INDEX}"

STAR \
    --runThreadN 24 \
    --runMode alignReads \
    --readFilesCommand gunzip -c \
    --genomeDir ${GENOME_INDEX} \
    --readFilesIn ${R1} ${R2} \
    --outFileNamePrefix PRE

# Did it create PRE* files?
/bin/ls PRE*

# This step assumes there are files in the current working directory with the prefix PRE:
cp PRE* "${OUTPUT}/"

This is just my guess, but the point is that with some logging I suspect you'll be able to solve this.
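One related thing worth ruling out: STAR names every output file literally as `<prefix><suffix>`, so if `${OUTPUT}` is a directory path without a trailing slash, the files land as *siblings* of that directory rather than inside it, and the recursive delocalization of the (empty) directory then copies nothing. The sketch below is purely illustrative and uses a local temp directory as a stand-in for the dsub-mounted output path; the `RT1` name and file names are just examples:

```shell
# Local stand-in (NOT the real dsub mount) showing the failure mode when
# the prefix lacks a trailing slash.
workdir=$(mktemp -d)
cd "$workdir"
mkdir RT1                        # stands in for the mounted ${OUTPUT} directory

# With a prefix of "RT1" (no slash), STAR would write "<prefix><suffix>":
: > RT1Log.final.out             # a sibling of RT1/, not a file inside it

ls -d RT1*                       # shows both RT1 and RT1Log.final.out
ls -A RT1                        # prints nothing: the directory itself is empty
```

If that matches what you see, appending a trailing slash to the prefix (or copying the files into `${OUTPUT}` afterwards, as above) should fix it.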

wnojopra commented 2 years ago

Hi @arao0912 !

I do recommend following carbocation's suggestions in the above comment. It's likely the best way to figure out the specifics of your problem. I would emphasize the point that you should confirm the output files exist in the location you expect.

Some other things that may help:

1. General docs for dsub I/O are here.
2. I happen to have an example of using STAR's `outFileNamePrefix` here. It's a bash script wrapped inside a WDL file. It looks like I found success by explicitly creating an output_dir, and then specifying that as the `outFileNamePrefix`.
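A minimal sketch of that approach, assuming the problem is that the prefix needs to be a directory path ending in `/` (the temp directory, `RT1` name, and file names below are local stand-ins, not the real dsub mount, and the `:` lines simulate files STAR itself would create):

```shell
# Create the output directory first, then pass it WITH a trailing slash
# as the prefix, so STAR's "<prefix><suffix>" naming lands files inside it.
OUTPUT=$(mktemp -d)/RT1          # stand-in for the dsub-mounted output path
mkdir -p "${OUTPUT}"

prefix="${OUTPUT}/"              # the trailing slash is the key detail

# Simulate the files STAR would write under this prefix:
: > "${prefix}Log.final.out"
: > "${prefix}Aligned.out.sam"

ls "${OUTPUT}"                   # both files are now inside the directory
```

With the files inside `${OUTPUT}`, the recursive delocalization step shown in the logs would have something to rsync to the bucket.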