dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

Writing `tmp_nonduplicates.#####` #102

Open BarryDigby opened 2 years ago

BarryDigby commented 2 years ago

Hi Tobias,

I've been running DCC on a dataset and have noticed that writing the tmp_nonduplicates.# files is taking an extremely long time. For context, here are the contents of _tmp_DCC/ after 23 hours of running:

total 1.1G
-rw-r--r-- 1 bdigby 373M Mar 24 14:53 fust1_1.Chimeric.out.junction.PLJSNR
-rw-r--r-- 1 bdigby  15M Mar 25 14:12 tmp_duplicates.8D5DA9
-rw-r--r-- 1 bdigby 248M Mar 24 14:53 tmp_merged
-rw-r--r-- 1 bdigby  78M Mar 25 14:47 tmp_nonduplicates.8D5DA9
-rw-r--r-- 1 bdigby 164M Mar 24 14:53 tmp_printcirclines.8D5DA9
-rw-r--r-- 1 bdigby 248M Mar 24 14:53 tmp_twochimera

The resources requested for this job are as follows:

#!/bin/bash
#SBATCH -D /data/bdigby/Projects/large_test_data/work/19/da9c1aa6ff81627fd501b664e03b81
#SBATCH -J nf-DCC_(fust1_1)
#SBATCH -o /data/bdigby/Projects/large_test_data/work/19/da9c1aa6ff81627fd501b664e03b81/.command.log
#SBATCH --no-requeue
#SBATCH -c 16
#SBATCH -t 72:00:00
#SBATCH --mem 112640M
#SBATCH -p highmem
# NEXTFLOW TASK: DCC (fust1_1)

Can you offer any insight into what might be limiting this step? That is, do you think increasing or reducing the available resources might speed up the process?

It would also be useful to get an idea of the final size of tmp_nonduplicates.#: will it be similar in size to tmp_printcirclines.#? Knowing this would help me choose an appropriate TimeLimit without relying purely on trial and error.
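To make the question concrete, here is a rough sketch of the extrapolation I have in mind. It rests on two unconfirmed assumptions: that tmp_nonduplicates.# grows roughly linearly over time, and that its final size is close to that of tmp_printcirclines.# (164M). The function name `estimate_total_hours` is just a hypothetical helper, not part of DCC.

```bash
#!/bin/bash
# Hypothetical helper (not part of DCC): linearly extrapolate the total
# runtime in hours from the current output size, the elapsed time, and an
# ASSUMED final size.
# Usage: estimate_total_hours CURRENT_MB ELAPSED_HOURS ASSUMED_FINAL_MB
estimate_total_hours() {
    local current_mb=$1 elapsed_h=$2 final_mb=$3
    # Integer arithmetic, rounded up so the estimate errs on the long side.
    echo $(( (elapsed_h * final_mb + current_mb - 1) / current_mb ))
}

# Values from the listing above: 78M written after 23 hours, with 164M
# (the size of tmp_printcirclines.#) as the assumed final size.
estimate_total_hours 78 23 164
```

With these numbers the estimate comes to 49 hours, which would still fit inside the 72-hour limit, but only if both assumptions hold.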


A further complication is that two of the six samples have stopped running but, bizarrely, did not report an exit code or an error. See the relevant line from the Nextflow log below:

Mar-25 12:47:22.254 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 6058404; id: 97; name: DCC (N2_1); status: RUNNING; exit: -; error: -; workDir: /data/bdigby/Projects/large_test_data/work/fd/6a0841a7f3d2471b4483b52d998f6e started: 1648130944677; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270018

I contacted the system administrator but he was not able to see any evidence of resources being exceeded (nextflow would have also reported this).
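One check that may help narrow this down is sketched below. It assumes that Nextflow's grid executor reads the task's exit status from a `.exitcode` file in the task work directory, and that the message above means the file did not appear within the timeout; I have not confirmed that this is what happened here. `report_exitcode` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: report whether a Nextflow task work directory
# contains a .exitcode file, and print its contents if it does.
report_exitcode() {
    local workdir=$1
    if [ -f "$workdir/.exitcode" ]; then
        echo "exit code: $(cat "$workdir/.exitcode")"
    else
        echo "no .exitcode file in $workdir"
    fi
}

# Work directory taken from the log line above.
report_exitcode /data/bdigby/Projects/large_test_data/work/fd/6a0841a7f3d2471b4483b52d998f6e
```

If the file is missing, the job probably ended without the wrapper script ever writing a status (for example, if it was killed from outside), rather than DCC itself returning an error.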

Any insights as to why this step might fail would be extremely useful.


N.B. The analysis is on WBcel235. Having used DCC multiple times on human datasets, I am surprised by this behaviour with a relatively small reference genome.

Thanks in advance,

Barry