ablab / rnaquast

Quality assessment of de novo transcriptome assemblies from RNA-Seq data
http://cab.spbu.ru/software/rnaquast

rnaQUAST job appears to be hanging with no progress for >36 hours #14

Open kalavattam opened 1 year ago

kalavattam commented 1 year ago

Hi, thank you for this great tool. My rnaQUAST job appears to be hanging, with no progress for >36 hours. Nothing has been written to the rnaQUAST database (Saccharomyces_cerevisiae.R64-1-1.108.db) in that time either. I am running rnaQUAST with 285 assembly fastas; please see the attached rnaQUAST.log for system and run details.

rnaQUAST.log

In short, I called rnaQUAST as follows:

rnaQUAST.py \
    -t "${SLURM_CPUS_ON_NODE}" \
    --labels ${n_GG[*]} \
    --transcripts ${f_GG[*]} \
    --reference "${p_ref}/${f_ref}" \
    --gtf "${p_gtf}/${f_gtf}" \
    --gmap_index "${p_gmap}/${d_gmap}" \
    --strand_specific \
    --left_reads "fastqs/merged_Q_IP_UTK_R1.fq.gz" \
    --right_reads "fastqs/merged_Q_IP_UTK_R3.fq.gz" \
    --output_dir "outfiles_rnaQUAST-test_Trinity-GG_Q-N/" \
    --busco_lineage "BUSCO/saccharomycetes_odb10.2020-08-05.tar.gz" \
    --gene_mark \
    --disable_infer_genes \
    --disable_infer_transcripts

...where "${SLURM_CPUS_ON_NODE}" is 32 and arrays ${n_GG[*]} and ${f_GG[*]} are composed of 285 elements each.
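As a side note on passing many assemblies at once, here is a minimal sketch of how bash arrays expand into multi-value options such as --labels and --transcripts. The array names and contents are made up for illustration, and the check that both arrays have the same length assumes one label per assembly.

```bash
#!/usr/bin/env bash
# Sketch: how bash arrays expand into multi-value command-line options.
# The array names and contents below are hypothetical.
set -euo pipefail

labels=("run_A" "run_B")
fastas=("assemblies/run A.fasta" "assemblies/run_B.fasta")

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Quoted [@] yields exactly one argument per element, even with spaces.
quoted=$(count_args "${fastas[@]}")
# Unquoted [*] is word-split, so "run A.fasta" becomes two arguments.
unquoted=$(count_args ${fastas[*]})

echo "elements: ${#fastas[@]}, quoted [@]: ${quoted}, unquoted [*]: ${unquoted}"

# Assumption: one label per assembly, so the two arrays should match in length.
if (( ${#labels[@]} != ${#fastas[@]} )); then
    echo "label/fasta count mismatch" >&2
    exit 1
fi
```

With labels and paths that contain no whitespace or glob characters, both forms produce the same arguments; the quoted [@] form is simply the safer habit.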

In checking the log, the job has been hanging at this step for >36 hours:

2023-02-18 17:55:33
Getting database coverage by reads...

Can you tell me if there is anything wrong with my invocation of rnaQUAST (e.g., perhaps I need to decompress the BUSCO database?), or is this lengthy amount of time with no apparent progress to be expected given the large number of assembly fastas?

Any feedback will be greatly appreciated. Thanks, Kris

kalavattam commented 1 year ago

I have killed the job, updated rnaQUAST from version 2.0.1 to 2.2.1, and started a new job (16 cores) with only one sample supplied to --transcripts, and without the --busco and --gene_mark options.

rnaQUAST.py \
    -t "${SLURM_CPUS_ON_NODE}" \
    --labels "${n_GG}" \
    --transcripts "${f_GG}" \
    --reference "${p_ref}/${f_ref}" \
    --gtf "${p_gtf}/${f_gtf}" \
    --gmap_index "${p_gmap}/${d_gmap}" \
    --strand_specific \
    --left_reads "${p_f_r1}" \
    --right_reads "${p_f_r3}" \
    --output_dir "${d_out}" \
    --disable_infer_genes \
    --disable_infer_transcripts

The job has been at Getting database coverage by reads... for the last 15 minutes. I will give it some more time and then report back.

andrewprzh commented 1 year ago

Dear @kalavattam

Thank you for your feedback!

If you provide reads, rnaQUAST maps them back to the assembly, which may take a lot of time, days in some cases. I think we may switch to a faster aligner in the future, e.g. minimap. Could you try running rnaQUAST without reads? The main quality metrics are calculated just by using the reference genome and the gene annotation.
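A reads-free run along these lines might look like the sketch below. It uses only options that already appear in this thread and leaves out --left_reads, --right_reads and -sam. The variable names and fallback values are placeholders (assumptions), and the command is only printed unless RUN=1 is set, so it can be inspected first.

```bash
#!/usr/bin/env bash
# Sketch: rnaQUAST without read data, so no read mapping is requested.
# All paths and defaults are placeholders; only flags shown in this thread are used.
set -euo pipefail

cmd=(
    rnaQUAST.py
    -t "${SLURM_CPUS_ON_NODE:-4}"
    --labels "${label:-sample}"
    --transcripts "${transcript:-Trinity-GG.fasta}"
    --reference "${reference:-genome.fasta}"
    --gtf "${gtf:-annotation.gtf}"
    --gmap_index "${gmap_index:-genome.fasta.gmap}"
    --strand_specific
    --output_dir "${dir_out:-outfiles_rnaQUAST_no-reads}"
    --disable_infer_genes
    --disable_infer_transcripts
)

# Dry run by default: print the command; set RUN=1 to execute it.
if [[ "${RUN:-0}" == "1" ]]; then
    "${cmd[@]}"
else
    printf '%s ' "${cmd[@]}"; echo
fi
```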

Best Andrey

kalavattam commented 1 year ago

Thank you for the advice, @andrewprzh! I was able to obtain the main quality metrics as you suggested. In addition, I tried a pilot experiment in which I called rnaQUAST.py with a single fasta supplied to --transcripts and a single SAM file (18 GB in size) supplied to -sam. The job ran for ~1.5 days before it was killed. In that time, nothing was written to rnaQUAST.log, and nothing was written to the experiment's output directory. Is this expected behavior? Should there be any logging, writing to disk, or anything else while Getting database coverage by reads... is running? (I just want to make sure there's nothing wrong with my install of rnaQUAST.)

How I called rnaQUAST.py:

rnaQUAST.py \
    -t "${SLURM_CPUS_ON_NODE}" \
    --labels "${label}" \
    --transcripts "${transcript}" \
    --reference "${reference}" \
    --gtf "${gtf}" \
    --gmap_index "${gmap_index}" \
    --strand_specific \
    -sam "${sam}" \
    --output_dir "${dir_out}" \
    --disable_infer_genes \
    --disable_infer_transcripts

Contents of rnaQUAST.log:

/home/kalavatt/miniconda3/envs/rnaquast_curr_env/share/rnaquast-2.2.1-0/rnaQUAST.py -t 16 --labels gg_mkc-16_mir-0.005_mg-1_gf-0.005 --transcripts ./Trinity_GG.Q_N/trinity-gg_mkc-16_mir-0.005_mg-1_gf-0.005.Trinity-GG.fasta --reference /home/kalavatt/genomes/sacCer3/Ensembl/108/DNA/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.chr-rename.fasta --gtf /home/kalavatt/genomes/sacCer3/Ensembl/108/gtf/Saccharomyces_cerevisiae.R64-1-1.108.gtf --gmap_index /home/kalavatt/genomes/sacCer3/Ensembl/108/DNA/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.chr-rename.fasta.gmap --strand_specific -sam bams/WT_Q_day7_N_merged.UTK_prim.sam --output_dir outfiles_rnaQUAST-test_Trinity-GG_Q-N_2022-0220_run-6 --disable_infer_genes --disable_infer_transcripts

rnaQUAST: 2.2.1

System information:
  OS: Linux-4.15.0-192-generic-x86_64-with-debian-buster-sid (linux_64)
  Python version: 3.7.12
  CPUs number: 24

External tools:
  matplotlib: 3.5.3
  joblib: 1.2.0
  gffutils: 0.11.1
  blastn: 2.13.0+
  makeblastdb: 2.13.0+
  gmap: 2017-11-15

Started: 2023-02-21 09:13:10

Logging to /fh/fast/tsukiyama_t/grp/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0218/outfiles_rnaQUAST-test_Trinity-GG_Q-N_2022-0220_run-6/logs/rnaQUAST.log

2023-02-21 09:13:10
Getting reference...
Done.
Using strand specific transcripts...

2023-02-21 09:13:11
Creating sqlite3 db by gffutils...
2023-02-21 09:13:14,608 - INFO - Committing changes: 41000 features
2023-02-21 09:13:14,692 - INFO - Populating features table and first-order relations: 41878 features
2023-02-21 09:13:14,700 - INFO - Creating relations(parent) index
2023-02-21 09:13:14,729 - INFO - Creating relations(child) index
2023-02-21 09:13:14,756 - INFO - Creating features(featuretype) index
2023-02-21 09:13:14,768 - INFO - Creating features (seqid, start, end) index
2023-02-21 09:13:14,786 - INFO - Creating features (seqid, start, end, strand) index
2023-02-21 09:13:14,806 - INFO - Running ANALYZE features
  saved to /fh/fast/tsukiyama_t/grp/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0218/outfiles_rnaQUAST-test_Trinity-GG_Q-N_2022-0220_run-6/Saccharomyces_cerevisiae.R64-1-1.108.db.

2023-02-21 09:13:14
Loading sqlite3 db by gffutils from /fh/fast/tsukiyama_t/grp/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0218/outfiles_rnaQUAST-test_Trinity-GG_Q-N_2022-0220_run-6/Saccharomyces_cerevisiae.R64-1-1.108.db to memory...
Done.

2023-02-21 09:13:15
Getting GENE DATABASE metrics...
Done.

Sets maximum intron size equal 2583. Default is 1500000 bp.

2023-02-21 09:13:50
Sorting exons attributes...
  Sorted in I.
  Sorted in II.
  Sorted in III.
  Sorted in IV.
  Sorted in V.
  Sorted in VI.
  Sorted in VII.
  Sorted in VIII.
  Sorted in IX.
  Sorted in X.
  Sorted in XI.
  Sorted in XII.
  Sorted in XIII.
  Sorted in XIV.
  Sorted in XV.
  Sorted in XVI.
  Sorted in Mito.
  Sorted in I.
  Sorted in II.
  Sorted in III.
  Sorted in IV.
  Sorted in V.
  Sorted in VI.
  Sorted in VII.
  Sorted in VIII.
  Sorted in IX.
  Sorted in X.
  Sorted in XI.
  Sorted in XII.
  Sorted in XIII.
  Sorted in XIV.
  Sorted in XV.
  Sorted in XVI.
  Sorted in Mito.
  Sorted in I.
  Sorted in II.
  Sorted in III.
  Sorted in IV.
  Sorted in V.
  Sorted in VI.
  Sorted in VII.
  Sorted in VIII.
  Sorted in IX.
  Sorted in X.
  Sorted in XI.
  Sorted in XII.
  Sorted in XIII.
  Sorted in XIV.
  Sorted in XV.
  Sorted in XVI.
  Sorted in Mito.
Done.

2023-02-21 09:13:51
Getting database coverage by reads...

Time at which the job was killed:

slurmstepd-gizmoj22: error: *** STEP 11207726.0 ON gizmoj22 CANCELLED AT 2023-02-23T20:41:31 DUE TO TIME LIMIT ***
srun: error: gizmoj22: task 0: Killed
srun: Force Terminated StepId=11207726.0

kalavattam commented 1 year ago

Also, one thing I noticed is that the number of CPUs reported in rnaQUAST.log is greater than the value I assign to -t; e.g., above, where -t is 16, CPUs number in rnaQUAST.log is 24. Is this abnormal?

andrewprzh commented 1 year ago

Dear @kalavattam

I think the log remains unchanged because rnaQUAST is waiting for the STAR aligner to finish. Do you see any STAR logs or SAM files in the output folder? In any case, 1.5 days looks like far too long; STAR is quite a fast aligner, I believe.
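One way to check for such files is sketched below. The file patterns are assumptions: "*.sam" for alignments and "*Log*.out" for STAR-style logs (STAR names its logs Log.out, Log.progress.out and Log.final.out, optionally with a prefix). Where rnaQUAST places them inside its output folder is not confirmed here, so the search is recursive. The function name is hypothetical, and -printf requires GNU find.

```bash
#!/usr/bin/env bash
# Sketch: list candidate aligner outputs under an rnaQUAST output directory.
# Patterns are assumptions: "*.sam" for alignments, "*Log*.out" for STAR-style logs.
set -euo pipefail

list_aligner_outputs() {
    local dir="$1"
    # Print modification time, size in bytes, and path, sorted by time.
    find "${dir}" -type f \( -name '*.sam' -o -name '*Log*.out' \) \
        -printf '%TY-%Tm-%Td %TH:%TM  %s  %p\n' | sort
}

# Run against the directory given as the first argument, if any.
if [[ -d "${1:-}" ]]; then
    list_aligner_outputs "${1}"
fi
```

Running it a few minutes apart and comparing sizes and modification times would show whether anything is still being written.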

CPUs number just reflects the total number of CPUs in the system.
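The difference between the two numbers can be seen from the shell with the small sketch below. It assumes GNU coreutils' nproc is available; SLURM_CPUS_ON_NODE is only set inside a Slurm job, so it falls back to "unset" elsewhere.

```bash
#!/usr/bin/env bash
# Sketch: total CPUs on the node versus CPUs this job may actually use.
set -euo pipefail

total=$(nproc --all)   # all processors installed on the machine
usable=$(nproc)        # processors available to the current process
requested="${SLURM_CPUS_ON_NODE:-unset}"  # set by Slurm inside a job

echo "total=${total} usable=${usable} SLURM_CPUS_ON_NODE=${requested}"
```

On a 24-CPU node with a 16-CPU allocation, the first value would correspond to the 24 shown in rnaQUAST.log and the last to the 16 passed to -t.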

Best Andrey

asan-emirsaleh commented 1 year ago

Hello @kalavattam @andrewprzh! I encountered the same issue: rnaQUAST has been stuck at the Getting database coverage by reads... stage for 5 days. I passed the reference transcriptome (i.e., a fasta of the predicted coding sequences within the genome, so there are no duplicated entries apart from paralogs) and reads from 9 libraries belonging to three experimental conditions for a closely related genotype. I think there is something wrong with rnaQUAST, too; I have never had it working before. I am now using a recent conda-packaged rnaQUAST version. From the log messages, it is not clear which aligner rnaQUAST uses at this stage. I ran rnaQUAST with 256 threads and 980 GB RAM. Is there a way to make rnaQUAST use previously prepared alignments?

asan-emirsaleh commented 1 year ago

Hello! The run was aborted due to the time limit after 12 days at Getting database coverage by reads... on 256 threads and 980 GB RAM. I assume there is a problem in rnaQUAST. I encountered the same issue 2 years ago, and @kalavattam has run into it as well. The issue has rarely been reported, but it has probably existed in the package for years.

andrewprzh commented 1 year ago

@asan-emirsaleh

Answered in another thread.

In brief, unfortunately the problem remains, as rnaQUAST is not in active development. I suggest not providing reads to rnaQUAST and using other quantification pipelines instead.

Best Andrey