CCBR / CHARLIE

Circrnas in Host And viRuses anaLysis pIpEline for Detection Annotation Quantification of circRNAs
https://ccbr.github.io/CHARLIE/
MIT License

convert to platform-agnostic pipeline #99

Closed kelly-sovacool closed 2 months ago

kelly-sovacool commented 9 months ago

development in progress here: /data/CCBR_Pipeliner/Pipelines/CHARLIE/charlie-dev-sovacool

kelly-sovacool commented 8 months ago

test run command to modify

/data/Ziegelbauer_lab/Pipelines/circRNA/v0.10.1/charlie \
  -w=/data/Ziegelbauer_lab/circRNADetection/circRNA_daq_v0.10.x/samples_15 \
  -m=init \
  -g=hg38 \
  -v=NC_009333.1,KT899744.1,NC_006273.2 \
  -s /data/Ziegelbauer_lab/circRNADetection/circRNA_daq_v0.10.x/samples_15.tsv

kelly-sovacool commented 8 months ago

Created a new samples.tsv file with just 4 samples from Vishal's samples_15.tsv.

/data/Ziegelbauer_lab/Pipelines/circRNA/v0.10.1/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_v0.10.1 \
    -m=init -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/samples.tsv

Currently running on biowulf with latest release so we can compare outputs to the containerized version.

/data/Ziegelbauer_lab/Pipelines/circRNA/v0.10.1/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_v0.10.1 \
    -m=run

kelly-sovacool commented 7 months ago

Testing containerized version:

/data/Ziegelbauer_lab/Pipelines/circRNA/charlie-dev-sovacool/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev \
    -m=init -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/samples.tsv
/data/Ziegelbauer_lab/Pipelines/circRNA/charlie-dev-sovacool/charlie \
    -w=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev \
    -m=run -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/samples.tsv

kelly-sovacool commented 7 months ago

/usr/bin/bash: line 32: fastq-filter: command not found

Need to add fastq-filter to the cutadapt docker image.

Edit: fixed and renamed the container charlie_cutadapt_fqfilter
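
A "command not found" failure like this one only shows up once the rule actually runs inside the container. A minimal sketch of a pre-flight check that could be run inside an image before a full pipeline run; the helper name and the tool list are illustrative assumptions, not part of CHARLIE:

```python
import shutil


def missing_tools(required):
    """Return the names in `required` that cannot be found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


if __name__ == "__main__":
    # Illustrative list only; adjust to the tools a given container must provide.
    absent = missing_tools(["cutadapt", "fastq-filter"])
    if absent:
        print("missing from PATH: " + ", ".join(absent))
    else:
        print("all required tools found")
```

Running this with the container's own Python would report a missing executable before any rule is submitted to the cluster.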

kelly-sovacool commented 7 months ago

create_index failed due to missing output files

MissingOutputException in rule create_index in file /vf/users/Ziegelbauer_lab/Pipelines/circRNA/charlie-dev-sovacool/workflow/rules/create_index.smk, line 4:
Job 0 completed successfully, but some output files are missing. Missing files after 120 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/NCLscan_index/AllRef.ndx
Removing output files of failed job create_index since they might be corrupted:
/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.genes.genepred_w_geneid, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/STAR_no_GTF/SA, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.transcripts.fa, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.dummy.fa, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/separate_fastas/separate_fastas.lst
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
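
The message above refers to a latency wait: after a job finishes, the expected outputs get a grace period to appear before the job is declared failed. A rough sketch of that idea, useful for telling a slow filesystem apart from an output that is never written; this is an illustration under that assumption, not Snakemake's implementation, and the function name is made up:

```python
import time
from pathlib import Path


def wait_for_outputs(paths, timeout=120.0, interval=1.0):
    """Poll until every path exists or `timeout` seconds pass.

    Returns the list of paths that are still missing (empty on success).
    """
    deadline = time.monotonic() + timeout
    missing = [Path(p) for p in paths]
    while True:
        # Keep only the paths that have not shown up yet.
        missing = [p for p in missing if not p.exists()]
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(interval)
```

If a file such as AllRef.ndx is still reported missing after a generous timeout, the rule most likely never produced it, and raising --latency-wait would not help.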

kelly-sovacool commented 6 months ago

test on FRCE

/home/sovacoolkl/CHARLIE/charlie \
    -w=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/charlie_iss-99 \
    -m=init -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/samples.tsv
/home/sovacoolkl/CHARLIE/charlie \
    -w=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/charlie_iss-99 \
    -m=run -g=hg38 -v=NC_009333.1,KT899744.1,NC_006273.2 \
    -s=/scratch/cluster_scratch/sovacoolkl/charlie_dev_test/samples.tsv

kelly-sovacool commented 6 months ago

error in rule DCC

Activating singularity image /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/.snakemake/singularity/b688737477c8cf86b329e4227da72916.simg
+ '[' -d /lscratch/25273199 ']'
+ TMPDIR=/lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78
+ '[' '!' -d /lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78 ']'
+ mkdir -p /lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78
++ dirname /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/CircRNACount
+ cd /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC
+ '[' PE == PE ']'
+ DCC @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/samplesheet.txt \
    --temp /lscratch/25273199/09975c64-8e35-4c64-bd19-c0afbf581a78/DCC --threads 4 --detect --gene \
    --bam /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/STAR2p/G1_Normal_p2.bam \
    -ss \
    --annotation /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf \
    --chrM -G --rep_file /data/CCBR_Pipeliner/db/PipeDB/charlie/fastas_gtfs/hg38.repeats.gtf \
    --refseq /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fa \
    --PE-independent \
    -mt1 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate1.txt \
    -mt2 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate2.txt
[W::hts_idx_load3] The index file is older than the data file: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/STAR2p/G1_Normal_p2.bam.csi
Traceback (most recent call last):
  File "/usr/local/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.5.0', 'console_scripts', 'DCC')()
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/main.py", line 490, in main
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/main.py", line 679, in findCircSkipJunction
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
  File "/usr/local/lib/python3.8/dist-packages/DCC-0.5.0-py3.8.egg/DCC/Circ_nonCirc_Exon_Match.py", line 222, in getAdjacent
ValueError: invalid literal for int() with base 10: '3"'
[Tue Apr 30 00:44:26 2024]
Error in rule dcc:
    jobid: 0
    input: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/samplesheet.txt, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate1.txt, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/mate2.txt, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/STAR2p/G1_Normal_p2.bam, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf
    output: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/CircRNACount, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/CircCoordinates, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/LinearCount, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/G1_Normal.dcc.counts_table.tsv, /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Normal/DCC/G1_Normal.dcc.counts_table.tsv.filtered
    shell:

This worked with the previous CHARLIE release (/data/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_v0.10.1).

Checking for differences in input files for this rule between the two runs:

The bam files: samtools stat summaries are identical; the only differing line is the recorded command line.

samtools stat charlie_v0.10.1/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam > G1_Tumor_p2.bam.stat.old
samtools stat charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam > G1_Tumor_p2.bam.stat.new
diff G1_Tumor_p2.bam.stat.*
3c3
< # The command line was:  stat charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam
---
> # The command line was:  stat charlie_v0.10.1/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam
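
The same comparison can be done while ignoring the command-line comment, so that an empty result means the summaries really match. A small sketch, assuming both summaries have been read into lists of lines; the function name is hypothetical and the comparison is positional (line by line), not a full diff:

```python
from itertools import zip_longest


def differing_lines(old_lines, new_lines, ignore_prefix="# The command line was:"):
    """Return (old, new) pairs that differ, skipping lines that start with `ignore_prefix`."""
    old = [line for line in old_lines if not line.startswith(ignore_prefix)]
    new = [line for line in new_lines if not line.startswith(ignore_prefix)]
    # zip_longest pads the shorter list with None so extra lines are reported too.
    return [(a, b) for a, b in zip_longest(old, new) if a != b]
```

For example, calling it on the lines of G1_Tumor_p2.bam.stat.old and G1_Tumor_p2.bam.stat.new should return an empty list if the command line is the only difference.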

The gtf files are identical

md5sum charlie_dev/ref/ref.fixed.gtf charlie_v0.10.1/ref/ref.fixed.gtf
54dcc6005272fcda13e6c46c76ec9b3d  charlie_dev/ref/ref.fixed.gtf
54dcc6005272fcda13e6c46c76ec9b3d  charlie_v0.10.1/ref/ref.fixed.gtf
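
For scripting the same check (for instance over every reference file in both run directories), a minimal Python equivalent of the md5sum comparison might look like this; the function names are illustrative:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Usage would be something like same_content("charlie_dev/ref/ref.fixed.gtf", "charlie_v0.10.1/ref/ref.fixed.gtf").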

The Chimeric.out.junction files are all equal

library(tidyverse)

files <- tibble(dev = c('charlie_dev/results/G1_Tumor/STAR1p/G1_Tumor_p1.Chimeric.out.junction',
                        'charlie_dev/results/G1_Tumor/STAR1p/mate1/G1_Tumor_mate1.Chimeric.out.junction',
                        'charlie_dev/results/G1_Tumor/STAR1p/mate2/G1_Tumor_mate2.Chimeric.out.junction'),
                rel = c('charlie_v0.10.1/results/G1_Tumor/STAR1p/G1_Tumor_p1.Chimeric.out.junction',
                        'charlie_v0.10.1/results/G1_Tumor/STAR1p/mate1/G1_Tumor_mate1.Chimeric.out.junction',
                        'charlie_v0.10.1/results/G1_Tumor/STAR1p/mate2/G1_Tumor_mate2.Chimeric.out.junction'),)
files %>% pmap(\(dev, rel) all_equal(read_tsv(dev), read_tsv(rel)))
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

Checking the DCC and python versions in the conda env vs Docker:

release version used conda env: https://github.com/CCBR/CHARLIE/blob/e19cd66f319655ea5c5bd4ca4481f9fdfb88a4fd/workflow/rules/findcircrna.smk#L722-L723

now using docker: https://github.com/CCBR/CHARLIE/blob/fbdb6647ad2aafa13218845f864de0b8632f5fc2/docker/dcc/Dockerfile#L12-L16

Both use v0.5.0. According to the release notes, DCC 0.5.0 requires python 3.5 and no longer supports python 2.7.

After rebuilding the docker container to install DCC 0.5.0 from conda instead, the rule still fails with the same error as before:

Activating singularity image /data/CCBR_Pipeliner/SIFS/charlie_dcc_v0.1.0.sif
+ '[' -d /lscratch/25536525 ']'
+ TMPDIR=/lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d
+ '[' '!' -d /lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d ']'
+ mkdir -p /lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d
++ dirname /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/CircRNACount
+ cd /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC
+ '[' PE == PE ']'
+ DCC @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/samplesheet.txt \
    --temp /lscratch/25536525/8e9ea0a8-9ea7-406e-ab74-605db2e6e40d/DCC --threads 4 --detect --gene \
    --bam /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam \
    -ss \
    --annotation /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fixed.gtf \
    --chrM -G --rep_file /data/CCBR_Pipeliner/db/PipeDB/charlie/fastas_gtfs/hg38.repeats.gtf \
    --refseq /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/ref/ref.fa \
    --PE-independent \
    -mt1 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/mate1.txt \
    -mt2 @/vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/DCC/mate2.txt
[W::hts_idx_load3] The index file is older than the data file: /vf/users/Ziegelbauer_lab/circRNADetection/sovacoolkl_charlie/charlie_dev/results/G1_Tumor/STAR2p/G1_Tumor_p2.bam.csi
Traceback (most recent call last):
  File "/opt2/conda/envs/dcc/bin/DCC", line 10, in <module>
    sys.exit(main())
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/main.py", line 490, in main
    CircSkipfiles = findCircSkipJunction(output_coordinates, options.tmp_dir,
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/main.py", line 679, in findCircSkipJunction
    circStartAdjacentExons, circStartAdjacentExonsIv = CCEM.findcircAdjacent(circStartExons, Custom_exon_id2Iv,
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
    interval = Custom_exon_id2Iv[self.getAdjacent(ids, start=start)]
  File "/opt2/conda/envs/dcc/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 222, in getAdjacent
    exon_number = int(custom_exon_id.split(':')[1]) - 1
ValueError: invalid literal for int() with base 10: '1"'
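
The traceback shows int() being handed an exon number that still carries a trailing double quote ('1"'). A small sketch that reproduces the failure and shows one tolerant way to parse such an ID; the ID string used here is made up for illustration, and this is not necessarily how DCC itself handles it:

```python
def exon_index_strict(custom_exon_id):
    """Same expression as the failing line in the traceback."""
    return int(custom_exon_id.split(':')[1]) - 1


def exon_index_tolerant(custom_exon_id):
    """Strip stray double quotes around the exon number before converting it."""
    raw = custom_exon_id.split(':')[1]
    return int(raw.strip('"')) - 1


if __name__ == "__main__":
    example_id = 'gene_a:1"'  # hypothetical ID with a leftover quote
    try:
        exon_index_strict(example_id)
    except ValueError as err:
        print("strict parse failed:", err)
    print("tolerant parse:", exon_index_tolerant(example_id))
```

A leftover quote like this suggests the exon number is being read from a quoted attribute without the quotes being removed, so checking how the exon IDs are built from ref.fixed.gtf is a reasonable next step.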

On further inspection, it looks like the DCC conda env on biowulf was built with python 2.7: /data/CCBR_Pipeliner/db/PipeDB/Conda/envs/DCC/lib/python2.7/site-packages

kelly-sovacool commented 6 months ago

errors on FRCE:

sbatch: error: invalid partition specified: ccr
sbatch: error: Batch job submission failed: Invalid partition name specified
sbatch: error: Invalid generic resource (gres) specification
Error submitting jobscript (exit code 1):

Will need to edit cluster.json and submit_script.sbatch accordingly
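
One way to handle this without keeping two hand-edited copies is to derive the per-platform settings from a single template. A sketch under loud assumptions: the layout of cluster.json is assumed to be a mapping of rule name to sbatch settings with "partition" and "gres" keys, and the function name is hypothetical:

```python
import copy


def adapt_cluster_config(cluster, partition, keep_gres=False):
    """Return a copy of `cluster` with every rule's partition replaced.

    `cluster` is assumed (not confirmed from the repo) to map rule names to
    dicts of sbatch settings. The gres entry is dropped unless keep_gres is True.
    """
    adapted = copy.deepcopy(cluster)
    for settings in adapted.values():
        if "partition" in settings:
            settings["partition"] = partition
        if not keep_gres:
            settings.pop("gres", None)
    return adapted


if __name__ == "__main__":
    # Hypothetical template values, for illustration only.
    template = {"__default__": {"partition": "ccr", "gres": "lscratch:256", "threads": 2}}
    print(adapt_cluster_config(template, partition="norm"))
```

The input dict is left untouched, so the same template can be adapted for each cluster in turn.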

kelly-sovacool commented 6 months ago

Looks like the DCC devs are aware of the issue and fixed it in the master branch -- https://www.github.com/dieterich-lab/DCC/issues/103

Edited the docker container to use the dev version. It worked!

kelly-sovacool commented 6 months ago

First run-through on biowulf completed successfully after several bug fixes. Re-run from start to finish completed successfully on biowulf. Test in progress on frce.

kelly-sovacool commented 6 months ago

more problems on FRCE:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

need to reduce threads for FRCE, but I can't find how many are available per node on the norm partition: https://ncifrederick.cancer.gov/staff/frce/documentation/slurm-partitions-features

just switched jobs that requested 56 threads to 32 for FRCE and jobs are running now

edit: found the FRCE hardware config here: https://ncifrederick.cancer.gov/staff/frce/documentation/frce-hardware-capabilities
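
Rather than editing each rule by hand, the thread request can be capped against a per-platform limit. A minimal sketch; the function name is hypothetical, and 32 is simply the value used in the comment above, not a confirmed per-node maximum for FRCE:

```python
def cap_threads(requested, max_per_node):
    """Return the thread count to request, never exceeding the per-node limit."""
    if max_per_node < 1:
        raise ValueError("max_per_node must be at least 1")
    return min(requested, max_per_node)


if __name__ == "__main__":
    # 56 is the original request; 32 is the value that let jobs run on FRCE.
    print(cap_threads(56, 32))
```

Rules that already request fewer threads than the limit are left unchanged.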

kelly-sovacool commented 6 months ago

Currently running on FRCE with improved handling of config & cluster templates

kelly-sovacool commented 6 months ago

error on FRCE:

SystemExit in file /home/sovacoolkl/CHARLIE/workflow/rules/init.smk, line 20:
File: /mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa does not exists!
  File "/home/sovacoolkl/CHARLIE/workflow/Snakefile", line 19, in <module>
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 190, in <module>
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 29, in check_readaccess
  File "/home/sovacoolkl/CHARLIE/workflow/rules/init.smk", line 20, in check_existence

even though the file does exist 🤔

file /mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa

/mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa: ASCII text, with very long lines

Is /mnt not available on the compute nodes on FRCE?

Edit: this seems to be a FRCE regression -- tried to submit a RENEE job and that failed for the same reason

/var/spool/slurmd/job37856165/slurm_script: line 4: /mnt/projects/CCBR-Pipelines/pipelines/RENEE/renee-dev-sovacool/bin/renee: No such file or directory

Submitted a help ticket
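
To confirm whether a path is visible from a compute node, the same small diagnostic can be run on the login node and inside a batch job, and the two outputs compared. A sketch, with a hypothetical function name; it is not the check_existence or check_readaccess code from init.smk:

```python
import os
import socket


def describe_path(path):
    """Report which host ran the check and whether `path` exists and is readable there."""
    return {
        "host": socket.gethostname(),
        "path": path,
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
    }


if __name__ == "__main__":
    print(describe_path("/mnt/projects/CCBR-Pipelines/db/charlie/fastas_gtfs/hg38.fa"))
```

If "exists" is True on the login node but False inside the job, the mount is missing on the compute nodes, which points to the cluster rather than the pipeline.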

kelly-sovacool commented 6 months ago

upgraded snakemake in the shared conda env on FRCE to v7

conda activate /mnt/projects/CCBR-Pipelines/conda/envs/snakemake
mamba install -c bioconda snakemake=7.32.4

kelly-sovacool commented 5 months ago

On FRCE, the star_circrnafinder job hangs until it is cancelled by slurm, but it completes successfully in under 3 hours when run interactively.