KCCG / ClinSV

Robust detection of clinically relevant structural and copy number variation from whole genome sequencing data

ClinSV script stops at the first step # Create sample info file from bam files ... #23

Closed jordimaggi closed 2 years ago

jordimaggi commented 2 years ago

Hi,

I am testing ClinSV on a Ubuntu 20.04 VM. I pulled the docker image and tried to run the following command:

sudo docker run kccg/clinsv -r all -i $PWD/WGS/*.bam -ref $PWD/WGS/Reference_hg19/hg19.fa -p $PWD/test_run

The script seems to start correctly, but stops right away at the first task. This is the console output I get:

##############################################
####                ClinSV                ####
##############################################
# 15/03/2022 08:40:29

# clinsv dir: /app/clinsv
# projectDir: /media/analyst/Data/test_run
# sampleInfoFile: /media/analyst/Data/test_run/sampleInfo.txt 
# name stem: test_run
# lumpyBatchSize: 15
# genome reference: /media/analyst/Data/WGS/Reference_hg19/hg19.fa
# run steps: all
# number input bams: 1

# Create sample info file from bam files ...
ln -s  /media/analyst/Data/test_run/alignments//.bam
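As a sketch of what might be happening (an assumption on my part, not ClinSV's actual code: the `ln -s` line suggests ClinSV derives the sample name from the BAM file name, roughly basename minus the `.bam` suffix), an empty or unmatched `-i` value would produce exactly this `alignments//.bam` target:

```shell
# Hypothetical illustration only: derive a sample name from a BAM path
# the way the "ln -s" log line above suggests ClinSV does.
sample_name() {
  b=${1##*/}                 # strip the directory part
  printf '%s\n' "${b%.bam}"  # strip the .bam suffix
}

sample_name /media/analyst/Data/WGS/NA12878.bam  # prints: NA12878
sample_name ""                                   # prints an empty line,
                                                 # giving alignments//.bam
```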

Any idea where the problem may lie?

Thanks for your help.

halessi commented 2 years ago

This is exactly my problem as well; identical output using Singularity.

The cluster reports the job as having finished. PLEASE let's figure this out.

NOTE that if you try to run it again, it will work UNTIL a later step, when it looks for the BAM file to have been linked into alignments.

I think it's something to do with the formatting of our BAM headers?

##############################################
####                ClinSV                ####
##############################################
# 15/03/2022 09:33:04

# clinsv dir: /opt/clinsv
# projectDir: /data/LAB_FOLDER/project_folder_using_separate_data_input
# sampleInfoFile: /data/LAB_FOLDER/project_folder_using_separate_data_input/sampleInfo.txt 
# name stem: project_folder_using_separate_data_input
# lumpyBatchSize: 15
# genome reference: /data/LAB_FOLDER/clinsv/refdata-b37
# run steps: all
# number input bams: 44

# Create sample info file from bam files ...
ln -s /vf/users/LAB_FOLDER/BAMs/bqsr-cleaned-SAMPLE.bam /data/LAB_FOLDER/project_folder_using_separate_data_input/alignments/SAMPLE/SAMPLE.bam

I tried running the ln -s command manually to check whether it worked. The file was already linked, so the command succeeded, and yet ClinSV still just quit, so I don't know what is going on.

halessi commented 2 years ago

@drmjc Any chance you have any insight on this? I think both of us are trying v1.0 (not GRCh38), but your input would be appreciated.

Thanks!!

drmjc commented 2 years ago

I think the issue is that neither v0.9 nor v1.0 supports hg19, and you're using v0.9. Andre is best placed to respond, as he wrote it. v1.1 will support hg19, hs37d5, and GRCh38.


halessi commented 2 years ago

Thank you for the reply.

This would make sense -- if the BAM headers or something are formatted differently with hg19, then it would follow that ClinSV fails to link the files (if that data were necessary, or if it ignores improperly formatted input).

So, in order to use hg19 I will need to wait for v1.1, is that correct?

Thanks again!

drmjc commented 2 years ago

I think so, or liftover your bam files to hs37d5, or realign to grch38 (see the other issue about refactoring clinsv). If you don't have too many files, the latter might be the best option.


halessi commented 2 years ago

Update: I was able to fix the linking issue at the start of ClinSV by fixing my .bam.bai files: I had the .bam.bai files soft-linked to the .bai files, which ClinSV didn't like. By creating hard links from the .bai files to the .bam.bai names instead, I was able to resolve this issue.
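For anyone else hitting this, a minimal sketch of the workaround (the directory path and the assumption that each .bai sits next to its BAM are mine, not from ClinSV):

```shell
# Ensure every BAM has a sibling index named <name>.bam.bai as a hard
# link (ClinSV appeared to choke on BAMs whose .bam.bai was a soft link).
link_bam_indexes() {
  dir=$1
  for bam in "$dir"/*.bam; do
    [ -e "$bam" ] || continue        # glob matched nothing
    bai="${bam%.bam}.bai"            # e.g. sample.bai
    if [ -f "$bai" ] && [ ! -e "$bam.bai" ]; then
      ln "$bai" "$bam.bai"           # hard link, not ln -s
    fi
  done
}

link_bam_indexes /data/BAMs          # hypothetical alignment directory
```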

drmjc commented 2 years ago

how intriguing, thanks for the update.

@J-Bradlee, please note this & we should test with

  1. test.bam + test.bai
  2. test.bam + test.bam.bai

Both naming conventions for the .bai index file are accepted in practice (even though the SAM spec doesn't mandate either).
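A sketch of an index lookup that tolerates both conventions (my own illustration of the desired behaviour, not ClinSV's current code):

```shell
# Accept either <name>.bam.bai or <name>.bai for a given BAM.
find_index() {
  bam=$1
  if [ -e "$bam.bai" ]; then
    printf '%s\n' "$bam.bai"
  elif [ -e "${bam%.bam}.bai" ]; then
    printf '%s\n' "${bam%.bam}.bai"
  else
    echo "no index found for $bam" >&2
    return 1
  fi
}
```

For what it's worth, `samtools index sample.bam` writes `sample.bam.bai` by default, so both spellings show up in the wild.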

halessi commented 2 years ago

@drmjc -- just a quick question. Does annotation often take upwards of 4 days? For 45 BAMs, my annotation phase has been running for 4.5 days at this point. Not sure if that's expected or not (200 GB RAM, 32 CPUs).

Thank you!

Hugh

J-Bradlee commented 2 years ago

Hi @halessi, thought I would jump in here and say that for a single 72 GB BAM file it took at least 24 hours to run through all of ClinSV's steps on a similarly specced machine to yours. It also took around 6 hours to finish all the steps for a single 6 GB BAM file. Roughly what is the total size of all 45 of your BAM files?

halessi commented 2 years ago

@J-Bradlee Thanks so much for your reply.

I would guess the total size of all BAM files is about 650 GB. Maybe this was too large a run? I would estimate the total running time for all steps to be in the 10-day range at this point, so perhaps I should have split this up more effectively...
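Sanity-checking that against your 72 GB / 24 h data point (back-of-envelope only; it assumes roughly linear scaling with BAM size, which won't hold exactly):

```shell
# Rough extrapolation: hours ~= total_gb * (24 h / 72 GB)
total_gb=650
hours=$(( total_gb * 24 / 72 ))
echo "${hours} hours (~$(( hours / 24 )) days)"   # prints: 216 hours (~9 days)
```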

Anyways, I guess it sounds like this amount of time isn't crazy. But I'm a little worried it's going to take like 20 days at this point...

Can you speak a bit more about the distribution of time? I.e., for your 72 GB BAM run, was the majority of it spent in lumpy/cnvnator?

Note that I originally gave ClinSV even more resources (64 CPUs and, I think, 400 GB of RAM?), but the job was killed due to a cluster error, and ClinSV didn't seem to be using anywhere near that much, so I cut the allocation back when resuming the job.

Thank you!

Hugh

J-Bradlee commented 2 years ago

No problem @halessi .

Most of the time is spent on the bigwig step, followed by the annotation and then the CNVnator steps. Below is the output of a successful run on a subsampled 6 GB BAM file. Hopefully it gives you a rough idea of how long your BAM files would take.

Note this run used ClinSV v1.0 with the b38 reference genome; however, I think v0.9 with the b37 reference should give similar durations.

##############################################
####                ClinSV                ####
##############################################
# 28/03/2022 18:25:00

# clinsv dir: /app/clinsv
# projectDir: /app/project_folder
# sampleInfoFile: /app/project_folder/sampleInfo.txt 
# name stem: project_folder
# lumpyBatchSize: 5
# genome reference: /app/ref-data/refdata-b38
# run steps: all
# number input bams: 1

# Create sample info file from bam files ...
ln -s /app/input/NA12878.grch38.subsampled.bam /app/project_folder/alignments/FR05812606/FR05812606.bam
ln -s /app/input/NA12878.grch38.subsampled.bam.bai /app/project_folder/alignments/FR05812606/FR05812606.bam.bai
# Read Sample Info from /app/project_folder/sampleInfo.txt
# use: FR05812606       H7LH3CCXX_6             /app/input/NA12878.grch38.subsampled.bam
# 1 samples to process
# If not, please exit make a copy of sampleInfo.txt, modify it and rerun with -s sampleInfo_mod.txt pointing to the new sample info file. 

###### Generate the commands and scripts ######

# bigwig

# lumpy

# cnvnator

# annotate

# prioritize

# qc

###### Run jobs ######

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.createWigs.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.createWigs.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 01:31:33
 ### exist status: 0

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q0.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q0.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:37:20
 ### exist status: 0

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q20.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.q20.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:36:05
 ### exist status: 0

 ### executing: sh /app/project_folder/alignments/FR05812606/bw/sh/bigwig.mq.FR05812606.sh &> /app/project_folder/alignments/FR05812606/bw/sh/bigwig.mq.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:37:10
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/FR05812606/lumpy/sh/lumpy.preproc.FR05812606.sh &> /app/project_folder/SVs/FR05812606/lumpy/sh/lumpy.preproc.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:12:51
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e  ...  

 ### finished after (hh:mm:ss): 00:26:51
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.depth.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.depth.joined.e  ...  

 ### finished after (hh:mm:ss): 00:00:54
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/FR05812606/cnvnator/sh/cnvnator.caller.FR05812606.sh &> /app/project_folder/SVs/FR05812606/cnvnator/sh/cnvnator.caller.FR05812606.e  ...  

 ### finished after (hh:mm:ss): 00:56:31
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/sh/annotate.main.joined.sh &> /app/project_folder/SVs/joined/sh/annotate.main.joined.e  ...  

 ### finished after (hh:mm:ss): 01:27:03
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/joined/sh/prioritize.main.joined.sh &> /app/project_folder/SVs/joined/sh/prioritize.main.joined.e  ...  

 ### finished after (hh:mm:ss): 00:00:07
 ### exist status: 0

 ### executing: sh /app/project_folder/SVs/qc/sh/qc.main.joined.sh &> /app/project_folder/SVs/qc/sh/qc.main.joined.e  ...  

 ### finished after (hh:mm:ss): 00:00:48
 ### exist status: 0

# 29/03/2022 00:52:13 Project project_folder project_folder | Total jobs 11 | Remaining jobs 0 | Remaining steps bigwig,lumpy,cnvnator,annotate,prioritize,qc  11 | Total time: 386 min

# 29/03/2022 00:52:13 Project project_folder project_folder | Total jobs 11 | Remaining jobs 0 | Remaining steps   0 | Total time: 386 min

# Everything done! Exit

# writing igv session files...

xml file: /app/project_folder/igv/FR05812606.xml

I also want to add that you may experience even slower times for the CNVnator step, as its job resources are hard-coded to 16 CPUs and 30 GB of memory (see the source code line here), so it is not using all the resources available to it.