bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

Question about simulating somatic mutations #113

Closed by leedchou 1 year ago

leedchou commented 2 years ago

Hi, @litaifang

You mentioned that an ideal simulated example comes from sequencing replicates of the same sample. In my case, however, I have only one normal BAM for each sample. Can I use that single normal BAM as both tumor-bam-in and normal-bam-in to generate synthetic data? Would training a model on such simulated samples, rather than on ideal examples, hurt calling precision?

Best regards, Chou

litaifang commented 2 years ago

You may randomly split that one normal BAM file into two and designate one of them as the "tumor." You cannot use the same BAM file for both inputs, because then there will be no false positives in the training data. False positives arise when the two inputs contain different reads from the same sample. With identical reads (as when one single BAM file is used twice), there will be none.
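
A minimal sketch of such a split, assuming a reasonably recent samtools is installed. `samtools view -s SEED.FRACTION` subsamples by read name, so mates stay together, and `-U` should write the reads that were not selected to a second file, giving two disjoint halves. The file names, the seed, and the `run` helper are placeholders for illustration. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Split one BAM into two disjoint, randomly chosen subsets.
# Assumptions: samtools is installed; file names and seed are placeholders.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

split_bam() {
    # subsample is SEED.FRACTION, e.g. 42.5 = seed 42, keep about 50% of reads
    local in_bam=$1 out_a=$2 out_b=$3 subsample=${4:-42.5}
    # Selected reads go to out_a; everything else goes to out_b via -U.
    run samtools view -b -s "$subsample" -o "$out_a" -U "$out_b" "$in_bam"
    run samtools index "$out_a"
    run samtools index "$out_b"
}

split_bam normal.bam designated_tumor.bam designated_normal.bam
```

The `--split-proportion` option in the script later in this thread suggests BamSimulator can perform a split itself, so check its documentation before splitting by hand.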

leedchou commented 2 years ago


Much appreciation. Here's another issue I met today:

I failed to pull the Docker image when trying to run BamSimulator_singleThread.sh. I suspect a security mechanism on my HPC blocked the process. As an alternative, would it work to download your Docker images on another device and then run the simulation script? If so, where can I download your Docker images?

Thanks again.

litaifang commented 2 years ago

You should be able to download the Docker image on another device, save the image as a .tar file, copy that .tar file to your HPC drive, and then load (unpack) that .tar file there.

Another alternative may be to build the docker image using the docker file at https://github.com/bioinform/somaticseq/tree/master/Dockerfiles (for bamsurgeon: https://github.com/litaifang/bamsurgeon/blob/master/Dockerfiles/bamsurgeon-1.1-3.dockerfile), tag it like the image you would be using, and then your system will use that local image instead.
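
A rough sketch of both routes using standard Docker commands. The image tag below is an assumption inferred from the Dockerfile name, so check the BamSimulator script for the tag it actually requests. The tarball name and the `run` helper are also placeholders. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Two ways to get an image onto a machine that cannot pull from a registry.
# Assumptions: IMAGE is a guessed placeholder tag; Docker is installed.

IMAGE=${IMAGE:-lethalfang/bamsurgeon:1.1-3}
TARBALL=${TARBALL:-bamsurgeon.tar}

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Route 1a: on a machine with internet access, pull and export the image.
export_image() {
    run docker pull "$IMAGE"
    run docker save -o "$TARBALL" "$IMAGE"
}

# Route 1b: after copying the tarball to the HPC, import it there.
import_image() {
    run docker load -i "$TARBALL"
}

# Route 2: build from a downloaded Dockerfile and give it the expected tag,
# using the Dockerfile's directory as the build context.
build_image() {
    local dockerfile=$1
    run docker build -f "$dockerfile" -t "$IMAGE" "$(dirname "$dockerfile")"
}

export_image
import_image
build_image ./bamsurgeon-1.1-3.dockerfile
```

Either way, the local image must carry exactly the tag the script asks for. Otherwise Docker will try to pull again.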

litaifang commented 2 years ago

Yeah Bamsurgeon can take a very long time. Why not use the multi-thread option?

-- Li Tai

http://www.chem.ucla.edu/~ltfang

On Thu, Apr 28, 2022 at 7:22 PM leed @.***> wrote:


Thanks for your answer; I will give it a try later. Before that, I managed to start BamSimulator yesterday with the action set to bash. It is still running after 15 hours. How long should this process take on WGS data (60X, single thread)?

Best regards.


leedchou commented 2 years ago


Sorry, I just returned from vacation. I've tried the multi-thread option by running the following script:

HOME_PATH=/variantcall/USER/chou
REF=/source/ref/hs37d5/hs37d5.fa
REPLICATE_001=/data/1/HG001.bam
REPLICATE_002=/data/2/HG001.bam

$HOME_PATH/somaticseq/somaticseq/utilities/singularities/bamSimulator/BamSimulator_multiThreads.sh \
--genome-reference  $REF \
--tumor-bam-in      $REPLICATE_001 \
--normal-bam-in     $REPLICATE_002 \
--tumor-bam-out     syntheticTumor.bam \
--normal-bam-out    syntheticNormal.bam \
--split-proportion  0.5 \
--threads           8 \
--num-snvs          20000 \
--num-indels        8000 \
--min-vaf           0.0 \
--max-vaf           1.0 \
--left-beta         2 \
--right-beta        5 \
--min-variant-reads 2 \
--output-dir        $HOME_PATH/TN_data/simulated/HG001 \
--action            qsub

However, it did not work. When I checked qstat, I found that all 8 jobs were just waiting in the queue. I am sure these tasks were submitted to the same node on the cluster, because I have only one node to use. I wonder whether this is related to the number of nodes or CPUs.

litaifang commented 2 years ago

When run in parallel, each "region" requires its own resources. Maybe your one node only has enough memory for one single task at a time?
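
As a back-of-the-envelope check, the number of tasks that can run at once on one node is limited by both memory and cores. The helper below is a hypothetical illustration and the numbers are made up. The real per-task memory request must be read from the generated job scripts.

```shell
#!/usr/bin/env bash
# Estimate how many region tasks can run at once on a single node.
# All numbers are hypothetical; check the real per-task memory request.

max_concurrent_tasks() {
    local node_mem_gb=$1 task_mem_gb=$2 cpus=$3
    local by_mem=$(( node_mem_gb / task_mem_gb ))   # integer division rounds down
    # The smaller of the memory limit and the CPU limit wins.
    if [ "$by_mem" -lt "$cpus" ]; then echo "$by_mem"; else echo "$cpus"; fi
}

# Example: a 64 GB node, 48 GB requested per task, 16 cores.
max_concurrent_tasks 64 48 16   # prints 1
```

With numbers like these, 8 submitted tasks would run one after another, so the multi-thread mode would bring no speedup.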

leedchou commented 2 years ago

Yes, it seems there is no difference in runtime between parallel and non-parallel mode in my case. I modified the scripts you posted, and they are now running in parallel with GNU parallel.
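
The modified scripts are not shown in this thread. As a rough sketch of the general pattern, assuming each region's commands live in their own script file (the function name and paths are hypothetical), GNU parallel can cap how many regions run at once on a single node. The `xargs -P` branch is only a fallback for machines without GNU parallel.

```shell
#!/usr/bin/env bash
# Run per-region scripts concurrently on one node, at most $jobs at a time.
# Assumption: each region's commands were written to its own script file.

run_region_scripts() {
    local jobs=$1
    shift
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # {} is replaced by each script path listed after :::
        parallel --jobs "$jobs" bash {} ::: "$@"
    else
        # Fallback when GNU parallel is not installed.
        printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" bash
    fi
}

# Hypothetical usage, two regions at a time:
# run_region_scripts 2 /path/to/output/region1.sh /path/to/output/region2.sh
```

Choosing the job count from the node's memory rather than its core count avoids the queueing problem described earlier in the thread.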

Thanks.