KChen-lab / Monopogen

SNV calling from single cell sequencing
GNU General Public License v3.0
68 stars 16 forks source link

Missing chr20.germline.vcf from germline calling step #55

Open briansha opened 2 months ago

briansha commented 2 months ago

I'm using the example data from the documentation.

In v1.6.0 - the file chr20.germline.vcf does not get generated from the germline calling step.

${path}/src/Monopogen.py  germline  -a ${path}/apps  -r region.lst \
 -p ./  -t 22 \
 -g  GRCh38.chr20.fa  -m 3 -s all  -o bm

This file does, however, get generated in v1.0 of monopogen.

Since I'm running the Somatic SNV calling from scRNA-seq piece of Monopogen, the absence of this file seems to generate further issues downstream.

This needs fixed. In the meantime, I'll just have to try v1.5.0 and see if this issue persists.

jinzhuangdou commented 2 months ago

Could you show the log file when you run v1.6.0? Thanks

briansha commented 2 months ago

Sure. It won't show much.

I found that v1.5.0 actually does produce the chr20.germline.vcf within the germline/ output dir when the germline calling step finishes.

+ python /opt/tools/Monopogen/src/Monopogen.py preProcess --bamFile bam.lst --out bm --app-path /opt/tools/Monopogen/apps --nthreads 1
[2024-05-03 23:36:52,368] INFO     Monopogen.py Performing data preprocess before variant calling...
[2024-05-03 23:36:52,368] INFO     germline.py Parameters in effect:
[2024-05-03 23:36:52,369] INFO     germline.py --subcommand = [preProcess]
[2024-05-03 23:36:52,369] INFO     germline.py --bamFile = [bam.lst]
[2024-05-03 23:36:52,369] INFO     germline.py --out = [bm]
[2024-05-03 23:36:52,370] INFO     germline.py --app_path = [/opt/tools/Monopogen/apps]
[2024-05-03 23:36:52,370] INFO     germline.py --max_mismatch = [3]
[2024-05-03 23:36:52,370] INFO     germline.py --nthreads = [1]
[2024-05-03 23:36:52,483] DEBUG    Monopogen.py PreProcessing sample bm
[2024-05-03 23:36:52,573] INFO     germline.py The contig chr1 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:52,882] INFO     germline.py The contig chr2 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:52,918] INFO     germline.py The contig chr3 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:52,956] INFO     germline.py The contig chr4 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:52,980] INFO     germline.py The contig chr5 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,018] INFO     germline.py The contig chr6 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,043] INFO     germline.py The contig chr7 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,095] INFO     germline.py The contig chr8 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,124] INFO     germline.py The contig chr9 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,163] INFO     germline.py The contig chr10 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,197] INFO     germline.py The contig chr11 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,251] INFO     germline.py The contig chr12 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,275] INFO     germline.py The contig chr13 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,307] INFO     germline.py The contig chr14 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,333] INFO     germline.py The contig chr15 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,364] INFO     germline.py The contig chr16 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,400] INFO     germline.py The contig chr17 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,442] INFO     germline.py The contig chr18 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,475] INFO     germline.py The contig chr19 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:36:53,505] INFO     germline.py The contig chr20 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:40:40,381] INFO     germline.py The contig chr21 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:40:40,424] INFO     germline.py The contig chr22 does not contain the prefix 'chr' and we will add 'chr' on it 
[2024-05-03 23:40:40,482] INFO     Monopogen.py Success! See instructions above.
+ echo chr20
+ less region.lst
chr20
+ python /opt/tools/Monopogen/src/Monopogen.py germline --app-path /opt/tools/Monopogen/apps --region region.lst --imputation-panel ./ --nthreads 1 --reference GRCh38.chr20.fa --max-softClipped 3 --step all --out bm
[2024-05-03 23:40:42,870] INFO     Monopogen.py Performing germline variant calling...
[2024-05-03 23:40:42,871] INFO     germline.py Parameters in effect:
[2024-05-03 23:40:42,871] INFO     germline.py --subcommand = [germline]
[2024-05-03 23:40:42,872] INFO     germline.py --region = [region.lst]
[2024-05-03 23:40:42,872] INFO     germline.py --step = [all]
[2024-05-03 23:40:42,872] INFO     germline.py --out = [bm]
[2024-05-03 23:40:42,872] INFO     germline.py --reference = [GRCh38.chr20.fa]
[2024-05-03 23:40:42,873] INFO     germline.py --imputation_panel = [./]
[2024-05-03 23:40:42,873] INFO     germline.py --max_softClipped = [3]
[2024-05-03 23:40:42,873] INFO     germline.py --app_path = [/opt/tools/Monopogen/apps]
[2024-05-03 23:40:42,873] INFO     germline.py --nthreads = [1]
[2024-05-03 23:40:42,873] INFO     germline.py --norun = [FALSE]
[2024-05-03 23:40:42,874] INFO     Monopogen.py Checking existence of essenstial resource files...
[2024-05-03 23:40:42,875] INFO     Monopogen.py Checking dependencies...
[fai_load] build FASTA index.
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Lines   total/split/realigned/skipped:  10933032/105880/22356/0
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.2ucQGM
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 11:56 PM EDT on 03 May 2024

Command line: java -Xmx19797m -jar beagle.jar
  gl=bm/germline/chr20.gl.vcf.gz
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.gp
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb

reference samples:    3202
target samples:          1

Window 1 [ chr20:273372-39851321 ]
reference markers:   10486
target markers:      10486

Starting burn-in iterations

Window=1 Iteration=1
Time for building model:         1 minute 7 seconds
Time for sampling (singles):     10 seconds
DAG statistics
mean edges/level: 49     max edges/level: 129
mean edges/node:  1.183  mean count/edge: 131

Window=1 Iteration=2
Time for building model:         1 minute 11 seconds
Time for sampling (singles):     9 seconds
DAG statistics
mean edges/level: 48     max edges/level: 124
mean edges/node:  1.167  mean count/edge: 133

Window=1 Iteration=3
Time for building model:         1 minute 6 seconds
Time for sampling (singles):     11 seconds
DAG statistics
mean edges/level: 49     max edges/level: 125
mean edges/node:  1.164  mean count/edge: 131

Window=1 Iteration=4
Time for building model:         1 minute 11 seconds
Time for sampling (singles):     9 seconds
DAG statistics
mean edges/level: 48     max edges/level: 122
mean edges/node:  1.166  mean count/edge: 133

Window=1 Iteration=5
Time for building model:         1 minute 6 seconds
Time for sampling (singles):     12 seconds
DAG statistics
mean edges/level: 48     max edges/level: 119
mean edges/node:  1.167  mean count/edge: 133

Window=1 Iteration=6
Time for building model:         1 minute 8 seconds
Time for sampling (singles):     15 seconds
DAG statistics
mean edges/level: 48     max edges/level: 121
mean edges/node:  1.166  mean count/edge: 133

Window=1 Iteration=7
Time for building model:         1 minute 7 seconds
Time for sampling (singles):     14 seconds
DAG statistics
mean edges/level: 48     max edges/level: 136
mean edges/node:  1.165  mean count/edge: 133

Window=1 Iteration=8
Time for building model:         1 minute 11 seconds
Time for sampling (singles):     18 seconds
DAG statistics
mean edges/level: 48     max edges/level: 125
mean edges/node:  1.167  mean count/edge: 133

Window=1 Iteration=9
Time for building model:         1 minute 2 seconds
Time for sampling (singles):     16 seconds
DAG statistics
mean edges/level: 48     max edges/level: 124
mean edges/node:  1.165  mean count/edge: 133

Window=1 Iteration=10
Time for building model:         1 minute 10 seconds
Time for sampling (singles):     15 seconds
DAG statistics
mean edges/level: 48     max edges/level: 127
mean edges/node:  1.167  mean count/edge: 133

Number of markers:               10486
Total time for building model: 11 minutes 19 seconds
Total time for sampling:       2 minutes 9 seconds
Total run time:                19 minutes 45 seconds

End time: 12:16 AM EDT on 04 May 2024
beagle.27Jul16.86a.jar (version 4.1) finished
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.2ucQGM
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:16 AM EDT on 04 May 2024

Command line: java -Xmx19797m -jar beagle.jar
  gt=bm/germline/chr20.germline.vcf
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.phased
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb

reference samples:    3202
target samples:          1

Window 1 [ chr20:273372-39851321 ]
reference markers:    9133
target markers:       9133

Starting burn-in iterations

Window=1 Iteration=1
Time for building model:         56 seconds
Time for sampling (singles):     1 second
DAG statistics
mean edges/level: 48     max edges/level: 125
mean edges/node:  1.200  mean count/edge: 133

Window=1 Iteration=2
Time for building model:         55 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 47     max edges/level: 128
mean edges/node:  1.184  mean count/edge: 136

Window=1 Iteration=3
Time for building model:         49 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 47     max edges/level: 121
mean edges/node:  1.184  mean count/edge: 136

Window=1 Iteration=4
Time for building model:         57 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 47     max edges/level: 122
mean edges/node:  1.186  mean count/edge: 136

Window=1 Iteration=5
Time for building model:         52 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 47     max edges/level: 122
mean edges/node:  1.184  mean count/edge: 136

Window=1 Iteration=6
Time for building model:         57 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 48     max edges/level: 125
mean edges/node:  1.183  mean count/edge: 133

Window=1 Iteration=7
Time for building model:         56 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 47     max edges/level: 122
mean edges/node:  1.184  mean count/edge: 136

Window=1 Iteration=8
Time for building model:         58 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 47     max edges/level: 118
mean edges/node:  1.185  mean count/edge: 136

Window=1 Iteration=9
Time for building model:         54 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 48     max edges/level: 118
mean edges/node:  1.182  mean count/edge: 133

Window=1 Iteration=10
Time for building model:         1 minute 8 seconds
Time for sampling (singles):     0 seconds
DAG statistics
mean edges/level: 47     max edges/level: 124
mean edges/node:  1.185  mean count/edge: 136

Number of markers:                9133
Total time for building model: 9 minutes 22 seconds
Total time for sampling:       2 seconds
Total run time:                15 minutes 45 seconds

End time: 12:32 AM EDT on 04 May 2024
beagle.27Jul16.86a.jar (version 4.1) finished
bash bm/Script/runGermline_chr20.sh
['bash bm/Script/runGermline_chr20.sh']
[2024-05-04 00:32:29,071] INFO     Monopogen.py Success! See instructions above.
jinzhuangdou commented 2 months ago

Could you show some log files when you run v1.6.0? Are *.gl.vcf.gz file is empty (i.e., no variants included)?

briansha commented 2 months ago

No the files aren't empty. They seem to be the same size in the germline/ dir as v1.0.0 leading up to the last step of Somatic SNV calling from scRNA-seq piece of Monopogen mentioned in the Table of Contents in the documentation.

It's just that the chr20.germline.vcf file is missing in the germline/ dir. The logs state it was used in some step, but that can't be(?) because it was never generated(?).

Of course, if v1.6.0 works with/without the presence of this file chr20.germline.vcf, then fine. But something is causing issues further downstream when I begin to use real data with this pipeline instead of the example data on the Github.

This is the entire code I'm using - based off the Somatic SNV calling from scRNA-seq piece of Monopogen section. Below, I've also included the entire log file. I've been using WDL and Docker containers.

I've been using this exact same code and pipeline for v1.0 with no problems (other than how slow it is), but for v1.6.0 there are issues.

# Preprocess
path="/rsrch3/scratch/bcb/jdou1/scAncestry/Monopogen"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python  ${path}/src/Monopogen.py  preProcess -b bam.lst -o bm  -a ${path}/apps -t 1

# Germline Calling
${path}/src/Monopogen.py  germline  -a ${path}/apps  -r region.lst \
 -p ./  -t 22 \
 -g  GRCh38.chr20.fa  -m 3 -s all  -o bm

# LD refinement
python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 1 \
    -i  bm  -l  CB_7K.maester_scRNA.csv   -s featureInfo     \
    -g   GRCh38.chr20.fa

python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 1  \
    -i  bm  -l  CB_7K.maester_scRNA.csv   -s cellScan     \
    -g   GRCh38.chr20.fa

python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 1 \
    -i  bm  -l  CB_7K.maester_scRNA.csv   -s LDrefinement     \
    -g   GRCh38.chr20.fa
+ path=/opt/tools/Monopogen
+ export LD_LIBRARY_PATH=:/opt/tools/Monopogen/apps:/opt/tools/Monopogen/apps
+ LD_LIBRARY_PATH=:/opt/tools/Monopogen/apps:/opt/tools/Monopogen/apps
+ chmod 700 /opt/tools/Monopogen/apps/bcftools /opt/tools/Monopogen/apps/beagle.08Feb22.fa4.jar /opt/tools/Monopogen/apps/beagle.27Jul16.86a.jar /opt/tools/Monopogen/apps/bgzip /opt/tools/Monopogen/apps/gzip /opt/tools/Monopogen/apps/java /opt/tools/Monopogen/apps/libbz2.so.1.0 /opt/tools/Monopogen/apps/libcrypto.so.1.0.0 /opt/tools/Monopogen/apps/libdeflate.so /opt/tools/Monopogen/apps/picard.jar /opt/tools/Monopogen/apps/samtools /opt/tools/Monopogen/apps/tabix /opt/tools/Monopogen/apps/vcftools
++ basename chr20.maester_scRNA.bam
+ echo bm,chr20.maester_scRNA.bam
+ less bam.lst
bm,chr20.maester_scRNA.bam
+ cp chr20.maester_scRNA.bam .
+ cp chr20.maester_scRNA.bam.bai .
+ cp CB_7K.maester_scRNA.csv .
+ cp GRCh38.chr20.fa .
+ cp CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz .
+ python /opt/tools/Monopogen/src/Monopogen.py preProcess --bamFile bam.lst --out bm --app-path /opt/tools/Monopogen/apps --nthreads 64
[2024-05-07 23:23:44,597] INFO Monopogen.py Performing data preprocess before variant calling...
[2024-05-07 23:23:44,598] INFO germline.py Parameters in effect:
[2024-05-07 23:23:44,598] INFO germline.py --subcommand = [preProcess]
[2024-05-07 23:23:44,598] INFO germline.py --bamFile = [bam.lst]
[2024-05-07 23:23:44,598] INFO germline.py --out = [bm]
[2024-05-07 23:23:44,598] INFO germline.py --app_path = [/opt/tools/Monopogen/apps]
[2024-05-07 23:23:44,598] INFO germline.py --max_mismatch = [3]
[2024-05-07 23:23:44,598] INFO germline.py --nthreads = [64]
[2024-05-07 23:23:44,603] DEBUG Monopogen.py PreProcessing sample bm
[2024-05-07 23:23:44,795] INFO germline.py The contig chr7 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,795] INFO germline.py The contig chr6 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,796] INFO germline.py The contig chr4 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,796] INFO germline.py The contig chr5 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,796] INFO germline.py The contig chr1 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,797] INFO germline.py The contig chr8 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,796] INFO germline.py The contig chr3 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,797] INFO germline.py The contig chr2 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,797] INFO germline.py The contig chr17 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,797] INFO germline.py The contig chr11 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,798] INFO germline.py The contig chr18 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,798] INFO germline.py The contig chr9 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,799] INFO germline.py The contig chr20 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr22 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr13 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr12 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr19 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr10 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr14 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr15 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr16 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:23:44,800] INFO germline.py The contig chr21 does not contain the prefix 'chr' and we will add 'chr' on it
[2024-05-07 23:26:48,599] INFO Monopogen.py Success! See instructions above.
+ echo chr20
+ less region.lst
+ python /opt/tools/Monopogen/src/Monopogen.py germline --app-path /opt/tools/Monopogen/apps --region region.lst --imputation-panel ./ --nthreads 64 --reference GRCh38.chr20.fa --max-softClipped 3 --step all --out bm
chr20
[2024-05-07 23:26:49,143] INFO Monopogen.py Performing germline variant calling...
[2024-05-07 23:26:49,143] INFO germline.py Parameters in effect:
[2024-05-07 23:26:49,143] INFO germline.py --subcommand = [germline]
[2024-05-07 23:26:49,143] INFO germline.py --region = [region.lst]
[2024-05-07 23:26:49,143] INFO germline.py --step = [all]
[2024-05-07 23:26:49,143] INFO germline.py --out = [bm]
[2024-05-07 23:26:49,143] INFO germline.py --reference = [GRCh38.chr20.fa]
[2024-05-07 23:26:49,143] INFO germline.py --imputation_panel = [./]
[2024-05-07 23:26:49,143] INFO germline.py --max_softClipped = [3]
[2024-05-07 23:26:49,143] INFO germline.py --app_path = [/opt/tools/Monopogen/apps]
[2024-05-07 23:26:49,144] INFO germline.py --nthreads = [64]
[2024-05-07 23:26:49,144] INFO germline.py --norun = [FALSE]
[2024-05-07 23:26:49,144] INFO Monopogen.py Checking existence of essenstial resource files...
[2024-05-07 23:26:49,144] INFO Monopogen.py Checking dependencies...
[fai_load] build FASTA index.
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Lines total/split/realigned/skipped: 10933032/105880/22356/0
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.fNLpYx
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 11:34 PM EDT on 07 May 2024

Command line: java -Xmx20480m -jar beagle.jar
gl=bm/germline/chr20.gl.vcf.gz
ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chrom=chr20
out=bm/germline/chr20.gp
impute=false
modelscale=2
nthreads=24
gprobs=true
niterations=0

No genetic map is specified: using 1 cM = 1 Mb

reference samples: 3202
target samples: 1

Window 1 [ chr20:273372-39851321 ]
reference markers: 10486
target markers: 10486

Starting burn-in iterations

Window=1 Iteration=1
Time for building model: 39 seconds
Time for sampling (singles): 8 seconds
DAG statistics
mean edges/level: 48 max edges/level: 123
mean edges/node: 1.180 mean count/edge: 133

Window=1 Iteration=2
Time for building model: 43 seconds
Time for sampling (singles): 8 seconds
DAG statistics
mean edges/level: 48 max edges/level: 120
mean edges/node: 1.165 mean count/edge: 133

Window=1 Iteration=3
Time for building model: 39 seconds
Time for sampling (singles): 8 seconds
DAG statistics
mean edges/level: 48 max edges/level: 130
mean edges/node: 1.165 mean count/edge: 133

Window=1 Iteration=4
Time for building model: 43 seconds
Time for sampling (singles): 8 seconds
DAG statistics
mean edges/level: 48 max edges/level: 119
mean edges/node: 1.165 mean count/edge: 133

Window=1 Iteration=5
Time for building model: 39 seconds
Time for sampling (singles): 9 seconds
DAG statistics
mean edges/level: 48 max edges/level: 124
mean edges/node: 1.166 mean count/edge: 133

Window=1 Iteration=6
Time for building model: 47 seconds
Time for sampling (singles): 13 seconds
DAG statistics
mean edges/level: 48 max edges/level: 124
mean edges/node: 1.167 mean count/edge: 133

Window=1 Iteration=7
Time for building model: 40 seconds
Time for sampling (singles): 12 seconds
DAG statistics
mean edges/level: 48 max edges/level: 130
mean edges/node: 1.167 mean count/edge: 133

Window=1 Iteration=8
Time for building model: 44 seconds
Time for sampling (singles): 12 seconds
DAG statistics
mean edges/level: 48 max edges/level: 133
mean edges/node: 1.166 mean count/edge: 133

Window=1 Iteration=9
Time for building model: 40 seconds
Time for sampling (singles): 12 seconds
DAG statistics
mean edges/level: 48 max edges/level: 132
mean edges/node: 1.166 mean count/edge: 133

Window=1 Iteration=10
Time for building model: 44 seconds
Time for sampling (singles): 13 seconds
DAG statistics
mean edges/level: 48 max edges/level: 126
mean edges/node: 1.167 mean count/edge: 133

Number of markers: 10486
Total time for building model: 6 minutes 58 seconds
Total time for sampling: 1 minute 43 seconds
Total run time: 9 minutes 40 seconds

End time: 11:44 PM EDT on 07 May 2024
beagle.27Jul16.86a.jar (version 4.1) finished
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.fNLpYx
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 11:44 PM EDT on 07 May 2024

Command line: java -Xmx20480m -jar beagle.jar
gt=bm/germline/chr20.germline.vcf
ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chrom=chr20
out=bm/germline/chr20.phased
impute=false
modelscale=2
nthreads=24
gprobs=true
niterations=0

No genetic map is specified: using 1 cM = 1 Mb

reference samples: 3202
target samples: 1

Window 1 [ chr20:273372-39851321 ]
reference markers: 9129
target markers: 9129

Starting burn-in iterations

Window=1 Iteration=1
Time for building model: 31 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 48 max edges/level: 119
mean edges/node: 1.201 mean count/edge: 133

Window=1 Iteration=2
Time for building model: 34 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 47 max edges/level: 127
mean edges/node: 1.184 mean count/edge: 136

Window=1 Iteration=3
Time for building model: 31 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 48 max edges/level: 122
mean edges/node: 1.183 mean count/edge: 133

Window=1 Iteration=4
Time for building model: 33 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 47 max edges/level: 125
mean edges/node: 1.184 mean count/edge: 136

Window=1 Iteration=5
Time for building model: 32 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 48 max edges/level: 123
mean edges/node: 1.183 mean count/edge: 133

Window=1 Iteration=6
Time for building model: 34 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 47 max edges/level: 120
mean edges/node: 1.185 mean count/edge: 136

Window=1 Iteration=7
Time for building model: 32 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 48 max edges/level: 128
mean edges/node: 1.182 mean count/edge: 133

Window=1 Iteration=8
Time for building model: 34 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 47 max edges/level: 133
mean edges/node: 1.186 mean count/edge: 136

Window=1 Iteration=9
Time for building model: 32 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 47 max edges/level: 118
mean edges/node: 1.183 mean count/edge: 136

Window=1 Iteration=10
Time for building model: 33 seconds
Time for sampling (singles): 0 seconds
DAG statistics
mean edges/level: 47 max edges/level: 122
mean edges/node: 1.186 mean count/edge: 136

Number of markers: 9129
Total time for building model: 5 minutes 25 seconds
Total time for sampling: 1 second
Total run time: 6 minutes 25 seconds

End time: 11:51 PM EDT on 07 May 2024
beagle.27Jul16.86a.jar (version 4.1) finished
bash bm/Script/runGermline_chr20.sh
[2024-05-07 23:51:02,765] INFO Monopogen.py Success! See instructions above.
['bash bm/Script/runGermline_chr20.sh']

Listing bm germline folder contents:
+ echo 'Listing bm germline folder contents:'
+ ls bm/germline
chr20.gl.vcf.gz
chr20.gp.log
chr20.gp.vcf.gz
chr20.phased.log
chr20.phased.vcf.gz
+ /opt/tools/Monopogen/apps/tabix -p vcf bm/germline/chr20.gl.vcf.gz
+ /opt/tools/Monopogen/apps/tabix -p vcf bm/germline/chr20.gp.vcf.gz
+ /opt/tools/Monopogen/apps/tabix -p vcf bm/germline/chr20.phased.vcf.gz
+ python /opt/tools/Monopogen/src/Monopogen.py somatic --app-path /opt/tools/Monopogen/apps --region region.lst --nthreads 64 --input-folder bm --barcode CB_7K.maester_scRNA.csv --step featureInfo --reference GRCh38.chr20.fa
[2024-05-08 00:07:31,757] INFO Monopogen.py Get feature information from sequencing data...
[2024-05-08 00:07:57,145] INFO Monopogen.py Success! See instructions above.
+ python /opt/tools/Monopogen/src/Monopogen.py somatic --app-path /opt/tools/Monopogen/apps --region region.lst --nthreads 64 --input-folder bm --barcode CB_7K.maester_scRNA.csv --step cellScan --reference GRCh38.chr20.fa
[2024-05-08 00:07:57,662] INFO Monopogen.py Collect single cell level information from sequencing data...
scanning read 1000000
scanning read 2000000
2087076:2087032:1323435:525996:301903:1023045
[2024-05-08 00:21:18,543] INFO Monopogen.py Success! See instructions above.

+ python /opt/tools/Monopogen/src/Monopogen.py somatic --app-path /opt/tools/Monopogen/apps --region region.lst --nthreads 64 --input-folder bm --barcode CB_7K.maester_scRNA.csv --step LDrefinement --reference GRCh38.chr20.fa
[2024-05-08 00:29:25,298] INFO Monopogen.py Run LD refinement ...
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Rscript /opt/tools/Monopogen/apps/../src/LDrefinement.R bm/somatic/chr20.gl.filter.hc.cell.mat.gz bm/somatic/ chr20
[2024-05-08 00:31:14,101] INFO Monopogen.py Success! See instructions above.
jinzhuangdou commented 2 months ago

The chr20.germline.vcf is a temporary file and removed since it is further phased in chr20.phased.vcf.gz. This is the final output in germline module

jinzhuangdou commented 2 months ago

Based on the log file, it seems you have finished somatic calling on chr20 example using v1.60 successfully, am I wrong?

briansha commented 2 months ago

I see - thank you for clarifying the circumstances of the chr20.germline.vcf file.

I have checked the CSV and RDS files of interest in the somatic/ dir coming off the pipeline betweeen v1.0 and v1.6 of Monopogen.

# v1.0
48K     chr20.putativeSNVs.csv
224K    chr20.SNV_mat.RDS

# v1.6
44K     monopogen/bm/somatic/chr20.putativeSNVs.csv
348K    monopogen/bm/somatic/chr20.SNV_mat.RDS

The file sizes are different. Checking the statistics of the files themselves, they share around 25% overlap of data having the exact same chr_pos_ref_alt.

So v1.0 and v1.6 produce different outputs. However, it seems v1.6 produces better, more realistic calls.

So, just to confirm, this is still the correct process for Somatic SNV calling from scRNA-seq piece of Monopogen for both v1.0 and v1.6 of Monopogen?

# Preprocess
path="/rsrch3/scratch/bcb/jdou1/scAncestry/Monopogen"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python  ${path}/src/Monopogen.py  preProcess -b bam.lst -o bm  -a ${path}/apps -t 1

# Germline Calling
${path}/src/Monopogen.py  germline  -a ${path}/apps  -r region.lst \
 -p ./  -t 22 \
 -g  GRCh38.chr20.fa  -m 3 -s all  -o bm

# LD refinement
python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 1 \
    -i  bm  -l  CB_7K.maester_scRNA.csv   -s featureInfo     \
    -g   GRCh38.chr20.fa

python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 1  \
    -i  bm  -l  CB_7K.maester_scRNA.csv   -s cellScan     \
    -g   GRCh38.chr20.fa

python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 1 \
    -i  bm  -l  CB_7K.maester_scRNA.csv   -s LDrefinement     \
    -g   GRCh38.chr20.fa
jinzhuangdou commented 2 months ago

The workflow is correct for somatic calling.