KChen-lab / Monopogen

SNV calling from single cell sequencing
GNU General Public License v3.0
68 stars 16 forks source link

Issues in the test file chr20.maester_scRNA.bam for somatic mutation calling #30

Open alextidd opened 7 months ago

alextidd commented 7 months ago

Hi, Thanks so much for developing this tool! I've been trying to get the test dataset running to call somatic mutations, but have been having some trouble. I downloaded the file chr20.maester_scRNA.bam from https://drive.google.com/file/d/1nS2rjrab-QSiq-FhpTWOtJesCE9iS_0k/view?usp=share_link, but it appears to be broken.

I ran the code in this directory:

$ ls 
bam.lst
CB_7K.maester_scRNA.csv
CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chr20_2Mb.hg38.fa
chr20_2Mb.hg38.fa.fai
chr20.maester_scRNA.bam
chr20.maester_scRNA.bam.bai
region.lst

The downloaded chr20.maester_scRNA.bam file looks like this:

$ samtools view chr20.maester_scRNA.bam | head -2

SRR15598776.294805248_TCTTGCGCAGGCACAC_CTCTCCTCGCCA 2064    20  60218   0   10H20M61H   *   0   0   GCAATCCTTTCCTCTCCATT    ,,:,:,,,,,,,,,,,,::F    NM:i:0  MD:Z:20 AS:i:20 XS:i:19 SA:Z:5,113545723,-,22S37M32S,0,0;5,156763440,+,22S19M50S,0,0;   XA:Z:11,+88315939,60S19M12S,0;
SRR15598776.80450257_AATGAAGTCGCGGACT_AAGTATACGCTC  16  20  61569   0   43S19M29S   *   0   0   TATACATATGTCCCCTCCCTCTTGAATCTCACCTTCTACCCCCGACTCCATTCCACTCCTCTAGGTTGTCACAGAGCACCGCATTTGGCTC FF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0  MD:Z:19 AS:i:19 XS:i:19 SA:Z:5,64380457,-,19S19M53S,0,0;15,76070764,-,5S19M67S,0,0;X,10097594,-,67S19M5S,0,0;   XA:Z:Y,-10959620,44S19M28S,0;KI270736.1,+62338,28S19M44S,0;

$ samtools view chr20.maester_scRNA.bam | tail -2

SRR15598776.418134198_TTTCATGAGCAACTTC_TCACGTTTCCTA 2064    20  39999757    0   25H24M42H   *   0   0   ATATTTAAAATTTTATTTTTTAAT    ,,:FF,:,FF,,::FF,,,FF,:,    NM:i:0  MD:Z:24 AS:i:24 XS:i:23 SA:Z:X,46457624,-,3M1I25M62S,0,1;6,124139067,-,53S23M15S,0,0;7,129403981,-,66S22M3S,0,0;8,102417610,+,30S20M41S,0,0;
SRR15598776.180926021_GACTCAAAGTTCACTG_ACCATATAAAAT 2048    20  39999869    0   70H21M  *   0   0   ATTTACTATAAAAATAGTTAC   F,,:,,,,,:::,,,,,,,,,   NM:i:1  MD:Z:20T0   AS:i:20 XS:i:20 SA:Z:1,82576961,-,21S39M31S,0,0;4,180292349,+,20S19M52S,0,0;    XA:Z:5,-14292491,7S25M59S,1;11,+95173927,67S19M5S,0;14,-21926828,8S19M64S,0;

$ samtools view -H chr20.maester_scRNA.bam

@HD VN:1.3  SO:coordinate
@SQ SN:1    LN:248956422
@SQ SN:10   LN:133797422
@SQ SN:11   LN:135086622
@SQ SN:12   LN:133275309
@SQ SN:13   LN:114364328
@SQ SN:14   LN:107043718
@SQ SN:15   LN:101991189
@SQ SN:16   LN:90338345
@SQ SN:17   LN:83257441
@SQ SN:18   LN:80373285
@SQ SN:19   LN:58617616
@SQ SN:2    LN:242193529
@SQ SN:20   LN:64444167
@SQ SN:21   LN:46709983
@SQ SN:22   LN:50818468
@SQ SN:3    LN:198295559
@SQ SN:4    LN:190214555
@SQ SN:5    LN:181538259
@SQ SN:6    LN:170805979
@SQ SN:7    LN:159345973
@SQ SN:8    LN:145138636
@SQ SN:9    LN:138394717
@SQ SN:MT   LN:16569
@SQ SN:X    LN:156040895
@SQ SN:Y    LN:57227415
@SQ SN:KI270728.1   LN:1872759
@SQ SN:KI270727.1   LN:448248
@SQ SN:KI270442.1   LN:392061
@SQ SN:KI270729.1   LN:280839
@SQ SN:GL000225.1   LN:211173
@SQ SN:KI270743.1   LN:210658
@SQ SN:GL000008.2   LN:209709
@SQ SN:GL000009.2   LN:201709
@SQ SN:KI270747.1   LN:198735
@SQ SN:KI270722.1   LN:194050
@SQ SN:GL000194.1   LN:191469
@SQ SN:KI270742.1   LN:186739
@SQ SN:GL000205.2   LN:185591
@SQ SN:GL000195.1   LN:182896
@SQ SN:KI270736.1   LN:181920
@SQ SN:KI270733.1   LN:179772
@SQ SN:GL000224.1   LN:179693
@SQ SN:GL000219.1   LN:179198
@SQ SN:KI270719.1   LN:176845
@SQ SN:GL000216.2   LN:176608
@SQ SN:KI270712.1   LN:176043
@SQ SN:KI270706.1   LN:175055
@SQ SN:KI270725.1   LN:172810
@SQ SN:KI270744.1   LN:168472
@SQ SN:KI270734.1   LN:165050
@SQ SN:GL000213.1   LN:164239
@SQ SN:GL000220.1   LN:161802
@SQ SN:KI270715.1   LN:161471
@SQ SN:GL000218.1   LN:161147
@SQ SN:KI270749.1   LN:158759
@SQ SN:KI270741.1   LN:157432
@SQ SN:GL000221.1   LN:155397
@SQ SN:KI270716.1   LN:153799
@SQ SN:KI270731.1   LN:150754
@SQ SN:KI270751.1   LN:150742
@SQ SN:KI270750.1   LN:148850
@SQ SN:KI270519.1   LN:138126
@SQ SN:GL000214.1   LN:137718
@SQ SN:KI270708.1   LN:127682
@SQ SN:KI270730.1   LN:112551
@SQ SN:KI270438.1   LN:112505
@SQ SN:KI270737.1   LN:103838
@SQ SN:KI270721.1   LN:100316
@SQ SN:KI270738.1   LN:99375
@SQ SN:KI270748.1   LN:93321
@SQ SN:KI270435.1   LN:92983
@SQ SN:GL000208.1   LN:92689
@SQ SN:KI270538.1   LN:91309
@SQ SN:KI270756.1   LN:79590
@SQ SN:KI270739.1   LN:73985
@SQ SN:KI270757.1   LN:71251
@SQ SN:KI270709.1   LN:66860
@SQ SN:KI270746.1   LN:66486
@SQ SN:KI270753.1   LN:62944
@SQ SN:KI270589.1   LN:44474
@SQ SN:KI270726.1   LN:43739
@SQ SN:KI270735.1   LN:42811
@SQ SN:KI270711.1   LN:42210
@SQ SN:KI270745.1   LN:41891
@SQ SN:KI270714.1   LN:41717
@SQ SN:KI270732.1   LN:41543
@SQ SN:KI270713.1   LN:40745
@SQ SN:KI270754.1   LN:40191
@SQ SN:KI270710.1   LN:40176
@SQ SN:KI270717.1   LN:40062
@SQ SN:KI270724.1   LN:39555
@SQ SN:KI270720.1   LN:39050
@SQ SN:KI270723.1   LN:38115
@SQ SN:KI270718.1   LN:38054
@SQ SN:KI270317.1   LN:37690
@SQ SN:KI270740.1   LN:37240
@SQ SN:KI270755.1   LN:36723
@SQ SN:KI270707.1   LN:32032
@SQ SN:KI270579.1   LN:31033
@SQ SN:KI270752.1   LN:27745
@SQ SN:KI270512.1   LN:22689
@SQ SN:KI270322.1   LN:21476
@SQ SN:GL000226.1   LN:15008
@SQ SN:KI270311.1   LN:12399
@SQ SN:KI270366.1   LN:8320
@SQ SN:KI270511.1   LN:8127
@SQ SN:KI270448.1   LN:7992
@SQ SN:KI270521.1   LN:7642
@SQ SN:KI270581.1   LN:7046
@SQ SN:KI270582.1   LN:6504
@SQ SN:KI270515.1   LN:6361
@SQ SN:KI270588.1   LN:6158
@SQ SN:KI270591.1   LN:5796
@SQ SN:KI270522.1   LN:5674
@SQ SN:KI270507.1   LN:5353
@SQ SN:KI270590.1   LN:4685
@SQ SN:KI270584.1   LN:4513
@SQ SN:KI270320.1   LN:4416
@SQ SN:KI270382.1   LN:4215
@SQ SN:KI270468.1   LN:4055
@SQ SN:KI270467.1   LN:3920
@SQ SN:KI270362.1   LN:3530
@SQ SN:KI270517.1   LN:3253
@SQ SN:KI270593.1   LN:3041
@SQ SN:KI270528.1   LN:2983
@SQ SN:KI270587.1   LN:2969
@SQ SN:KI270364.1   LN:2855
@SQ SN:KI270371.1   LN:2805
@SQ SN:KI270333.1   LN:2699
@SQ SN:KI270374.1   LN:2656
@SQ SN:KI270411.1   LN:2646
@SQ SN:KI270414.1   LN:2489
@SQ SN:KI270510.1   LN:2415
@SQ SN:KI270390.1   LN:2387
@SQ SN:KI270375.1   LN:2378
@SQ SN:KI270420.1   LN:2321
@SQ SN:KI270509.1   LN:2318
@SQ SN:KI270315.1   LN:2276
@SQ SN:KI270302.1   LN:2274
@SQ SN:KI270518.1   LN:2186
@SQ SN:KI270530.1   LN:2168
@SQ SN:KI270304.1   LN:2165
@SQ SN:KI270418.1   LN:2145
@SQ SN:KI270424.1   LN:2140
@SQ SN:KI270417.1   LN:2043
@SQ SN:KI270508.1   LN:1951
@SQ SN:KI270303.1   LN:1942
@SQ SN:KI270381.1   LN:1930
@SQ SN:KI270529.1   LN:1899
@SQ SN:KI270425.1   LN:1884
@SQ SN:KI270396.1   LN:1880
@SQ SN:KI270363.1   LN:1803
@SQ SN:KI270386.1   LN:1788
@SQ SN:KI270465.1   LN:1774
@SQ SN:KI270383.1   LN:1750
@SQ SN:KI270384.1   LN:1658
@SQ SN:KI270330.1   LN:1652
@SQ SN:KI270372.1   LN:1650
@SQ SN:KI270548.1   LN:1599
@SQ SN:KI270580.1   LN:1553
@SQ SN:KI270387.1   LN:1537
@SQ SN:KI270391.1   LN:1484
@SQ SN:KI270305.1   LN:1472
@SQ SN:KI270373.1   LN:1451
@SQ SN:KI270422.1   LN:1445
@SQ SN:KI270316.1   LN:1444
@SQ SN:KI270340.1   LN:1428
@SQ SN:KI270338.1   LN:1428
@SQ SN:KI270583.1   LN:1400
@SQ SN:KI270334.1   LN:1368
@SQ SN:KI270429.1   LN:1361
@SQ SN:KI270393.1   LN:1308
@SQ SN:KI270516.1   LN:1300
@SQ SN:KI270389.1   LN:1298
@SQ SN:KI270466.1   LN:1233
@SQ SN:KI270388.1   LN:1216
@SQ SN:KI270544.1   LN:1202
@SQ SN:KI270310.1   LN:1201
@SQ SN:KI270412.1   LN:1179
@SQ SN:KI270395.1   LN:1143
@SQ SN:KI270376.1   LN:1136
@SQ SN:KI270337.1   LN:1121
@SQ SN:KI270335.1   LN:1048
@SQ SN:KI270378.1   LN:1048
@SQ SN:KI270379.1   LN:1045
@SQ SN:KI270329.1   LN:1040
@SQ SN:KI270419.1   LN:1029
@SQ SN:KI270336.1   LN:1026
@SQ SN:KI270312.1   LN:998
@SQ SN:KI270539.1   LN:993
@SQ SN:KI270385.1   LN:990
@SQ SN:KI270423.1   LN:981
@SQ SN:KI270392.1   LN:971
@SQ SN:KI270394.1   LN:970
@RG ID:out_sorted   SM:out_sorted   LB:0.1  PU:out_sorted   PL:ILLUMINA
@PG PN:bwa  ID:bwa  VN:0.7.17-r1198-dirty   CL:/risapps/rhel7/bwa/0.7.17/bwa mem -t24 -T0 /rsrch3/scratch/bcb/jdou1/PM1645_CRISPR/hg38/genome test.CB.fastq
@PG ID:samtools PN:samtools PP:bwa  VN:1.18 CL:samtools view -H chr20.maester_scRNA.bam

The preprocessing step appears to run fine:

$ python ${path}/src/Monopogen.py  preProcess \
>    -b bam.lst \
>    -o bm \
>    -a ${path}/apps \
>    -t 8

[2023-12-12 12:22:01,044] INFO     Monopogen.py Performing data preprocess before variant calling...
[2023-12-12 12:22:01,044] INFO     germline.py Parameters in effect:
[2023-12-12 12:22:01,044] INFO     germline.py --subcommand = [preProcess]
[2023-12-12 12:22:01,044] INFO     germline.py --bamFile = [bam.lst]
[2023-12-12 12:22:01,044] INFO     germline.py --out = [bm]
[2023-12-12 12:22:01,044] INFO     germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:22:01,044] INFO     germline.py --max_mismatch = [3]
[2023-12-12 12:22:01,044] INFO     germline.py --nthreads = [8]
[2023-12-12 12:22:01,055] DEBUG    Monopogen.py PreProcessing sample bm
[2023-12-12 12:22:01,089] INFO     germline.py The contig chr4 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,090] INFO     germline.py The contig chr5 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,090] INFO     germline.py The contig chr6 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,091] INFO     germline.py The contig chr1 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,092] INFO     germline.py The contig chr2 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,094] INFO     germline.py The contig chr7 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,094] INFO     germline.py The contig chr3 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,096] INFO     germline.py The contig chr8 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,173] INFO     germline.py The contig chr9 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,176] INFO     germline.py The contig chr10 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,178] INFO     germline.py The contig chr11 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,179] INFO     germline.py The contig chr12 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,180] INFO     germline.py The contig chr14 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,183] INFO     germline.py The contig chr13 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,188] INFO     germline.py The contig chr15 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,191] INFO     germline.py The contig chr16 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,243] INFO     germline.py The contig chr18 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,249] INFO     germline.py The contig chr17 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,250] INFO     germline.py The contig chr19 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,254] INFO     germline.py The contig chr20 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,257] INFO     germline.py The contig chr21 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,264] INFO     germline.py The contig chr22 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:24:17,990] INFO     Monopogen.py Success! See instructions above.

The germline step emits lots of errors / warnings but still reports that it ran successfully:

$ python  ${path}/src/Monopogen.py  germline  \
>    -a   ${path}/apps \
>    -r  region.lst \
>    -p  ./ \
>    -g  chr20_2Mb.hg38.fa \
>    -m 3 \
>    -t 8 \
>    -s all \
>    -o bm

[2023-12-12 12:31:53,984] INFO     Monopogen.py Performing germline variant calling...
[2023-12-12 12:31:53,985] INFO     germline.py Parameters in effect:
[2023-12-12 12:31:53,985] INFO     germline.py --subcommand = [germline]
[2023-12-12 12:31:53,985] INFO     germline.py --region = [region.lst]
[2023-12-12 12:31:53,985] INFO     germline.py --step = [all]
[2023-12-12 12:31:53,985] INFO     germline.py --out = [bm]
[2023-12-12 12:31:53,985] INFO     germline.py --reference = [chr20_2Mb.hg38.fa]
[2023-12-12 12:31:53,985] INFO     germline.py --imputation_panel = [./]
[2023-12-12 12:31:53,985] INFO     germline.py --max_softClipped = [3]
[2023-12-12 12:31:53,985] INFO     germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:31:53,985] INFO     germline.py --nthreads = [8]
[2023-12-12 12:31:53,985] INFO     germline.py --norun = [FALSE]
[2023-12-12 12:31:53,985] INFO     Monopogen.py Checking existence of essenstial resource files...
[2023-12-12 12:31:54,050] INFO     Monopogen.py Checking dependencies...
['bash bm/Script/runGermline_chr20.sh']
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Reference allele mismatch at chr20:3000001 .. REF_SEQ:'A' vs VCF:'N'
[W::vcf_parse] Contig 'i' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:3810050 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'A' is not defined in the header
[W::vcf_parse] INFO '0' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0'
[W::vcf_parse] Contig '|' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '50,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=50,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '50,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5046788 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '213,96,0:32' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=213,96,0:32,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '213,96,0:32'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5847394 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '48,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=48,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '48,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:8188931 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0'
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:17944845 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'T' is not defined in the header
[W::vcf_parse] Contig '?=' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:25478115 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?=:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:33389937 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '187,27,0:9' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=187,27,0:9,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '187,27,0:9'
[E::bcf_write] Broken VCF record, the number of columns at chr20:34805043 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '?' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[E::bcf_write] Broken VCF record, the number of columns at chr20:36572282 does not match the number of samples (0 vs 1)
Error: VCF parse error
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023

Command line: java -Xmx20480m -jar beagle.jar
  gl=bm/germline/chr20.gl.vcf.gz
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.gp
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
    at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
    at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:737)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:919)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at vcf.VcfIt.fillEmissionBuffer(VcfIt.java:307)
    at vcf.VcfIt.next(VcfIt.java:363)
    at vcf.VcfIt.next(VcfIt.java:52)
    at vcf.IntervalVcfIt.readNextRecord(IntervalVcfIt.java:110)
    at vcf.IntervalVcfIt.next(IntervalVcfIt.java:92)
    at vcf.IntervalVcfIt.next(IntervalVcfIt.java:36)
    at main.Main.restrictToVcfMarkers(Main.java:343)
    at main.Main.allData(Main.java:313)
    at main.Main.main(Main.java:111)
Caused by: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
    at vcf.VcfRecord.fieldCountError(VcfRecord.java:221)
    at vcf.VcfRecord.delimiters(VcfRecord.java:203)
    at vcf.VcfRecord.<init>(VcfRecord.java:87)
    at vcf.VcfRecord.fromGTGL(VcfRecord.java:193)
    at vcf.VcfIt.lambda$static$5(VcfIt.java:76)
    at vcf.VcfIt.lambda$new$8(VcfIt.java:192)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
    at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
gzip: bm/germline/chr20.gp.vcf.gz: No such file or directory
bm/germline/chr20.gp.vcf.gz: No such file or directory
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023

Command line: java -Xmx20480m -jar beagle.jar
  gt=bm/germline/chr20.germline.vcf
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.phased
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: bm/germline/chr20.germline.vcf
null
    at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
    at vcf.VcfHeader.<init>(VcfHeader.java:119)
    at vcf.VcfIt.<init>(VcfIt.java:190)
    at vcf.VcfIt.create(VcfIt.java:175)
    at vcf.VcfIt.create(VcfIt.java:150)
    at main.Main.allData(Main.java:297)
    at main.Main.main(Main.java:111)
[2023-12-12 12:36:05,744] INFO     Monopogen.py Success! See instructions above.

Then when I run the somatic featureInfo step, I get an error:

$ python ${path}/src/Monopogen.py  somatic  \
>     -a ${path}/apps \
>     -r region.lst \
>     -t 50 \
>     -i bm \
>     -l CB_7K.maester_scRNA.csv \
>     -s featureInfo \
>     -g chr20_2Mb.hg38.fa 

[2023-12-12 12:38:06,217] INFO     Monopogen.py Get feature information from sequencing data...
[E::hts_open_format] Failed to open file "bm/germline/chr20.phased.vcf.gz" : No such file or directory
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/nfs/team205/at31/software/Monopogen/src/somatic.py", line 88, in featureInfo
    vcf_in = VariantFile(out + "/germline/" +  region + ".phased.vcf.gz")
  File "pysam/libcbcf.pyx", line 4117, in pysam.libcbcf.VariantFile.__init__
  File "pysam/libcbcf.pyx", line 4342, in pysam.libcbcf.VariantFile.open
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 436, in <module>
    main()
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 429, in main
    args.func(args)
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 150, in somatic
    result = pool.map(featureInfo, joblst)
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory

Are you familiar with this issue? I re-downloaded chr20.maester_scRNA.bam multiple times to make sure it wasn't somehow truncated during the download, but the issue persists. Do you have any idea what I'm doing wrong?

Thanks so much in advance!

jinzhuangdou commented 7 months ago

Could you check whether the bam file looks normal after the preprocessing step?

alextidd commented 7 months ago

bm_chr20.filter.bam seems right, but chr20.filter.targeted.bam and merge.filter.targeted.bam are empty:

$ cat bm/Bam/chr20.filter.bam.lst 
bm/Bam/bm_chr20.filter.bam

$ samtools view bm/Bam/bm_chr20.filter.bam | head -2
SRR15598776.294805248_TCTTGCGCAGGCACAC_CTCTCCTCGCCA 2064    chr20   60218   0   10H20M61H   *   0   0   GCAATCCTTTCCTCTCCATT    ,,:,:,,,,,,,,,,,,::F    NM:i:0  MD:Z:20 AS:i:20 XS:i:19 SA:Z:5,113545723,-,22S37M32S,0,0;5,156763440,+,22S19M50S,0,0;   XA:Z:11,+88315939,60S19M12S,0;
SRR15598776.80450257_AATGAAGTCGCGGACT_AAGTATACGCTC  16  chr20   61569   0   43S19M29S   *   0   0   TATACATATGTCCCCTCCCTCTTGAATCTCACCTTCTACCCCCGACTCCATTCCACTCCTCTAGGTTGTCACAGAGCACCGCATTTGGCTC FF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0  MD:Z:19 AS:i:19 XS:i:19 SA:Z:5,64380457,-,19S19M53S,0,0;15,76070764,-,5S19M67S,0,0;X,10097594,-,67S19M5S,0,0;   XA:Z:Y,-10959620,44S19M28S,0;KI270736.1,+62338,28S19M44S,0;

$ samtools view bm/Bam/bm_chr20.filter.bam | tail -2
SRR15598776.418134198_TTTCATGAGCAACTTC_TCACGTTTCCTA 2064    chr20   39999757    0   25H24M42H   *   0   0   ATATTTAAAATTTTATTTTTTAAT    ,,:FF,:,FF,,::FF,,,FF,:,    NM:i:0  MD:Z:24 AS:i:24 XS:i:23 SA:Z:X,46457624,-,3M1I25M62S,0,1;6,124139067,-,53S23M15S,0,0;7,129403981,-,66S22M3S,0,0;8,102417610,+,30S20M41S,0,0;
SRR15598776.180926021_GACTCAAAGTTCACTG_ACCATATAAAAT 2048    chr20   39999869    0   70H21M  *   0   0   ATTTACTATAAAAATAGTTAC   F,,:,,,,,:::,,,,,,,,,   NM:i:1  MD:Z:20T0   AS:i:20 XS:i:20 SA:Z:1,82576961,-,21S39M31S,0,0;4,180292349,+,20S19M52S,0,0;    XA:Z:5,-14292491,7S25M59S,1;11,+95173927,67S19M5S,0;14,-21926828,8S19M64S,0;

$ samtools view bm/Bam/chr20.filter.targeted.bam 

$ samtools view bm/Bam/merge.filter.targeted.bam

$ ls -ltrh bm/Bam/
total 347M
drwxr-sr-x 2 at31 team205 632K Dec 11 17:48 split_bam
-rw-r--r-- 1 at31 team205 2.9K Dec 12 11:38 chr20.filter.targeted.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 11:38 chr20.filter.targeted.bam.bai
-rw-r--r-- 1 at31 team205 2.9K Dec 12 11:38 merge.filter.targeted.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 11:38 merge.filter.targeted.bam.bai
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr5.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr6.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr3.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr4.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr7.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr1.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr8.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr2.filter.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr6.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr5.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr3.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr7.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr1.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr4.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr8.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr2.filter.bam.bai
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr10.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr9.filter.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr10.filter.bam.bai
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr12.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr11.filter.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr9.filter.bam.bai
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr14.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr15.filter.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr12.filter.bam.bai
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr13.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr16.filter.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr11.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr14.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr15.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr13.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr16.filter.bam.bai
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr17.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr18.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr19.filter.bam
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr21.filter.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr17.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr18.filter.bam.bai
-rw-r--r-- 1 at31 team205 2.8K Dec 12 12:22 bm_chr22.filter.bam
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr19.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr21.filter.bam.bai
-rw-r--r-- 1 at31 team205 1.6K Dec 12 12:22 bm_chr22.filter.bam.bai
-rw-r--r-- 1 at31 team205 237M Dec 12 12:23 bm_chr20.filter.bam
-rw-r--r-- 1 at31 team205  65K Dec 12 12:24 bm_chr20.filter.bam.bai
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr1.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr2.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr3.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr4.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr5.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr6.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr7.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr8.filter.bam.lst
-rw-r--r-- 1 at31 team205   26 Dec 12 12:24 chr9.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr10.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr11.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr12.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr13.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr14.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr15.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr16.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr17.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr18.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr19.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr20.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr21.filter.bam.lst
-rw-r--r-- 1 at31 team205   27 Dec 12 12:24 chr22.filter.bam.lst

Do you know how this could have happened?

SuqinYang commented 7 months ago

Hi, I want to know how you got the "chr20.maester_scRNA.bam.bai" file, the bai file I got with 'samtools index' gets an error when it runs. Error: ValueError: fetch called on bamfile without index.

jinzhuangdou commented 7 months ago

could you have a try by using the samtools included in the app? Not sure the bam files have different format under different samtools version.

alextidd commented 6 months ago

Hi, is there any update on my issue? Thanks!

ariannadib commented 2 months ago

hello! @alextidd I have a similar problem but with the somatic module can I ask you at what point in the process did you generate the filter.targeted.bam files? in the preprocess module? Thanks in advance

WaqarHusain commented 1 week ago

Hi, I have the same problem. Did someone manage to resolve it?