KChen-lab / Monopogen

SNV calling from single cell sequencing
GNU General Public License v3.0

germ line module not detecting full chromosome #65

Open aidanshoham12 opened 4 weeks ago

aidanshoham12 commented 4 weeks ago

Hello, I seem to be having an issue with the germline module. I'm working on the bone marrow single-cell sample provided in the GitHub repository. When I run the germline module, I detect far fewer SNVs than I should (around 600 detected by me versus 10000 in your tutorial):

Window 1 [ chr20:273372-1542468 ]
reference markers: 635
target markers: 635

From what I see, the lower number of SNVs is probably due to the tool using a smaller range of chr20 (as shown in the resource directory). I specified chr20 in the region.lst file the same way you do and still can't seem to scan the entire chromosome. The time spent building the model also seems much shorter than it should be:

Number of markers: 576
Total time for building model: 9 seconds
Total time for sampling: 0 seconds
Total run time: 15 seconds

From all of this, I think the tool might be having trouble determining the region in which to detect SNVs. As I understood it, specifying just chr20 in the region.lst file should be enough to cover the entire chromosome 20 and not only a section of it. Do you have any idea how to get the tool to recognize a broader section of the chromosome? I'd be open to any ideas.

Thank you so much for your help!
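One way to tell whether the small window comes from region.lst or from the reference panel itself is to look at the position range the panel VCF actually covers. The following is a minimal sketch, not part of Monopogen: the function name `vcf_position_span` is hypothetical, and it assumes the panel is a gzip- or bgzip-compressed VCF whose first two tab-separated columns are CHROM and POS.

```python
import gzip
import sys
from collections import defaultdict


def vcf_position_span(path):
    """Return {chrom: (min_pos, max_pos, n_records)} for a gzipped VCF.

    Hypothetical helper for illustration; not a Monopogen function.
    """
    spans = defaultdict(lambda: [None, None, 0])
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank, meta-information, and header lines
            chrom, pos = line.split("\t", 2)[:2]
            pos = int(pos)
            entry = spans[chrom]
            entry[0] = pos if entry[0] is None else min(entry[0], pos)
            entry[1] = pos if entry[1] is None else max(entry[1], pos)
            entry[2] += 1
    return {chrom: tuple(values) for chrom, values in spans.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the panel VCF as the first argument.
    for chrom, (lo, hi, n) in vcf_position_span(sys.argv[1]).items():
        print(f"{chrom}: {lo}-{hi} ({n} records)")
```

If the reported range is close to chr20:273372-1542468, the panel file only covers that stretch, and the window in the log reflects the panel rather than the region.lst entry.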

aidanshoham12 commented 4 weeks ago

Hello, I just wanted to give an update on the issue above. I think the cause was that I was using the CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz file, which contained only 2 Mb of SNVs instead of the whole chromosome. That would explain why I'm detecting a smaller number of SNVs. I tried rerunning the sample using the downloaded panel for chromosome 1 and got the following error message:

(...)
[2024-06-24 10:21:24,990] INFO germline.py --nthreads = [1]
[2024-06-24 10:21:24,990] INFO germline.py --norun = [FALSE]
[2024-06-24 10:21:24,990] INFO Monopogen.py Checking existence of essenstial resource files...
[2024-06-24 10:21:25,004] INFO Monopogen.py Checking dependencies...
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Lines total/split/realigned/skipped: 209691145/485374/116534/0
Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.HashMap.resize(HashMap.java:702)
    at java.base/java.util.HashMap.putVal(HashMap.java:661)
    at java.base/java.util.HashMap.put(HashMap.java:610)
    at java.base/java.util.HashSet.add(HashSet.java:221)
    at main.Main.restrictToVcfMarkers(Main.java:343)
    at main.Main.allData(Main.java:313)
    at main.Main.main(Main.java:111)
gzip: path/to/germline/chr1.gp.vcf.gz: No such file or directory
path/to/germline/chr1.gp.vcf.gz: No such file or directory
Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: path/to/germline/chr1.germline.vcf
null
    at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
    at vcf.VcfHeader.(VcfHeader.java:119)
    at vcf.VcfIt.(VcfIt.java:190)
    at vcf.VcfIt.create(VcfIt.java:175)
    at vcf.VcfIt.create(VcfIt.java:150)
    at main.Main.allData(Main.java:297)
    at main.Main.main(Main.java:111)
[2024-06-24 12:00:52,537] INFO Monopogen.py Success! See instructions above.

I'm not encountering this issue with the 2 Mb version of CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz, but I am having issues when using the whole chromosome. I think it might be related to the amount of RAM given to Beagle, as shown by the "Picked up JAVA_TOOL_OPTIONS: -Xmx2g" line. Is it possible to raise the limit above 2g? I'm open to any suggestions.

Thank you so much for your help!
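The "Picked up JAVA_TOOL_OPTIONS" message is printed by the JVM when that environment variable is set, so one thing to try is launching the pipeline with a larger -Xmx value in that variable. The sketch below is an illustration only: `with_java_heap` is a hypothetical helper, the 16g default is an arbitrary example that must fit the machine's available memory, and it is an assumption that Monopogen does not pass its own -Xmx on the Beagle command line (if it does, that explicit setting may take precedence).

```python
import os
import re


def with_java_heap(env, heap="16g"):
    """Return a copy of env whose JAVA_TOOL_OPTIONS carries -Xmx<heap>.

    Hypothetical helper for illustration; not a Monopogen function.
    """
    new_env = dict(env)
    current = new_env.get("JAVA_TOOL_OPTIONS", "")
    # Drop any existing -Xmx setting (e.g. -Xmx2g) and keep other options.
    kept = [
        opt for opt in current.split()
        if not re.fullmatch(r"-Xmx\d+[kKmMgG]?", opt)
    ]
    kept.append(f"-Xmx{heap}")
    new_env["JAVA_TOOL_OPTIONS"] = " ".join(kept)
    return new_env


# Example use (monopogen_cmd stands for your existing germline command line):
#   import subprocess
#   subprocess.run(monopogen_cmd, env=with_java_heap(os.environ), check=True)
```

The helper copies the environment instead of modifying it in place, so the larger heap applies only to the process started with it.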

jinzhuangdou commented 3 weeks ago

Do you have a chr1.gl.vcf.gz file generated in the germline folder? How many variants are included in the file?
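The question above can be answered with a few lines of Python. This is a minimal sketch with a hypothetical function name, `summarize_vcf`. It counts data records (not SNVs specifically) and also reports whether the #CHROM header line is present, since the second Beagle exception in the log complains about that line being missing.

```python
import gzip
import sys


def summarize_vcf(path):
    """Return (has_chrom_header, n_records) for a plain or gzipped VCF.

    Hypothetical helper for illustration; not a Monopogen function.
    """
    opener = gzip.open if path.endswith(".gz") else open
    has_header = False
    n_records = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                has_header = True
            elif line.strip() and not line.startswith("#"):
                n_records += 1  # any non-blank, non-header line is a record
    return has_header, n_records


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the VCF path (for example the .gl.vcf.gz file) as the first argument.
    header, n = summarize_vcf(sys.argv[1])
    print(f"#CHROM header present: {header}; records: {n}")
```

An empty or header-only file returns a record count of 0, which would be consistent with an upstream step having failed before writing any variants.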

aidanshoham12 commented 3 weeks ago

Hello, here are the three files produced after running the germline module, with their respective sizes:

8195748066, chr1.gl.vcf.gz
852, chr1.gp.log
859, chr1.phased.log

The chr1.gl.vcf.gz file contains 871954 SNVs detected by Monopogen. This seems to be around 4X the normal amount (usually 249000 on chr1, according to a Google search). For reference, this is a tumor sample and is expected to contain far more SNVs than the normal tissues presented in the GitHub tutorial. I am interested in conducting lineage tracing in tumor samples, but I'm not sure whether the large number of SNVs will overwhelm the tool. Let me know what you think.

Thank you again!