Open MetteBoge opened 6 months ago
Could you list the files in the generated working folder "germline"? Were any *.phased.vcf.gz files generated? Also, are you working on single-cell data? The Beagle step should not take this long, since the data is quite sparse.
Hi, these are the files I have in the germline dir:
chr10:100000001-133797422.gl.vcf.gz chr2:150000001-200000001.phased.vcf.gz
chr10:100000001-133797422.gp.log chr2:1-50000001.gl.vcf.gz
chr10:100000001-133797422.gp.vcf.gz chr2:1-50000001.gp.log
chr10:100000001-133797422.phased.log chr2:1-50000001.gp.vcf.gz
chr10:100000001-133797422.phased.vcf.gz chr2:1-50000001.phased.log
chr10:1-50000001.gl.vcf.gz chr2:1-50000001.phased.vcf.gz
chr10:1-50000001.gp.log chr21.gl.vcf.gz
chr10:1-50000001.gp.vcf.gz chr21.gp.log
chr10:1-50000001.phased.log chr21.gp.vcf.gz
chr10:1-50000001.phased.vcf.gz chr21.phased.log
chr10:50000001-100000001.gl.vcf.gz chr21.phased.vcf.gz
chr10:50000001-100000001.gp.log chr2:200000001-242193529.gl.vcf.gz
chr10:50000001-100000001.gp.vcf.gz chr2:200000001-242193529.gp.log
chr10:50000001-100000001.phased.log chr2:200000001-242193529.gp.vcf.gz
chr10:50000001-100000001.phased.vcf.gz chr2:200000001-242193529.phased.log
chr1:100000001-150000001.gl.vcf.gz chr2:200000001-242193529.phased.vcf.gz
chr1:100000001-150000001.gp.log chr22.gl.vcf.gz
chr1:100000001-150000001.gp.vcf.gz chr22.gp.log
chr1:100000001-150000001.phased.log chr22.gp.vcf.gz
chr1:100000001-150000001.phased.vcf.gz chr22.phased.log
chr11:100000001-135086622.gl.vcf.gz chr22.phased.vcf.gz
chr11:100000001-135086622.gp.log chr2:50000001-100000001.gl.vcf.gz
chr11:100000001-135086622.gp.vcf.gz chr2:50000001-100000001.gp.log
chr11:100000001-135086622.phased.log chr2:50000001-100000001.gp.vcf.gz
chr11:100000001-135086622.phased.vcf.gz chr2:50000001-100000001.phased.log
chr11:1-50000001.gl.vcf.gz chr2:50000001-100000001.phased.vcf.gz
chr11:1-50000001.gp.log chr3:100000001-150000001.gl.vcf.gz
chr11:1-50000001.gp.vcf.gz chr3:100000001-150000001.gp.log
chr11:1-50000001.phased.log chr3:100000001-150000001.gp.vcf.gz
chr11:1-50000001.phased.vcf.gz chr3:100000001-150000001.phased.log
chr11:50000001-100000001.gl.vcf.gz chr3:100000001-150000001.phased.vcf.gz
chr11:50000001-100000001.gp.log chr3:150000001-198295559.gl.vcf.gz
chr11:50000001-100000001.gp.vcf.gz chr3:150000001-198295559.gp.log
chr11:50000001-100000001.phased.log chr3:150000001-198295559.gp.vcf.gz
chr11:50000001-100000001.phased.vcf.gz chr3:150000001-198295559.phased.log
chr1:150000001-200000001.gl.vcf.gz chr3:150000001-198295559.phased.vcf.gz
chr1:150000001-200000001.gp.log chr3:1-50000001.gl.vcf.gz
chr1:150000001-200000001.gp.vcf.gz chr3:1-50000001.gp.log
chr1:150000001-200000001.phased.log chr3:1-50000001.gp.vcf.gz
chr1:150000001-200000001.phased.vcf.gz chr3:1-50000001.phased.log
chr1:1-50000001.gl.vcf.gz chr3:1-50000001.phased.vcf.gz
chr1:1-50000001.gp.log chr3:50000001-100000001.gl.vcf.gz
chr1:1-50000001.gp.vcf.gz chr3:50000001-100000001.gp.log
chr1:1-50000001.phased.log chr3:50000001-100000001.gp.vcf.gz
chr1:1-50000001.phased.vcf.gz chr3:50000001-100000001.phased.log
chr1:200000001-248956422.gl.vcf.gz chr3:50000001-100000001.phased.vcf.gz
chr1:200000001-248956422.gp.log chr4:100000001-150000001.gl.vcf.gz
chr1:200000001-248956422.gp.vcf.gz chr4:100000001-150000001.gp.log
chr1:200000001-248956422.phased.log chr4:100000001-150000001.gp.vcf.gz
chr1:200000001-248956422.phased.vcf.gz chr4:100000001-150000001.phased.log
chr12:100000001-133275309.gl.vcf.gz chr4:100000001-150000001.phased.vcf.gz
chr12:100000001-133275309.gp.log chr4:150000001-190214555.gl.vcf.gz
chr12:100000001-133275309.gp.vcf.gz chr4:150000001-190214555.gp.log
chr12:100000001-133275309.phased.log chr4:150000001-190214555.gp.vcf.gz
chr12:100000001-133275309.phased.vcf.gz chr4:150000001-190214555.phased.log
chr12:1-50000001.gl.vcf.gz chr4:150000001-190214555.phased.vcf.gz
chr12:1-50000001.gp.log chr4:1-50000001.gl.vcf.gz
chr12:1-50000001.gp.vcf.gz chr4:1-50000001.gp.log
chr12:1-50000001.phased.log chr4:1-50000001.gp.vcf.gz
chr12:1-50000001.phased.vcf.gz chr4:1-50000001.phased.log
chr12:50000001-100000001.gl.vcf.gz chr4:1-50000001.phased.vcf.gz
chr12:50000001-100000001.gp.log chr4:50000001-100000001.gl.vcf.gz
chr12:50000001-100000001.gp.vcf.gz chr4:50000001-100000001.gp.log
chr12:50000001-100000001.phased.log chr4:50000001-100000001.gp.vcf.gz
chr12:50000001-100000001.phased.vcf.gz chr4:50000001-100000001.phased.log
chr13:100000001-114364328.gl.vcf.gz chr4:50000001-100000001.phased.vcf.gz
chr13:100000001-114364328.gp.log chr5:100000001-150000001.gl.vcf.gz
chr13:100000001-114364328.gp.vcf.gz chr5:100000001-150000001.gp.log
chr13:100000001-114364328.phased.log chr5:100000001-150000001.gp.vcf.gz
chr13:100000001-114364328.phased.vcf.gz chr5:100000001-150000001.phased.log
chr13:1-50000001.gl.vcf.gz chr5:100000001-150000001.phased.vcf.gz
chr13:1-50000001.gp.log chr5:150000001-181538259.gl.vcf.gz
chr13:1-50000001.gp.vcf.gz chr5:150000001-181538259.gp.log
chr13:1-50000001.phased.log chr5:150000001-181538259.gp.vcf.gz
chr13:1-50000001.phased.vcf.gz chr5:150000001-181538259.phased.log
chr13:50000001-100000001.gl.vcf.gz chr5:150000001-181538259.phased.vcf.gz
chr13:50000001-100000001.gp.log chr5:1-50000001.gl.vcf.gz
chr13:50000001-100000001.gp.vcf.gz chr5:1-50000001.gp.log
chr13:50000001-100000001.phased.log chr5:1-50000001.gp.vcf.gz
chr13:50000001-100000001.phased.vcf.gz chr5:1-50000001.phased.log
chr14:100000001-107043718.gl.vcf.gz chr5:1-50000001.phased.vcf.gz
chr14:100000001-107043718.gp.log chr5:50000001-100000001.gl.vcf.gz
chr14:100000001-107043718.gp.vcf.gz chr5:50000001-100000001.gp.log
chr14:100000001-107043718.phased.log chr5:50000001-100000001.gp.vcf.gz
chr14:100000001-107043718.phased.vcf.gz chr5:50000001-100000001.phased.log
chr14:1-50000001.gl.vcf.gz chr5:50000001-100000001.phased.vcf.gz
chr14:1-50000001.gp.log chr6:100000001-150000001.gl.vcf.gz
chr14:1-50000001.gp.vcf.gz chr6:100000001-150000001.gp.log
chr14:1-50000001.phased.log chr6:100000001-150000001.gp.vcf.gz
chr14:1-50000001.phased.vcf.gz chr6:100000001-150000001.phased.log
chr14:50000001-100000001.gl.vcf.gz chr6:100000001-150000001.phased.vcf.gz
chr14:50000001-100000001.gp.log chr6:150000001-170805979.gl.vcf.gz
chr14:50000001-100000001.gp.vcf.gz chr6:150000001-170805979.gp.log
chr14:50000001-100000001.phased.log chr6:150000001-170805979.gp.vcf.gz
chr14:50000001-100000001.phased.vcf.gz chr6:150000001-170805979.phased.log
chr1:50000001-100000001.gl.vcf.gz chr6:150000001-170805979.phased.vcf.gz
chr1:50000001-100000001.gp.log chr6:1-50000001.gl.vcf.gz
chr1:50000001-100000001.gp.vcf.gz chr6:1-50000001.gp.log
chr1:50000001-100000001.phased.log chr6:1-50000001.gp.vcf.gz
chr1:50000001-100000001.phased.vcf.gz chr6:1-50000001.phased.log
chr15:100000001-101991189.gl.vcf.gz chr6:1-50000001.phased.vcf.gz
chr15:100000001-101991189.gp.log chr6:50000001-100000001.gl.vcf.gz
chr15:100000001-101991189.gp.vcf.gz chr6:50000001-100000001.gp.log
chr15:100000001-101991189.phased.log chr6:50000001-100000001.gp.vcf.gz
chr15:100000001-101991189.phased.vcf.gz chr6:50000001-100000001.phased.log
chr15:1-50000001.gl.vcf.gz chr6:50000001-100000001.phased.vcf.gz
chr15:1-50000001.gp.log chr7:100000001-150000001.gl.vcf.gz
chr15:1-50000001.gp.vcf.gz chr7:100000001-150000001.gp.log
chr15:1-50000001.phased.log chr7:100000001-150000001.gp.vcf.gz
chr15:1-50000001.phased.vcf.gz chr7:100000001-150000001.phased.log
chr15:50000001-100000001.gl.vcf.gz chr7:100000001-150000001.phased.vcf.gz
chr15:50000001-100000001.gp.log chr7:150000001-159345973.gl.vcf.gz
chr15:50000001-100000001.gp.vcf.gz chr7:150000001-159345973.gp.log
chr15:50000001-100000001.phased.log chr7:150000001-159345973.gp.vcf.gz
chr15:50000001-100000001.phased.vcf.gz chr7:150000001-159345973.phased.log
chr16:1-50000001.gl.vcf.gz chr7:150000001-159345973.phased.vcf.gz
chr16:1-50000001.gp.log chr7:1-50000001.gl.vcf.gz
chr16:1-50000001.gp.vcf.gz chr7:1-50000001.gp.log
chr17:1-50000001.gl.vcf.gz chr7:1-50000001.gp.vcf.gz
chr17:1-50000001.gp.log chr7:1-50000001.phased.log
chr17:1-50000001.gp.vcf.gz chr7:1-50000001.phased.vcf.gz
chr17:1-50000001.phased.log chr7:50000001-100000001.gl.vcf.gz
chr17:1-50000001.phased.vcf.gz chr7:50000001-100000001.gp.log
chr17:50000001-83257441.gl.vcf.gz chr7:50000001-100000001.gp.vcf.gz
chr17:50000001-83257441.gp.log chr7:50000001-100000001.phased.log
chr17:50000001-83257441.gp.vcf.gz chr7:50000001-100000001.phased.vcf.gz
chr17:50000001-83257441.phased.log chr8:100000001-145138636.gl.vcf.gz
chr17:50000001-83257441.phased.vcf.gz chr8:100000001-145138636.gp.log
chr18:1-50000001.gl.vcf.gz chr8:100000001-145138636.gp.vcf.gz
chr18:1-50000001.gp.log chr8:100000001-145138636.phased.log
chr18:1-50000001.gp.vcf.gz chr8:100000001-145138636.phased.vcf.gz
chr18:1-50000001.phased.log chr8:1-50000001.gl.vcf.gz
chr18:1-50000001.phased.vcf.gz chr8:1-50000001.gp.log
chr18:50000001-80373285.gl.vcf.gz chr8:1-50000001.gp.vcf.gz
chr18:50000001-80373285.gp.log chr8:1-50000001.phased.log
chr18:50000001-80373285.gp.vcf.gz chr8:1-50000001.phased.vcf.gz
chr18:50000001-80373285.phased.log chr8:50000001-100000001.gl.vcf.gz
chr18:50000001-80373285.phased.vcf.gz chr8:50000001-100000001.gp.log
chr19.gl.vcf.gz chr8:50000001-100000001.gp.vcf.gz
chr19.gp.log chr8:50000001-100000001.phased.log
chr19.gp.vcf.gz chr8:50000001-100000001.phased.vcf.gz
chr19.phased.log chr9:100000001-138394717.gl.vcf.gz
chr19.phased.vcf.gz chr9:100000001-138394717.gp.log
chr20.gl.vcf.gz chr9:100000001-138394717.gp.vcf.gz
chr20.gp.log chr9:100000001-138394717.phased.log
chr20.gp.vcf.gz chr9:100000001-138394717.phased.vcf.gz
chr20.phased.log chr9:1-50000001.gl.vcf.gz
chr20.phased.vcf.gz chr9:1-50000001.gp.log
chr2:100000001-150000001.gl.vcf.gz chr9:1-50000001.gp.vcf.gz
chr2:100000001-150000001.gp.log chr9:1-50000001.phased.log
chr2:100000001-150000001.gp.vcf.gz chr9:1-50000001.phased.vcf.gz
chr2:100000001-150000001.phased.log chr9:50000001-100000001.gl.vcf.gz
chr2:100000001-150000001.phased.vcf.gz chr9:50000001-100000001.gp.log
chr2:150000001-200000001.gl.vcf.gz chr9:50000001-100000001.gp.vcf.gz
chr2:150000001-200000001.gp.log chr9:50000001-100000001.phased.log
chr2:150000001-200000001.gp.vcf.gz chr9:50000001-100000001.phased.vcf.gz
chr2:150000001-200000001.phased.log
I am working on cell-free RNA-seq, so no, not single-cell RNA-seq data. I am using this tool because I expect the data quality and sparsity to be similar to scRNA-seq.
That explains the long Monopogen run time, since cell-free RNA-seq may cover more genome regions than single-cell data. Could you check whether all regions have finished? For example, run cat *.phased.log | grep finished | wc -l
to see how many files finished. Some regions/segments may have failed, and Monopogen may be stuck on them.
cat *phased.log | grep finished | wc -l
62
Same number of regions, so I don't think it is because it is still running. I checked the CPU usage of the process in htop: after 24 h (and more than 12 h since the last update to any output file), it is using very little, almost no, CPU compared to when it is writing files. I don't think it is actively doing anything; it is just not exiting.
And I was running it on 11 samples at the same time. Not one finished. With bulk variant callers I expect around 10k variants to be found in the bam files. So I doubt it is a problem with data size.
Thanks for checking. Do you mean you have only 10K variants across all 22 chromosomes, or in each region? I may debug this a bit.
I ran DeepVariant on the data, and after filtering (DP<10, which might be too stringent...) I had 11k variants. I am still using the output VCF files from Monopogen germline, since they seem to be complete. The only problem is that the job does not quit. After filtering the merged VCF from Monopogen, I find approx. 30K variants. So a 3-fold difference, which is low but still much better. I really hope you can help me fix the non-exiting job so I can run it in a Nextflow pipeline.
Sure. I will get back to you after I examine the job collection procedure.
Given that you do not have many markers, it is better to run imputation on whole chromosomes. You can do this by providing a region list that contains just chr1, chr2, ...
This will reduce the job collection complexity. Alternatively, you can skip the job collection module and merge the phased VCF files outside of Monopogen: just run all the sub-jobs listed in joblst (lines 123-127) of https://github.com/KChen-lab/Monopogen/blob/main/src/Monopogen.py
My region list is the one from your resources:
chr1,1,50000001
chr1,50000001,100000001
chr1,100000001,150000001
chr1,150000001,200000001
chr1,200000001,248956422
chr2,1,50000001
chr2,50000001,100000001
chr2,100000001,150000001
chr2,150000001,200000001
chr2,200000001,242193529
chr3,1,50000001
chr3,50000001,100000001
chr3,100000001,150000001
chr3,150000001,198295559
chr4,1,50000001
chr4,50000001,100000001
chr4,100000001,150000001
chr4,150000001,190214555
chr5,1,50000001
chr5,50000001,100000001
chr5,100000001,150000001
chr5,150000001,181538259
chr6,1,50000001
...
Do you suggest I just run it as chr1, chr2, chr3, without the subdivision into chromosome locations? As of now I just merge the 62 phased VCF files with bcftools concat. That seems to work fine.
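For anyone merging the per-region files outside Monopogen, here is a minimal sketch of building the bcftools concat call with the files in genomic order. It assumes the file naming shown in the listing above (chrN:start-end.phased.vcf.gz, or chrN.phased.vcf.gz for unsplit chromosomes); genomic_key and concat_command are hypothetical helper names, not part of Monopogen. The point is that shell glob order is not genomic order (for example, chr10 and chr1:100000001-... sort before chr1:50000001-...), so the records in the merged file may end up out of position order unless the files are sorted first.

```python
import re
from pathlib import Path

# Matches 'chr1:1-50000001.phased.vcf.gz' as well as 'chr21.phased.vcf.gz'.
# The naming pattern is an assumption taken from the directory listing.
_REGION = re.compile(r"^chr(\w+?)(?::(\d+)-(\d+))?\.phased\.vcf\.gz$")


def genomic_key(path):
    """Sort key: chromosome number first, then region start position."""
    match = _REGION.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    chrom, start = match.group(1), match.group(2)
    # Non-numeric chromosome names (e.g. chrX) are placed after the autosomes.
    chrom_rank = int(chrom) if chrom.isdigit() else 1000
    return (chrom_rank, int(start) if start else 0)


def concat_command(files, output):
    """Build a 'bcftools concat' argument list with files in genomic order."""
    ordered = sorted(files, key=genomic_key)
    # -O z writes bgzip-compressed VCF, -o names the output file.
    return ["bcftools", "concat", "-O", "z", "-o", str(output),
            *[str(f) for f in ordered]]
```

The returned list can be passed to subprocess.run(cmd, check=True), for example with files = Path("germline").glob("*.phased.vcf.gz").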
Thank you for looking into it !
Yes, you can run imputation/phasing on whole chromosomes, and it will further increase genotyping accuracy since the marker panel is very sparse.
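A whole-chromosome region list could be generated with something like the sketch below. It assumes that the region list accepts one bare chromosome name per line, which is how I read the "chr1 chr2 ..." suggestion and the unsplit chr19 to chr22 outputs in the listing; the function names and the output file name are hypothetical.

```python
def whole_chromosome_regions(chromosomes=range(1, 23)):
    """Return region names for whole autosomes: chr1 ... chr22."""
    return [f"chr{c}" for c in chromosomes]


def write_region_list(path, regions):
    """Write one region per line (assumed format for a whole-chromosome list)."""
    with open(path, "w") as handle:
        handle.write("\n".join(regions) + "\n")
```

For example, write_region_list("GRCh38.wholechr.lst", whole_chromosome_regions()) produces a 22-line file that can be passed in place of the split region list.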
Thanks, I will try that
@jinzhuangdou Hi again, I ran it again on different samples. I found out that I am in fact missing a .phased.log file (and a phased.vcf file), and it seems to be for chr16 in all of the samples (I checked a couple). Could it be due to a flaw in the imputation panel file? From a quick look, it is about the same size as the imputation panel file for chr17. I hope this gives you a clue to what is going on. Best regards
Hi, I am running PreProcess and then Germline in a Nextflow process. But for some reason, the job never finishes even though all the output files are written to the output directory. I get "beagle finished" but no "Monopogen.py Success! See instructions above." printed for germline (only for preProcess).
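A check along these lines can list exactly which regions lack phased output, rather than only counting the finished logs. This is a rough sketch, assuming the region list uses the chr,start,end format (or a bare chromosome name per line) and that outputs are named <region>.phased.vcf.gz as in the listing above; region_name and missing_phased are hypothetical helpers, not Monopogen functions.

```python
from pathlib import Path


def region_name(line):
    """Convert a region-list line to the file prefix used in the germline folder.

    'chr1,1,50000001' becomes 'chr1:1-50000001'; a bare 'chr21' stays 'chr21'.
    """
    fields = line.strip().split(",")
    if len(fields) == 3:
        chrom, start, end = fields
        return f"{chrom}:{start}-{end}"
    return fields[0]


def missing_phased(region_lines, germline_dir):
    """Return the regions that have no <region>.phased.vcf.gz in germline_dir."""
    germline = Path(germline_dir)
    missing = []
    for line in region_lines:
        if not line.strip():
            continue  # skip blank lines in the region list
        name = region_name(line)
        if not (germline / f"{name}.phased.vcf.gz").exists():
            missing.append(name)
    return missing
```

Calling missing_phased(open("GRCh38.region.lst"), "germline") returns the region names worth inspecting, for example by looking at the matching .gp.log files.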
The process:
The input files:
GRCh38.primary_assembly.genome.fa (indexed in same directory)
Directory with imputation panel files from the FTP link you have provided
Region list: Monopogen/resource/GRCh38.region.lst
Htop indicates that beagle is still running (more than 12h after last 'beagle finished' in log).
Can you maybe help me figure out why the job won't finish?
Best regards, Mette