KChen-lab / Monopogen

SNV calling from single cell sequencing
GNU General Public License v3.0

Germline job never finishes #47

Open MetteBoge opened 6 months ago

MetteBoge commented 6 months ago

Hi, I am running preProcess and then germline in a single Nextflow process. For some reason, the job never finishes, even though all the output files are written to the output directory. The log shows that beagle finished, but "Monopogen.py Success! See instructions above." is never printed for germline (only for preProcess).

The process:

"""
echo "$sample_name,$input_bam" > input_bam_${sample_name}.lst

python $mono_bin preProcess \
-b input_bam_${sample_name}.lst \
-o output_monopogen_${sample_name}/ \
-a $app_dir \
-t 8

python $mono_bin  germline  \
-a  $app_dir \
-t 8 \
-g  $reference \
-p  $phased_panel/ \
-r  $region \
-s all  \
-o output_monopogen_${sample_name}
"""

The input files:

- GRCh38.primary_assembly.genome.fa (indexed in the same directory)
- Directory with the imputation panel files from the FTP link you provided
- Region list: Monopogen/resource/GRCh38.region.lst

Htop indicates that beagle is still running (more than 12h after last 'beagle finished' in log).

End time: 04:13 PM PDT on 26 Mar 2024
beagle.27Jul16.86a.jar (version 4.1) finished

Can you help me figure out why the job won't finish?

Best regards, Mette

jinzhuangdou commented 6 months ago

Could you list the files in the generated working folder "germline"? Were any *.phased.vcf.gz files generated? Also, are you working with single-cell data? The beagle step should not take this long, since the data is quite sparse.

MetteBoge commented 6 months ago

Hi, these are the files I have in the germline dir:

chr10:100000001-133797422.gl.vcf.gz      chr2:150000001-200000001.phased.vcf.gz
chr10:100000001-133797422.gp.log         chr2:1-50000001.gl.vcf.gz
chr10:100000001-133797422.gp.vcf.gz      chr2:1-50000001.gp.log
chr10:100000001-133797422.phased.log     chr2:1-50000001.gp.vcf.gz
chr10:100000001-133797422.phased.vcf.gz  chr2:1-50000001.phased.log
chr10:1-50000001.gl.vcf.gz               chr2:1-50000001.phased.vcf.gz
chr10:1-50000001.gp.log                  chr21.gl.vcf.gz
chr10:1-50000001.gp.vcf.gz               chr21.gp.log
chr10:1-50000001.phased.log              chr21.gp.vcf.gz
chr10:1-50000001.phased.vcf.gz           chr21.phased.log
chr10:50000001-100000001.gl.vcf.gz       chr21.phased.vcf.gz
chr10:50000001-100000001.gp.log          chr2:200000001-242193529.gl.vcf.gz
chr10:50000001-100000001.gp.vcf.gz       chr2:200000001-242193529.gp.log
chr10:50000001-100000001.phased.log      chr2:200000001-242193529.gp.vcf.gz
chr10:50000001-100000001.phased.vcf.gz   chr2:200000001-242193529.phased.log
chr1:100000001-150000001.gl.vcf.gz       chr2:200000001-242193529.phased.vcf.gz
chr1:100000001-150000001.gp.log          chr22.gl.vcf.gz
chr1:100000001-150000001.gp.vcf.gz       chr22.gp.log
chr1:100000001-150000001.phased.log      chr22.gp.vcf.gz
chr1:100000001-150000001.phased.vcf.gz   chr22.phased.log
chr11:100000001-135086622.gl.vcf.gz      chr22.phased.vcf.gz
chr11:100000001-135086622.gp.log         chr2:50000001-100000001.gl.vcf.gz
chr11:100000001-135086622.gp.vcf.gz      chr2:50000001-100000001.gp.log
chr11:100000001-135086622.phased.log     chr2:50000001-100000001.gp.vcf.gz
chr11:100000001-135086622.phased.vcf.gz  chr2:50000001-100000001.phased.log
chr11:1-50000001.gl.vcf.gz               chr2:50000001-100000001.phased.vcf.gz
chr11:1-50000001.gp.log                  chr3:100000001-150000001.gl.vcf.gz
chr11:1-50000001.gp.vcf.gz               chr3:100000001-150000001.gp.log
chr11:1-50000001.phased.log              chr3:100000001-150000001.gp.vcf.gz
chr11:1-50000001.phased.vcf.gz           chr3:100000001-150000001.phased.log
chr11:50000001-100000001.gl.vcf.gz       chr3:100000001-150000001.phased.vcf.gz
chr11:50000001-100000001.gp.log          chr3:150000001-198295559.gl.vcf.gz
chr11:50000001-100000001.gp.vcf.gz       chr3:150000001-198295559.gp.log
chr11:50000001-100000001.phased.log      chr3:150000001-198295559.gp.vcf.gz
chr11:50000001-100000001.phased.vcf.gz   chr3:150000001-198295559.phased.log
chr1:150000001-200000001.gl.vcf.gz       chr3:150000001-198295559.phased.vcf.gz
chr1:150000001-200000001.gp.log          chr3:1-50000001.gl.vcf.gz
chr1:150000001-200000001.gp.vcf.gz       chr3:1-50000001.gp.log
chr1:150000001-200000001.phased.log      chr3:1-50000001.gp.vcf.gz
chr1:150000001-200000001.phased.vcf.gz   chr3:1-50000001.phased.log
chr1:1-50000001.gl.vcf.gz                chr3:1-50000001.phased.vcf.gz
chr1:1-50000001.gp.log                   chr3:50000001-100000001.gl.vcf.gz
chr1:1-50000001.gp.vcf.gz                chr3:50000001-100000001.gp.log
chr1:1-50000001.phased.log               chr3:50000001-100000001.gp.vcf.gz
chr1:1-50000001.phased.vcf.gz            chr3:50000001-100000001.phased.log
chr1:200000001-248956422.gl.vcf.gz       chr3:50000001-100000001.phased.vcf.gz
chr1:200000001-248956422.gp.log          chr4:100000001-150000001.gl.vcf.gz
chr1:200000001-248956422.gp.vcf.gz       chr4:100000001-150000001.gp.log
chr1:200000001-248956422.phased.log      chr4:100000001-150000001.gp.vcf.gz
chr1:200000001-248956422.phased.vcf.gz   chr4:100000001-150000001.phased.log
chr12:100000001-133275309.gl.vcf.gz      chr4:100000001-150000001.phased.vcf.gz
chr12:100000001-133275309.gp.log         chr4:150000001-190214555.gl.vcf.gz
chr12:100000001-133275309.gp.vcf.gz      chr4:150000001-190214555.gp.log
chr12:100000001-133275309.phased.log     chr4:150000001-190214555.gp.vcf.gz
chr12:100000001-133275309.phased.vcf.gz  chr4:150000001-190214555.phased.log
chr12:1-50000001.gl.vcf.gz               chr4:150000001-190214555.phased.vcf.gz
chr12:1-50000001.gp.log                  chr4:1-50000001.gl.vcf.gz
chr12:1-50000001.gp.vcf.gz               chr4:1-50000001.gp.log
chr12:1-50000001.phased.log              chr4:1-50000001.gp.vcf.gz
chr12:1-50000001.phased.vcf.gz           chr4:1-50000001.phased.log
chr12:50000001-100000001.gl.vcf.gz       chr4:1-50000001.phased.vcf.gz
chr12:50000001-100000001.gp.log          chr4:50000001-100000001.gl.vcf.gz
chr12:50000001-100000001.gp.vcf.gz       chr4:50000001-100000001.gp.log
chr12:50000001-100000001.phased.log      chr4:50000001-100000001.gp.vcf.gz
chr12:50000001-100000001.phased.vcf.gz   chr4:50000001-100000001.phased.log
chr13:100000001-114364328.gl.vcf.gz      chr4:50000001-100000001.phased.vcf.gz
chr13:100000001-114364328.gp.log         chr5:100000001-150000001.gl.vcf.gz
chr13:100000001-114364328.gp.vcf.gz      chr5:100000001-150000001.gp.log
chr13:100000001-114364328.phased.log     chr5:100000001-150000001.gp.vcf.gz
chr13:100000001-114364328.phased.vcf.gz  chr5:100000001-150000001.phased.log
chr13:1-50000001.gl.vcf.gz               chr5:100000001-150000001.phased.vcf.gz
chr13:1-50000001.gp.log                  chr5:150000001-181538259.gl.vcf.gz
chr13:1-50000001.gp.vcf.gz               chr5:150000001-181538259.gp.log
chr13:1-50000001.phased.log              chr5:150000001-181538259.gp.vcf.gz
chr13:1-50000001.phased.vcf.gz           chr5:150000001-181538259.phased.log
chr13:50000001-100000001.gl.vcf.gz       chr5:150000001-181538259.phased.vcf.gz
chr13:50000001-100000001.gp.log          chr5:1-50000001.gl.vcf.gz
chr13:50000001-100000001.gp.vcf.gz       chr5:1-50000001.gp.log
chr13:50000001-100000001.phased.log      chr5:1-50000001.gp.vcf.gz
chr13:50000001-100000001.phased.vcf.gz   chr5:1-50000001.phased.log
chr14:100000001-107043718.gl.vcf.gz      chr5:1-50000001.phased.vcf.gz
chr14:100000001-107043718.gp.log         chr5:50000001-100000001.gl.vcf.gz
chr14:100000001-107043718.gp.vcf.gz      chr5:50000001-100000001.gp.log
chr14:100000001-107043718.phased.log     chr5:50000001-100000001.gp.vcf.gz
chr14:100000001-107043718.phased.vcf.gz  chr5:50000001-100000001.phased.log
chr14:1-50000001.gl.vcf.gz               chr5:50000001-100000001.phased.vcf.gz
chr14:1-50000001.gp.log                  chr6:100000001-150000001.gl.vcf.gz
chr14:1-50000001.gp.vcf.gz               chr6:100000001-150000001.gp.log
chr14:1-50000001.phased.log              chr6:100000001-150000001.gp.vcf.gz
chr14:1-50000001.phased.vcf.gz           chr6:100000001-150000001.phased.log
chr14:50000001-100000001.gl.vcf.gz       chr6:100000001-150000001.phased.vcf.gz
chr14:50000001-100000001.gp.log          chr6:150000001-170805979.gl.vcf.gz
chr14:50000001-100000001.gp.vcf.gz       chr6:150000001-170805979.gp.log
chr14:50000001-100000001.phased.log      chr6:150000001-170805979.gp.vcf.gz
chr14:50000001-100000001.phased.vcf.gz   chr6:150000001-170805979.phased.log
chr1:50000001-100000001.gl.vcf.gz        chr6:150000001-170805979.phased.vcf.gz
chr1:50000001-100000001.gp.log           chr6:1-50000001.gl.vcf.gz
chr1:50000001-100000001.gp.vcf.gz        chr6:1-50000001.gp.log
chr1:50000001-100000001.phased.log       chr6:1-50000001.gp.vcf.gz
chr1:50000001-100000001.phased.vcf.gz    chr6:1-50000001.phased.log
chr15:100000001-101991189.gl.vcf.gz      chr6:1-50000001.phased.vcf.gz
chr15:100000001-101991189.gp.log         chr6:50000001-100000001.gl.vcf.gz
chr15:100000001-101991189.gp.vcf.gz      chr6:50000001-100000001.gp.log
chr15:100000001-101991189.phased.log     chr6:50000001-100000001.gp.vcf.gz
chr15:100000001-101991189.phased.vcf.gz  chr6:50000001-100000001.phased.log
chr15:1-50000001.gl.vcf.gz               chr6:50000001-100000001.phased.vcf.gz
chr15:1-50000001.gp.log                  chr7:100000001-150000001.gl.vcf.gz
chr15:1-50000001.gp.vcf.gz               chr7:100000001-150000001.gp.log
chr15:1-50000001.phased.log              chr7:100000001-150000001.gp.vcf.gz
chr15:1-50000001.phased.vcf.gz           chr7:100000001-150000001.phased.log
chr15:50000001-100000001.gl.vcf.gz       chr7:100000001-150000001.phased.vcf.gz
chr15:50000001-100000001.gp.log          chr7:150000001-159345973.gl.vcf.gz
chr15:50000001-100000001.gp.vcf.gz       chr7:150000001-159345973.gp.log
chr15:50000001-100000001.phased.log      chr7:150000001-159345973.gp.vcf.gz
chr15:50000001-100000001.phased.vcf.gz   chr7:150000001-159345973.phased.log
chr16:1-50000001.gl.vcf.gz               chr7:150000001-159345973.phased.vcf.gz
chr16:1-50000001.gp.log                  chr7:1-50000001.gl.vcf.gz
chr16:1-50000001.gp.vcf.gz               chr7:1-50000001.gp.log
chr17:1-50000001.gl.vcf.gz               chr7:1-50000001.gp.vcf.gz
chr17:1-50000001.gp.log                  chr7:1-50000001.phased.log
chr17:1-50000001.gp.vcf.gz               chr7:1-50000001.phased.vcf.gz
chr17:1-50000001.phased.log              chr7:50000001-100000001.gl.vcf.gz
chr17:1-50000001.phased.vcf.gz           chr7:50000001-100000001.gp.log
chr17:50000001-83257441.gl.vcf.gz        chr7:50000001-100000001.gp.vcf.gz
chr17:50000001-83257441.gp.log           chr7:50000001-100000001.phased.log
chr17:50000001-83257441.gp.vcf.gz        chr7:50000001-100000001.phased.vcf.gz
chr17:50000001-83257441.phased.log       chr8:100000001-145138636.gl.vcf.gz
chr17:50000001-83257441.phased.vcf.gz    chr8:100000001-145138636.gp.log
chr18:1-50000001.gl.vcf.gz               chr8:100000001-145138636.gp.vcf.gz
chr18:1-50000001.gp.log                  chr8:100000001-145138636.phased.log
chr18:1-50000001.gp.vcf.gz               chr8:100000001-145138636.phased.vcf.gz
chr18:1-50000001.phased.log              chr8:1-50000001.gl.vcf.gz
chr18:1-50000001.phased.vcf.gz           chr8:1-50000001.gp.log
chr18:50000001-80373285.gl.vcf.gz        chr8:1-50000001.gp.vcf.gz
chr18:50000001-80373285.gp.log           chr8:1-50000001.phased.log
chr18:50000001-80373285.gp.vcf.gz        chr8:1-50000001.phased.vcf.gz
chr18:50000001-80373285.phased.log       chr8:50000001-100000001.gl.vcf.gz
chr18:50000001-80373285.phased.vcf.gz    chr8:50000001-100000001.gp.log
chr19.gl.vcf.gz                          chr8:50000001-100000001.gp.vcf.gz
chr19.gp.log                             chr8:50000001-100000001.phased.log
chr19.gp.vcf.gz                          chr8:50000001-100000001.phased.vcf.gz
chr19.phased.log                         chr9:100000001-138394717.gl.vcf.gz
chr19.phased.vcf.gz                      chr9:100000001-138394717.gp.log
chr20.gl.vcf.gz                          chr9:100000001-138394717.gp.vcf.gz
chr20.gp.log                             chr9:100000001-138394717.phased.log
chr20.gp.vcf.gz                          chr9:100000001-138394717.phased.vcf.gz
chr20.phased.log                         chr9:1-50000001.gl.vcf.gz
chr20.phased.vcf.gz                      chr9:1-50000001.gp.log
chr2:100000001-150000001.gl.vcf.gz       chr9:1-50000001.gp.vcf.gz
chr2:100000001-150000001.gp.log          chr9:1-50000001.phased.log
chr2:100000001-150000001.gp.vcf.gz       chr9:1-50000001.phased.vcf.gz
chr2:100000001-150000001.phased.log      chr9:50000001-100000001.gl.vcf.gz
chr2:100000001-150000001.phased.vcf.gz   chr9:50000001-100000001.gp.log
chr2:150000001-200000001.gl.vcf.gz       chr9:50000001-100000001.gp.vcf.gz
chr2:150000001-200000001.gp.log          chr9:50000001-100000001.phased.log
chr2:150000001-200000001.gp.vcf.gz       chr9:50000001-100000001.phased.vcf.gz
chr2:150000001-200000001.phased.log

I am working with cell-free RNA-seq, so no, not single-cell RNA-seq data. I am using this tool because I expect the data quality and sparsity to be similar to scRNA-seq.

jinzhuangdou commented 6 months ago

That would explain the long Monopogen run time, since cell-free RNA-seq may cover more genomic regions than single-cell data. Could you check whether all regions have finished? For example, run cat *.phased.log | grep finished | wc -l to see how many have finished. Some regions/segments may have failed, leaving Monopogen stuck on them.
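Counting finished logs only shows whether the totals match; a per-region check would show which region is missing. The following is a minimal sketch, not part of Monopogen. It assumes that a region-list line such as chr1,1,50000001 maps to the output prefix chr1:1-50000001 and that a bare chromosome name maps to itself, as the file names in the listing above suggest. The function names and paths are hypothetical.

```python
from pathlib import Path
from typing import List


def region_prefix(line: str) -> str:
    """Map a region-list line to the output file prefix.

    'chr1,1,50000001' -> 'chr1:1-50000001'; 'chr21' -> 'chr21'.
    The mapping is inferred from the file names in this thread (an assumption).
    """
    fields = line.strip().split(",")
    if len(fields) == 3:
        chrom, start, end = fields
        return f"{chrom}:{start}-{end}"
    return fields[0]


def unfinished_regions(region_lst: str, germline_dir: str) -> List[str]:
    """Return regions whose phased VCF is missing or whose phased log lacks 'finished'."""
    out_dir = Path(germline_dir)
    missing = []
    for line in Path(region_lst).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        prefix = region_prefix(line)
        vcf = out_dir / f"{prefix}.phased.vcf.gz"
        log = out_dir / f"{prefix}.phased.log"
        finished = log.exists() and "finished" in log.read_text(errors="replace")
        if not (vcf.exists() and finished):
            missing.append(prefix)
    return missing
```

Calling unfinished_regions("GRCh38.region.lst", "output_monopogen_sample/germline") (paths are placeholders) would return the prefixes that still lack a completed phasing step, or an empty list if every region finished.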

MetteBoge commented 6 months ago

cat *phased.log | grep finished | wc -l
62

That is the same as the number of regions, so I don't think it is still running. I checked the CPU usage of the process in htop: after 24 h (and more than 12 h since any output file was updated), it uses almost no CPU compared to when it is writing files. I don't think it is actively doing anything; it just isn't exiting.

MetteBoge commented 6 months ago

I was also running it on 11 samples at the same time, and not one finished. With bulk variant callers I expect around 10k variants to be found in the BAM files, so I doubt it is a problem with data size.

jinzhuangdou commented 6 months ago

Thanks for checking. Do you mean you have only 10K variants across all 22 chromosomes, or in each region? I may debug this a bit.

MetteBoge commented 6 months ago

I ran DeepVariant on the data, and after filtering (DP<10, which might be too stringent...) I had 11k variants. I am still using the output VCF files from Monopogen germline, since they seem to be complete; the only problem is that the job does not exit. After filtering the merged VCF from Monopogen, I find approx. 30K variants. That is a 3-fold difference, which is still low but much better. I really hope you can help me fix the non-exiting job, so I can run it in a Nextflow pipeline.

jinzhuangdou commented 6 months ago

Sure. I will get back to you after I examine the job collection procedure.

Given that you do not have too many markers, it is better to run imputation on whole chromosomes. You can do this by supplying a region list with chr1 chr2 ...

This will reduce the complexity of job collection. Alternatively, you could remove the job collection module and merge the phased VCF files outside of Monopogen: just run all the sub-jobs listed in joblst (lines 123-127) https://github.com/KChen-lab/Monopogen/blob/main/src/Monopogen.py
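Running the sub-jobs outside of Monopogen could look roughly like the sketch below. It is an illustration only, not the code at the lines linked above: it assumes each sub-job can be expressed as one shell command string, and the names run_job, run_jobs and failed_jobs are hypothetical. Every job returns an exit code (or None on timeout), so a stalled or failed region is reported instead of leaving the parent waiting.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple


def run_job(cmd: str, timeout: Optional[float] = None) -> Optional[int]:
    """Run one shell command; return its exit code, or None if it timed out."""
    try:
        return subprocess.run(cmd, shell=True, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # The shell is killed on timeout; processes it spawned may need
        # separate cleanup (not handled in this sketch).
        return None


def run_jobs(
    cmds: List[str], threads: int = 8, timeout: Optional[float] = None
) -> List[Tuple[str, Optional[int]]]:
    """Run commands with at most `threads` in parallel; return (command, exit code) pairs."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        codes = list(pool.map(lambda c: run_job(c, timeout), cmds))
    return list(zip(cmds, codes))


def failed_jobs(results: List[Tuple[str, Optional[int]]]) -> List[str]:
    """Commands that exited non-zero or timed out."""
    return [cmd for cmd, code in results if code != 0]
```

Threads are sufficient here because the work happens in the child processes; the Python side only waits. If failed_jobs returns a non-empty list, exiting with a non-zero status would let a workflow manager such as Nextflow mark the task as failed rather than hang.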

MetteBoge commented 6 months ago

My region list is the one from your resources:

chr1,1,50000001
chr1,50000001,100000001
chr1,100000001,150000001
chr1,150000001,200000001
chr1,200000001,248956422
chr2,1,50000001
chr2,50000001,100000001
chr2,100000001,150000001
chr2,150000001,200000001
chr2,200000001,242193529
chr3,1,50000001
chr3,50000001,100000001
chr3,100000001,150000001
chr3,150000001,198295559
chr4,1,50000001
chr4,50000001,100000001
chr4,100000001,150000001
chr4,150000001,190214555
chr5,1,50000001
chr5,50000001,100000001
chr5,100000001,150000001
chr5,150000001,181538259
chr6,1,50000001
...

Do you suggest I just run it as chr1 chr2 chr3, without the subdivision into chromosome segments? As of now, I just merge the 62 phased VCF files with bcftools concat, which seems to work fine.
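For reference, the merge step can be sketched as follows. bcftools concat expects its inputs in genomic order, and a plain alphabetical sort would place chr10 before chr2, so the sketch sorts by chromosome number and start position before building the command. The file-name pattern is taken from the listing in this thread; the function names and the output path are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple


def genomic_key(path: Path) -> Tuple[int, str, int]:
    """Sort key: chromosome number, then start position (0 for whole chromosomes)."""
    m = re.match(r"chr(\w+?)(?::(\d+)-\d+)?\.phased\.vcf\.gz$", path.name)
    if m is None:
        raise ValueError(f"unexpected file name: {path.name}")
    chrom, start = m.group(1), int(m.group(2) or 0)
    # Numeric chromosomes sort by number; others (e.g. X) sort after them by name.
    return (int(chrom), "", start) if chrom.isdigit() else (10**6, chrom, start)


def concat_command(germline_dir: str, output: str) -> List[str]:
    """Build a `bcftools concat` command over all phased VCFs in genomic order."""
    files = sorted(Path(germline_dir).glob("*.phased.vcf.gz"), key=genomic_key)
    return ["bcftools", "concat", "-O", "z", "-o", output] + [str(f) for f in files]
```

The returned list can be passed to subprocess.run(cmd, check=True). Note that adjacent chunks in the region list share their boundary coordinate (for example 1-50000001 and 50000001-100000001), so it may be worth checking the merged file for duplicated records at chunk edges.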

Thank you for looking into it !

jinzhuangdou commented 6 months ago

Yes, you can run imputation/phasing on whole chromosomes, which will further increase genotyping accuracy since the marker panel is very sparse.
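A whole-chromosome region list can be derived from the chunked one with a few lines of Python. This sketch assumes that a line containing only the chromosome name selects the whole chromosome, which matches the chr1 chr2 ... suggestion above and the chr19 to chr22 output names in the listing. The function name is hypothetical.

```python
from typing import List


def whole_chromosome_regions(lines: List[str]) -> List[str]:
    """Collapse 'chr,start,end' lines to one bare chromosome name each, keeping order."""
    seen = set()
    chroms = []
    for line in lines:
        chrom = line.strip().split(",")[0]
        if chrom and chrom not in seen:
            seen.add(chrom)
            chroms.append(chrom)
    return chroms
```

Writing "\n".join(whole_chromosome_regions(open("GRCh38.region.lst").read().splitlines())) to a new file gives one line per chromosome, which can then be passed to germline with -r.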

MetteBoge commented 6 months ago

Thanks, I will try that

MetteBoge commented 6 months ago

@jinzhuangdou Hi again, I ran it again on different samples. I found that I am in fact missing a .phased.log file (and the corresponding phased VCF file), and it seems to be for chr16 in all of the samples I checked (a couple). Could it be due to a flaw in the imputation panel file? From a quick look, it is about the same size as the imputation panel file for chr17. I hope this gives you a clue about what is going on. Best regards