KChen-lab / Monopogen

SNV calling from single cell sequencing
GNU General Public License v3.0
71 stars 17 forks source link

samtools mpileup #9

Open joonan30 opened 1 year ago

joonan30 commented 1 year ago

Hi

Thanks for the tool. It's really nice. One thing I noted is that samtools in your code use old version and mpileup -t and -v is deprecated in the latest version (v 1.18). I am currently using 1.13 binary on my M1 mac, which seems working expectedly but the run with 1.18 is not working.

Are you updating this with bcftools mpileup in future?

[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
jinzhuangdou commented 1 year ago

Currently monpogen only supports the tools embedded in the app folder. This is because different version of bcftools/samtools alway lead to different output format. This makes us hard to manage the downstream analysis.

swvanderlaan commented 6 months ago

Hi,

@jinzhuangdou

But... aren't these versions heavily dependent on the specific operating system too? I am also struggling to install and run the tool on my Mac (macOS Sonoma 14.4, MacBook Pro 2021, M1 Pro).

What is the expected output format after the preProfess step? Those are just regular .bam and .bai files, or not?

And even if the above, using a newer version, is not possible: what should be the content of a conda-environment? Simply installing the suggestion version listed in the readme doesn't help. For instance, the location of libdeflate.so cannot be found - and the version in the apps-folder doesn't work with this particular version of bcftools presumably because it was build for a different OS.

swvanderlaan commented 6 months ago

Hi @jinzhuangdou ,

Because of the above, I tinkered with the original code. You did supply an example input file, but you didn't provide the expected output-files. Do you have these so that I can compare your original example output with mine?

The code in Monopogen.py was:

# ORIGINAL COMMANDS with OLD version of samtools
cmd1 = samtools + " mpileup -b " + bam_filter + " -f "  + args.reference  + " -r " +  jobid + " -q 20 -Q 20 -t DP -d 10000000 -v "
cmd1 = cmd1 + " | " + bcftools + " view " + " | "  + bcftools  + " norm -m-both -f " + args.reference 
cmd1 = cmd1 + " | grep -v \"<X>\" | grep -v INDEL | " + bgzip +   " -c > " + args.out + "/germline/" +  jobid + ".gl.vcf.gz" 

This is the updated code:

# NEW COMMANDS with bcftools
# https://samtools.github.io/bcftools/bcftools.html 
# https://www.biostars.org/p/425139/
# https://www.biostars.org/p/418738/
# mpileup a single region
# -b list of input BAM files
# -f reference sequence
# -r region to include
# -q base quality
# -Q mapping quality
# --annotate FORMAT/DP
# view to filter the output
# norm to normalize the output
# grep to remove unwanted lines
# bgzip to compress the output
# -c to write to stdout
# > to redirect to a file
# this can also be done using bcftools view -Oz -o
cmd1 = bcftools + " mpileup -b " + bam_filter + " -f "  + args.reference  + " -r " +  jobid + " -q 20 -Q 20 --annotate FORMAT/DP "
cmd1 = cmd1 + " | " + bcftools + " view " + " | "  + bcftools  + " norm -m-both -f " + args.reference
# bgzip version; works, but I believe the below command is better
# cmd1 = cmd1 + " | grep -v \"<X>\" | grep -v INDEL | " + bgzip +   " -c > " + args.out + "/germline/" +  jobid + ".gl.vcf.gz" 
# bcftools version
cmd1 = cmd1 + " | grep -v \"<X>\" | grep -v INDEL | " + bcftools +   " view -Oz -o " + args.out + "/germline/" +  jobid + ".gl.vcf.gz" 

Is this new code what you originally intended to do in this particular step? You'll note that I dropped -d 10000000 because bcftools indicated it may represent a huge memory hog.

Also attached is the output I have with the new command.

chr20.gl.vcf.gz

swvanderlaan commented 6 months ago

Hi @jinzhuangdou ,

The next step - imputation with beagle - yields a file which looks different from what is mentioned in the readme.md file. I don't know if that (the readme) is old, or whether my new chr20.gl.vcf.gz is wrong. In either case, here's the output I get.

This file is actually chr20.germline.vcf but I zipped it (using macOS) to be able to share it.

chr20.germline.vcf.zip

These are the other files.

chr20.gp.log chr20.gp.vcf.gz chr20.phased.log chr20.phased.vcf.gz

jinzhuangdou commented 6 months ago

Hello swvanderlaan, Sorry that we forgot to update the test dataset since we removed 0/0 locus for single sample calling. Your gp.vcf.gz is similar with what I have. The test dataset has two samples and some loci were mistakenly removed in the phasing step. This issue has been fixed and attached is what I have got for the test data (germline calling). It is similar with yours. chr20.phased.vcf.gz chr20.gl.vcf.gz chr20.gp.vcf.gz

swvanderlaan commented 6 months ago

This is great @jinzhuangdou . This means the approach I took with respect to samtools, vcftools, and bcftools may be robust and thus Monopogen.py and germline.py could be adjusted to reflect that. In other words, Monopogen could work in the context of brew which would make it more flexible and generic - I think.

But... can you confirm that the way I changed the code here, is correct? Is this what you intended to do with each step?

The things I am uncertain about are the following. In the original code it said -t DP -d 10000000 -v.

jinzhuangdou commented 6 months ago

Hello Swvanderlaan,

I think your modifications largely reproduce what I intend to do. -t DP was set to include the sequencing depth for output file. --annotate FORMAT/DP should perform similar role in the new bcftools version. -d was set for SNV calling with the maximal sequencing depth. Unlike SNV calling from WGS/WES, sequencing depth is not even for scRNA-seq. For example, sequencing reads can reach >1000 for house-keeping genes if you have around 10K cells. I set an infinite value to make sure all reads included. -v generate genotype likelihoods in VCF format.

Screenshot 2024-02-28 at 4 39 06 PM

envest commented 5 months ago

Thank you @jinzhuangdou + team for producing this tool! And thank you @swvanderlaan for your helpful edits (#42)!

One thing I noticed when using @swvanderlaan 's updated code that allows for flexible version of samtools and bcftools is that some INFO tags in the VCF output may differ slightly from @jinzhuangdou 's expected outputs. This is important in the somatic module because somatic.py expects an exact ID for each INFO field.

@jinzhuangdou 's version (using samtools) includes these fields in chr20.gl.vcf.gz:

##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">

and @swvanderlaan 's version (using bcftools) includes these fields:

##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">

The somatic.py script expects INFO tags RPB, MQB, BQB, and MQSB (lines 106-109).

It's unclear to me when this difference in tags was introduced. Interestingly, bcftools mpileup documentation (version 1.17) indicates that the output option -U, mwu-u will revert the new tags (with Z) to the previous format (without Z). However, that option was not available in my version of bcftools (version 1.19 using htslib 1.19.1).

I was able to resolve the issue by editing the expected tag names in somatic.py at lines 106-109 and 118-119 (@swvanderlaan 's version) to include a Z. I can imagine a flexible specification that could handle tag names with or without Z, which would match the version-agnostic spirit of @swvanderlaan 's code changes. The tests to calculate each metric are different (RPB is different from RPBZ, etc.) but they are supposed to approximate the same thing (*Z is based on a Z-score).

Again, thanks to all for your effort making this tool great!