Open ywzhang071394 opened 2 weeks ago
Hi, Thanks for reporting it!
Can you share the output log of Wakhan? is it when processing chrY? In this case, can you comment this line and rerun it?
Also, could you try without chrY in --contigs
params?
Hi,
Thank you for your prompt response! I followed your suggestion of removing chrY and it works well now. I got three output directories with the name comprising ploidy, purity and confidence. However, one of them is _1.86_1.0nan. I was confused by the "nan" value. Could you help to interpret it?
Below is the error log when chrY was included. just for your reference.
Parsing reads from tumor_haplotagSV.bam
Parsed 7778187 segments
INFO:root:Computing coverage histogram
INFO:root:Computing coverage for bins
INFO:root:Writing coverage for bins
INFO:root:Parsing phaseblocks information
INFO:root:bcftools -> Filtering out hetrozygous and phased SNPs and generating a new VCF
INFO:root:bcftools -> Query for phasesets and GT, DP, VAF feilds by creating a CSV file
INFO:root:Computing coverage for phaseblocks
INFO:root:Writing coverage for phaseblocks
INFO:root:Loading coverage (bins) and coverage (phaseblocks) files...
INFO:root:bcftools -> Query for het SNPs and creating a /storage1/fs1/data_phasing/deepvariant.phased.vcf_het_snps.csv CSV file
INFO:root:SNPs frequency -> CSV to dataframe conversion
INFO:root:SNPs frequency -> Comuting het SNPs frequency from tumor BAM
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
INFO:root:bcftools -> Query for phasesets and GT, DP, VAF feilds by creating a CSV file
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr1
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr2
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr3
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr4
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr5
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr6
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr7
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr8
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr9
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr10
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr11
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr12
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr13
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr14
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr15
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr16
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr17
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr18
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr19
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr20
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr21
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr22
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chrX
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chrY
Traceback (most recent call last):
File "main.py", line 396, in <module>
main()
File "main.py", line 218, in main
main_process() #hapcorrect
File "/storage1/fs1/software/Wakhan/src/hapcorrect/src/main_hapcorrect.py", line 297, in main_process
haplotype_1_values_phasesets, haplotype_2_values_phasesets = snps_frequencies_chrom_mean_phasesets(df_snps_frequencies, ref_start_values_phasesets, ref_end_values_phasesets, chrom, args)
File "/storage1/fs1/software/Wakhan/src/hapcorrect/src/process_vcf.py", line 268, in snps_frequencies_chrom_mean_phasesets
sub_list = haplotype_1_coverage[haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-i))):haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-j)))]
ValueError: min() arg is an empty sequence
Thanks!
nan was due to mean was being applied to some null lists values. Can you just use the latest github repo? Also, please try using chrY now.
Hi,
I tried your latest verion. All procedures work well. However, the result is somehow wired. As shown in the below figure, chrY has no copy but our sample is male.
Some other issues are:
Thanks!
Also, did the orange and red bars stand for HP1 in different clones? If so, I suppose that every chromosome region should have both bars to illustrate the clonal difference.
Hi,
Can you check if your normal phased VCF and tumor BAM has chrY data? Also can you enable the bins in the plot to see if no bin is included in chrY.
Wakhan produces two CN *.html outputs, only integers CN states and subclonal (light blue and orange) CN states. Can you check the other output without subclonal states?
User can input now purity/ploidy ranges in the latest GitHub update, also regarding chrY, maybe most of the reads reads are unphased and not being included in HP1/HP2 coverage
Hi,
Thank you for the follow-up!
we have chrY data as shown in the below phasing output.
The clear bed output refers to the aboslute CN numbers. What I am expecting is the inclusion of log ratio of CN and BAF, which are not seen in the output.
what is the "bins" you mean? is this "--variable-size-bins"?
Bin (default 50k bps) means these small dots in the plots.
Will update log ratio of CN and BAF in future release.
Another issue is how to understand the output *copynumbers_subclonal_segments.bed.
For example, in the figure above, how should we interpret the values in the "copynumber_state" column? My understanding is that these represent absolute copy number (CN) values. However, I’m curious why some of the values are so large.
In subclonal beds, if a copynumber states is equal to it's coverage, it means this segment is marked as subclonal and not absolute (no integer copynumber state is assigned to this segment, confidence score is also low for such segments).
Got you! Thanks!
Hi,
Below is the screenshot of the chrY output. It shows that the read coverages on chrY were low. However, as per the figure I posted above, most of the chrY segments should have sufficient coverages. Do you know why this happen? Is this still a bug?
Hi, can you show breakpoints plots (<>_genome_copynumbers_breakpoints.html
) by enabling CN and coverage? This can give better idea why coverage 25 in first segment (0-10100001) is not being detected as CN=1?
I think the first segment is correctly showned with 1 copy number in the figure (see the red arrow in the below figure) but not correct in the bedoutput.
I also attached the coverage plot of chrY here, which shows that the chrY read coverage is very shallow. However, according to the results from CNVkitcnvkit.txt , the coverage is good enough.
my code is
python main.py --threads 10 --reference path2ref/chm13v2.0.fa --target-bam tumor_haplotagSV.bam \
--out-dir-plots wakhan_pacbio_new2/ \
--normal-phased-vcf 03_GermlineSNV/deepvariant.phased.vcf.gz \
--contigs chr1-22,X,Y --pdf-enable --loh-enable --copynumbers-subclonal-enable \
--genome-name HT522_Tumor --tumor-vcf deepvariant.vcf.gz \
--breakpoints severus_all.vcf --centromere T2T.centromere.bed
okay, this is due to only one are few breakpoints positions at changing coverage locations in chrY, we are working to enable another algo for such cases, meanwhile can you try to run adding param --cpd-internal-segments
ok, i will try, but what file should be used for this param. Also, i am trying to add the bins (points) into the figures, which params should be set?
Or how can I output the file of bins?
Bins are already in the plots, just select/enable the HP-1/HP-2 (dots) in the plot legend. If you are using latest github code these bins files should be in coverage_data
dir in output with name coverage.csv
. In previous updates, I was deleting them.
Just include --cpd-internal-segments
, without any input file.
I tested the latest version and found that --cpd-internal-segments
requires a "path" otherwise the tool cannot run. Also, the latest version still cannot generate a correct results of ChrY. Looking foward to your update! Thanks!
I have updated --cpd-internal-segments
few hours before, please update and rerun it.
Hi, What you seeing coverage output (coverage.csv
) is basically before phase correction, I have updated the github code to output coverage after phase-correction as well, it will be stored in same coverage_data
dir with phase_corrected_coverage.csv
file.
I also modified the plot.py script (line 236) and exported the adjusted values. That's why I deleted the comment. Thank you so much! --cpd-internal-segments was also tested, but chrY was still incorrect as shown in the below figure.
Also, how can I convert the coverage to copy number values in the figure? I am wondering how to find the correspondance between the left coverage scale and the right CN scale in the above figure. Especially for subclonal data, we want to plot the subclonal CNs but only coverage data was provided. For example: chr start end coverage copynumber_state confidence haplotype chr2 242250001 242450000 10.57 10 0.5162451 1
I am trying to convert the copynumber_state 10 to a copy number and plot it together with other clonal copy numbers.
I will add an extra bed file for that copy number and coverage correspondence.
What do you suggest, how to represent subclonals states in bed, instead of their coverage value?
Also, if possible can you provide chrY data (coverage.csv and coverage_ps.csv and pileup_SNPs.csv) ? I want to debug.
Hi,
I ran Wakhan on all my samples and would like to express my big concern about your CN calculation for the p-arms of the acrocentric chromosomes 13, 14, 15, 21 and 22. CN loss of these regions were reported in almost all of my samples by Wakhan, but this result is highly unlikely. We know these regions always show low coverages compared to other regions, largely due to centromere proximity and low sequence complexity. However, both tumor and normal samples showed similarly shallow coverages. This raises a challenge in classifying these regions as CNV regions.
Right, Is it possible to share those problematic data CN *_genome_copynumbers_breakpoints.htmls
and *_genome_copynumbers_breakpoints_subclonal.htmls
? I just want to see what I can do for any improvement in CN model for such cases.
If you want to share them privately here is my email. Thanks
I have sent one of my problematic samples to you. Our data is protected carefully, so I cannot share more samples with you.
No issue. I also need normal and tumor VCF files (you used in command line) for that sample.
Hi,
Thank you for the nice tool! I am recently using severus for SV calling and was attracted by Wakhan. However, when I tried to run it on my Pacbio data, an error was reported. The error is:
Traceback (most recent call last): File "main.py", line 396, in <module> main() File "main.py", line 218, in main main_process() #hapcorrect File "/storage1/fs1/Wakhan/src/hapcorrect/src/main_hapcorrect.py", line 297, in main_process haplotype_1_values_phasesets, haplotype_2_values_phasesets = snps_frequencies_chrom_mean_phasesets(df_snps_frequencies, ref_start_values_phasesets, ref_end_values_phasesets, chrom, args) File "/storage1/fs1/Wakhan/src/hapcorrect/src/process_vcf.py", line 268, in snps_frequencies_chrom_mean_phasesets sub_list = haplotype_1_coverage[haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-i))):haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-j)))] ValueError: min() arg is an empty sequence
My code is:
python main.py --threads 10 --reference chm13v2.0.fa --target-bam tumor_haplotagSV.bam --out-dir-plots wakhan_pacbio3/ --normal-phased-vcf deepvariant.hiphased.vcf.gz --smoothing-enable --contigs chr1-22,X,Y --pdf-enable --loh-enable --genome-name HT522_Tumor --tumor-vcf deepvariant.vcf.gz --breakpoints severus_all.vcf
I have no idea about this error. Could you help to check? Thank you so much!