KolmogorovLab / Wakhan

Haplotype-specific somatic copy number aberrations/profiling from long reads sequencing data
MIT License
23 stars 0 forks source link

Error report #9

Open ywzhang071394 opened 2 weeks ago

ywzhang071394 commented 2 weeks ago

Hi,

Thank you for the nice tool! I am recently using severus for SV calling and was attracted by Wakhan. However, when I tried to run it on my Pacbio data, an error was reported. The error is: Traceback (most recent call last): File "main.py", line 396, in <module> main() File "main.py", line 218, in main main_process() #hapcorrect File "/storage1/fs1/Wakhan/src/hapcorrect/src/main_hapcorrect.py", line 297, in main_process haplotype_1_values_phasesets, haplotype_2_values_phasesets = snps_frequencies_chrom_mean_phasesets(df_snps_frequencies, ref_start_values_phasesets, ref_end_values_phasesets, chrom, args) File "/storage1/fs1/Wakhan/src/hapcorrect/src/process_vcf.py", line 268, in snps_frequencies_chrom_mean_phasesets sub_list = haplotype_1_coverage[haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-i))):haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-j)))] ValueError: min() arg is an empty sequence

My code is: python main.py --threads 10 --reference chm13v2.0.fa --target-bam tumor_haplotagSV.bam --out-dir-plots wakhan_pacbio3/ --normal-phased-vcf deepvariant.hiphased.vcf.gz --smoothing-enable --contigs chr1-22,X,Y --pdf-enable --loh-enable --genome-name HT522_Tumor --tumor-vcf deepvariant.vcf.gz --breakpoints severus_all.vcf

I have no idea about this error. Could you help to check? Thank you so much!

tahashmi commented 1 week ago

Hi, Thanks for reporting it!

Can you share the output log of Wakhan? is it when processing chrY? In this case, can you comment this line and rerun it?

Also, could you try without chrY in --contigs params?

ywzhang071394 commented 1 week ago

Hi,

Thank you for your prompt response! I followed your suggestion of removing chrY and it works well now. I got three output directories with the name comprising ploidy, purity and confidence. However, one of them is _1.86_1.0nan. I was confused by the "nan" value. Could you help to interpret it?

Below is the error log when chrY was included. just for your reference.


Parsing reads from tumor_haplotagSV.bam
Parsed 7778187 segments
INFO:root:Computing coverage histogram
INFO:root:Computing coverage for bins
INFO:root:Writing coverage for bins
INFO:root:Parsing phaseblocks information
INFO:root:bcftools -> Filtering out hetrozygous and phased SNPs and generating a new VCF
INFO:root:bcftools -> Query for phasesets and GT, DP, VAF feilds by creating a CSV file
INFO:root:Computing coverage for phaseblocks
INFO:root:Writing coverage for phaseblocks
INFO:root:Loading coverage (bins) and coverage (phaseblocks) files...
INFO:root:bcftools -> Query for het SNPs and creating a /storage1/fs1/data_phasing/deepvariant.phased.vcf_het_snps.csv CSV file
INFO:root:SNPs frequency -> CSV to dataframe conversion
INFO:root:SNPs frequency -> Comuting het SNPs frequency from tumor BAM
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
INFO:root:bcftools -> Query for phasesets and GT, DP, VAF feilds by creating a CSV file
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr1
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr2
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr3
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr4
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr5
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr6
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr7
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr8
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr9
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr10
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr11
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr12
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr13
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr14
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr15
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr16
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr17
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr18
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr19
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr20
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr21
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chr22
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chrX
INFO:root:Loading coverage (bins) and coverage (phaseblocks) datasets for chrY
Traceback (most recent call last):
  File "main.py", line 396, in <module>
    main()
  File "main.py", line 218, in main
    main_process() #hapcorrect
  File "/storage1/fs1/software/Wakhan/src/hapcorrect/src/main_hapcorrect.py", line 297, in main_process
    haplotype_1_values_phasesets, haplotype_2_values_phasesets = snps_frequencies_chrom_mean_phasesets(df_snps_frequencies, ref_start_values_phasesets, ref_end_values_phasesets, chrom, args)
  File "/storage1/fs1/software/Wakhan/src/hapcorrect/src/process_vcf.py", line 268, in snps_frequencies_chrom_mean_phasesets
    sub_list = haplotype_1_coverage[haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-i))):haplotype_1_position.index(min(haplotype_1_position, key=lambda x:abs(x-j)))]
ValueError: min() arg is an empty sequence
tahashmi commented 1 week ago

Thanks!

nan was due to mean was being applied to some null lists values. Can you just use the latest github repo? Also, please try using chrY now.

ywzhang071394 commented 1 week ago

Hi,

I tried your latest verion. All procedures work well. However, the result is somehow wired. As shown in the below figure, chrY has no copy but our sample is male. image

Some other issues are:

  1. A clear bed output was provided but no CN change between tumor and control samples (e.g. logR) was included.
  2. Is your default ploidy set to 2? Tumor samples have a high frequency of whole genome duplication. Is that possible to consider this?

Thanks!

ywzhang071394 commented 1 week ago

Also, did the orange and red bars stand for HP1 in different clones? If so, I suppose that every chromosome region should have both bars to illustrate the clonal difference.

tahashmi commented 1 week ago

Hi,

Can you check if your normal phased VCF and tumor BAM has chrY data? Also can you enable the bins in the plot to see if no bin is included in chrY.

  1. What do mean by clear bed output (is CN bed is empty?), I can see CN state 2 in chr5 and LOH in some other chrms?
  2. Default ploidy is 2, but you see different solution folder which are based on different purity/ploidy possibilities. (enabling the bins will help to select the best from them), We are basically, optimizing normal fraction for different values.

Wakhan produces two CN *.html outputs, only integers CN states and subclonal (light blue and orange) CN states. Can you check the other output without subclonal states?

tahashmi commented 1 week ago

User can input now purity/ploidy ranges in the latest GitHub update, also regarding chrY, maybe most of the reads reads are unphased and not being included in HP1/HP2 coverage

ywzhang071394 commented 1 week ago

Hi,

Thank you for the follow-up!

  1. we have chrY data as shown in the below phasing output. image

  2. The clear bed output refers to the aboslute CN numbers. What I am expecting is the inclusion of log ratio of CN and BAF, which are not seen in the output.

  3. what is the "bins" you mean? is this "--variable-size-bins"?

tahashmi commented 1 week ago

Bin (default 50k bps) means these small dots in the plots.

Will update log ratio of CN and BAF in future release.

ywzhang071394 commented 1 week ago

Another issue is how to understand the output *copynumbers_subclonal_segments.bed. image

For example, in the figure above, how should we interpret the values in the "copynumber_state" column? My understanding is that these represent absolute copy number (CN) values. However, I’m curious why some of the values are so large.

tahashmi commented 1 week ago

In subclonal beds, if a copynumber states is equal to it's coverage, it means this segment is marked as subclonal and not absolute (no integer copynumber state is assigned to this segment, confidence score is also low for such segments).

ywzhang071394 commented 1 week ago

Got you! Thanks!

ywzhang071394 commented 1 week ago

Hi,

Below is the screenshot of the chrY output. It shows that the read coverages on chrY were low. However, as per the figure I posted above, most of the chrY segments should have sufficient coverages. Do you know why this happen? Is this still a bug? image

tahashmi commented 1 week ago

Hi, can you show breakpoints plots (<>_genome_copynumbers_breakpoints.html) by enabling CN and coverage? This can give better idea why coverage 25 in first segment (0-10100001) is not being detected as CN=1?

ywzhang071394 commented 6 days ago

I think the first segment is correctly showned with 1 copy number in the figure (see the red arrow in the below figure) but not correct in the bedoutput. Picture1

I also attached the coverage plot of chrY here, which shows that the chrY read coverage is very shallow. However, according to the results from CNVkitcnvkit.txt , the coverage is good enough.

image

my code is

python main.py   --threads 10  --reference path2ref/chm13v2.0.fa   --target-bam tumor_haplotagSV.bam \
    --out-dir-plots wakhan_pacbio_new2/ \
    --normal-phased-vcf 03_GermlineSNV/deepvariant.phased.vcf.gz \
    --contigs chr1-22,X,Y --pdf-enable --loh-enable --copynumbers-subclonal-enable \
    --genome-name HT522_Tumor --tumor-vcf deepvariant.vcf.gz \
    --breakpoints severus_all.vcf --centromere T2T.centromere.bed
tahashmi commented 6 days ago

okay, this is due to only one are few breakpoints positions at changing coverage locations in chrY, we are working to enable another algo for such cases, meanwhile can you try to run adding param --cpd-internal-segments

ywzhang071394 commented 6 days ago

ok, i will try, but what file should be used for this param. Also, i am trying to add the bins (points) into the figures, which params should be set?

ywzhang071394 commented 6 days ago

Or how can I output the file of bins?

tahashmi commented 6 days ago

Bins are already in the plots, just select/enable the HP-1/HP-2 (dots) in the plot legend. If you are using latest github code these bins files should be in coverage_data dir in output with name coverage.csv. In previous updates, I was deleting them.

Just include --cpd-internal-segments, without any input file.

ywzhang071394 commented 5 days ago

I tested the latest version and found that --cpd-internal-segments requires a "path" otherwise the tool cannot run. Also, the latest version still cannot generate a correct results of ChrY. Looking foward to your update! Thanks!

tahashmi commented 5 days ago

I have updated --cpd-internal-segments few hours before, please update and rerun it.

tahashmi commented 5 days ago

Hi, What you seeing coverage output (coverage.csv) is basically before phase correction, I have updated the github code to output coverage after phase-correction as well, it will be stored in same coverage_data dir with phase_corrected_coverage.csv file.

ywzhang071394 commented 5 days ago

I also modified the plot.py script (line 236) and exported the adjusted values. That's why I deleted the comment. Thank you so much! --cpd-internal-segments was also tested, but chrY was still incorrect as shown in the below figure. image

ywzhang071394 commented 4 days ago

Also, how can I convert the coverage to copy number values in the figure? I am wondering how to find the correspondance between the left coverage scale and the right CN scale in the above figure. Especially for subclonal data, we want to plot the subclonal CNs but only coverage data was provided. For example: chr start end coverage copynumber_state confidence haplotype chr2 242250001 242450000 10.57 10 0.5162451 1

I am trying to convert the copynumber_state 10 to a copy number and plot it together with other clonal copy numbers.

tahashmi commented 4 days ago

I will add an extra bed file for that copy number and coverage correspondence.

What do you suggest, how to represent subclonals states in bed, instead of their coverage value?

tahashmi commented 4 days ago

Also, if possible can you provide chrY data (coverage.csv and coverage_ps.csv and pileup_SNPs.csv) ? I want to debug.

ywzhang071394 commented 2 days ago

Hi,

I ran Wakhan on all my samples and would like to express my big concern about your CN calculation for the p-arms of the acrocentric chromosomes 13, 14, 15, 21 and 22. CN loss of these regions were reported in almost all of my samples by Wakhan, but this result is highly unlikely. We know these regions always show low coverages compared to other regions, largely due to centromere proximity and low sequence complexity. However, both tumor and normal samples showed similarly shallow coverages. This raises a challenge in classifying these regions as CNV regions.

tahashmi commented 2 days ago

Right, Is it possible to share those problematic data CN *_genome_copynumbers_breakpoints.htmls and *_genome_copynumbers_breakpoints_subclonal.htmls? I just want to see what I can do for any improvement in CN model for such cases. If you want to share them privately here is my email. Thanks

ywzhang071394 commented 2 days ago

I have sent one of my problematic samples to you. Our data is protected carefully, so I cannot share more samples with you.

tahashmi commented 2 days ago

No issue. I also need normal and tumor VCF files (you used in command line) for that sample.