fritzsedlazeck / Spectre

Copy number caller for long read data including SNV utilization
MIT License
52 stars 3 forks source link

Skips all het SNV sites because deepvariant uses Number=A for the VAF format field. #19

Open JakeHagen opened 5 months ago

JakeHagen commented 5 months ago

Hello and thank you for releasing this tool.

The code below (which I am assuming is there to filter multi-allelic sites) causes issues when using Deepvariant's VCFs. Deepvariant uses Number=A for the format VAF field which causes pysam to always parse it as a tuple, even if it is not a multi-allelic site.

if type(snv[self.af_tag]) == tuple:
    continue

I was able to get spectre to run as expected by manually changing the VCF header to Number=1 for VAF

I wonder if you would be open to figuring out another way to filter multi-allelic sites. Maybe something like

if len(snv_record.alts) > 1:
    continue

Thanks Jake

lfpaulin commented 5 months ago

Hello Jake, thanks for pointing this out. Is it possible that you could share use one or two examples of SNP entries with that format so that we could catch those cases too. We only tested with bi-allelic sites. It is also worth noticing that multi allelic sites may not work during LoH calling, or at least the current version will bot be able to work. Best Luis

JakeHagen commented 5 months ago

Hi Luis, thank you for your response

Attached are two test vcfs. They have only one identical variant. The only difference is the FORMAT VCF header line ie: ##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions."> vs ##FORMAT=<ID=VAF,Number=1,Type=Float,Description="Variant allele fractions.">

Please note that the only variant in the vcfs is a bi-allelic variant, the variants I was feeding into spectre were already filtered to bi-allelic sites.

Also below is the error I get when using the non-modified vcf. Using the modified version allows spectre to run to completion.

Matplotlib created a temporary cache directory at /tmp/matplotlib-7lrdbiqb because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
spectre::2024-04-19 14:44:00,654::INFO::spectre.spectre>  Spectre input: CNVCaller --coverage ../mosdepth/test.regions.bed.gz --sample-id test -r ./GRCh38.fasta --output-dir ./test --blacklist ./grch38_blacklist_spectre.bed --metadata ./metadata.mdr --snfj ../snf2json/test.snfj --snv dv16.vcf.gz --only-chr chr1
spectre::2024-04-19 14:44:00,655::INFO::spectre.spectre>  Spectre version: 0.2.0
spectre::2024-04-19 14:44:00,655::INFO::spectre.spectre>  Starting spectre
spectre::2024-04-19 14:44:00,655::INFO::spectre.spectre>  Evaluate if a new .mdr file needs to be created
spectre::2024-04-19 14:44:00,655::INFO::spectre.spectre>  Using existing metadata file ./metadata.mdr
spectre::2024-04-19 14:44:00,656::INFO::spectre.spectreCNV>  Spectre calculating for: ../mosdepth/test.regions.bed.gz
spectre::2024-04-19 14:44:00,656::INFO::spectre.spectreCNV>  Analysing CN neutral state from SNV data
spectre::2024-04-19 14:44:00,658::DEBUG::spectre.analysis.snv_analysis>  Starting SNV Copy number state compute ...
spectre::2024-04-19 14:44:00,658::DEBUG::spectre.analysis.snv_analysis>  Het sites count per chromosome
spectre::2024-04-19 14:44:00,658::DEBUG::spectre.analysis.snv_analysis>  Use chromsomes based on proportion of het sites
spectre::2024-04-19 14:44:00,658::DEBUG::spectre.analysis.snv_analysis>  n het sites used for normalization = 0
Traceback (most recent call last):
  File "/usr/local/bin/spectre", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/site-packages/spectre/spectre.py", line 637, in main
    spectre_run.spectre_exe()
  File "/usr/local/lib/python3.12/site-packages/spectre/spectre.py", line 356, in spectre_exe
    spectre_main.cnv_call()
  File "/usr/local/lib/python3.12/site-packages/spectre/spectreCNV.py", line 71, in cnv_call
    self.snv_analysis.snv_copy_number_state()
  File "/usr/local/lib/python3.12/site-packages/spectre/analysis/snv_analysis.py", line 231, in snv_copy_number_state
    self.snv_multimodal_detect()
  File "/usr/local/lib/python3.12/site-packages/spectre/analysis/snv_analysis.py", line 104, in snv_multimodal_detect
    self.het_clean() if np.max(self.perfect_het_depth) > 10*np.nanmean(self.perfect_het_depth) else None
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/numpy/core/fromnumeric.py", line 2810, in max
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/numpy/core/fromnumeric.py", line 88, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: zero-size array to reduction operation maximum which has no identity

Thanks Jake

dv16_mod.vcf.gz dv16.vcf.gz

lfpaulin commented 5 months ago

Hello Jake, thanks for the detailed information, I will check it out and reach back asap.

Luis