PacificBiosciences / HiFiCNV

Copy number variant caller and depth visualization utility for PacBio HiFi reads
BSD 3-Clause Clear License

Error: No usable bins found for haploid depth estimation #25

Closed fellen31 closed 5 months ago

fellen31 commented 6 months ago

Hi,

I'm trying to run HiFiCNV (0.1.7-70e9988) on a small subset of data, which would be great for testing. I've tried using --exclude to exclude all regions that are empty in the BAM file, but I still get this error:

thread 'main' panicked at 'No usable bins found for haploid depth estimation.', src/copy_number_segmentation.rs:93:5

The BAM file is generated like this:

samtools view -M -L test_data.bed HG003_PacBio_GRCh38.bam -h -O BAM > HG003.bam

test_data.bed:

chr20   2642593 2668393 NOP56
chrX    67534021        67740619        AR
chr5    70915030        70963942        SMN1
chr5    70039638        70088522        SMN2

Thanks a lot!

hificnv \
      --bam HG003.bam \
      --expected-cn expected_cn.hg38.XX.bed \
      --exclude test_profile.hificnv.exclude_regions.bed \
      --maf HG003.sorted.vcf.gz \
      --ref GRCh38_no_alt_analysis_set.fasta \
      --threads 8 \
      --output-prefix HG003
[2024-03-14][09:18:48][hificnv][INFO] Starting hificnv
[2024-03-14][09:18:48][hificnv][INFO] cmdline: hificnv --bam HG003.bam --expected-cn expected_cn.hg38.XX.bed --exclude test_profile.hificnv.exclude_regions.bed --maf HG003.sorted.vcf.gz --ref GRCh38_no_alt_analysis_set.fasta --threads 8 --output-prefix HG003
[2024-03-14][09:18:48][hificnv][INFO] Running on 8 threads
[2024-03-14][09:18:48][hificnv][INFO] Reading reference genome from file 'GRCh38_no_alt_analysis_set.fasta'
[2024-03-14][09:18:59][hificnv][INFO] Reading excluded regions from file 'test_profile.hificnv.exclude_regions.bed'
[2024-03-14][09:18:59][hificnv][INFO] Reading expected CN regions from file 'expected_cn.hg38.XX.bed'
[2024-03-14][09:18:59][hificnv][INFO] Processing alignment file 'HG003.bam'
[2024-03-14][09:22:09][hificnv][INFO] Writing depth track to bigwig file: 'HG003.Sample0.depth.bw'
[2024-03-14][09:22:09][hificnv][INFO] Scanning minor allele frequency data from file 'HG003.sorted.vcf.gz'
[2024-03-14][09:22:09][hificnv][INFO] Writing bigwig maf track to file: 'HG003.HG003.maf.bw'
[2024-03-14][09:22:09][hificnv][INFO] Segmenting copy number
thread 'main' panicked at 'No usable bins found for haploid depth estimation.', src/copy_number_segmentation.rs:93:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ctsa commented 6 months ago

Hi @fellen31 - The short answer is that the CNV calling method isn't going to work well on narrowly targeted data because it needs enough sequence information to set an expected haploid coverage level.

We can certainly provide a better error message for this case as a first step toward clarifying the issue. A quick workaround should be to expand the regions selected from the bam file, but I can't immediately say how much expansion is needed. We'll see whether we can derive this from the bed file to make a more specific suggestion.
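As a sketch of that expansion workaround (the padding amount here is a guess, since the thread doesn't establish how much is actually needed), the BED intervals could be widened with awk before re-subsetting the BAM:

```shell
# Widen each interval in test_data.bed by a hypothetical 1 Mb of flanking
# sequence on each side, clamping the start coordinate at 0, then
# re-subset the BAM with the padded BED. The pad value is illustrative.
pad=1000000
awk -v OFS='\t' -v pad="$pad" \
    '{ s = $2 - pad; if (s < 0) s = 0; print $1, s, $3 + pad, $4 }' \
    test_data.bed > test_data.padded.bed
samtools view -M -L test_data.padded.bed -h -O BAM HG003_PacBio_GRCh38.bam > HG003.padded.bam
```

Note that `samtools view -M -L` merges overlapping intervals, so heavily padded regions that collide are handled gracefully.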

fellen31 commented 6 months ago

Thanks for the quick reply. Is the expected haploid coverage estimated from the whole genome, rather than just from the regions not excluded with --exclude?

A warning instead of a panic would be great, but I also understand if you need a hard cutoff so as not to produce inaccurate results. Another workaround would be to extract the regions from the reference, turn each region into a pseudo-chromosome, align the reads to it, and then run HiFiCNV against that, which will happily give me some (inaccurate) calls.

If you have any ideas on what a minimal test dataset could look like, that would be great.

ctsa commented 6 months ago

I tried to reproduce this issue: without any exclude file there are plenty of bins available for haploid depth estimation (whether the estimate is accurate in this context is another matter), so in your case it seems likely that your exclude file is filtering out the regions in test_data.bed. Do you think this could be happening?

Haploid coverage is estimated from the whole non-excluded genome, but with a zero-depth exclusion added to make the process more stable for cases like this. We include a bit more detail on it in methods:

https://github.com/PacificBiosciences/HiFiCNV/blob/main/docs/methods.md#segmentation
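To illustrate why narrowly targeted data trips this check, here is a toy sketch (not HiFiCNV's actual estimator, and the 4-column depth-bin file format is hypothetical): drop zero-depth bins as the methods doc describes, then estimate haploid depth from whatever bins remain, failing when none do:

```shell
# Toy illustration, not HiFiCNV's real code: read hypothetical
# "chrom start end depth" bins, drop zero-depth bins, and estimate
# haploid depth as half the mean depth of the remaining bins.
# With a tiny targeted subset, every bin can end up excluded, which
# is the analogue of the "No usable bins" panic in the log above.
awk '$4 > 0 { sum += $4; n++ }
     END {
       if (n == 0) {
         print "No usable bins for haploid depth estimation" > "/dev/stderr"
         exit 1
       }
       printf "%.1f\n", sum / (2 * n)
     }' depth_bins.bed
```

The real implementation uses a more robust procedure than a simple mean (see the methods link above), but the failure mode is the same: once excluded and zero-depth bins are removed, a small targeted region can leave nothing to estimate from.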

fellen31 commented 6 months ago

Thanks for your help. Good to hear that it seems to work for you. Even without the exclude file, though, I can't get it to work with my data. Would it be possible for you to share the BAM file that is working? Or the number of reads/coverage you have, in case that could be the issue?

ctsa commented 6 months ago

Can you clarify -- do you still see the error when you don't use any exclude file with the HG003 bam file you've described?

fellen31 commented 5 months ago

Hi, sorry for the late reply. I realised it was actually HG002, but yes, I still saw the error. I solved it by using a smaller reference genome. Feel free to close the issue.

ctsa commented 5 months ago

Okay, good to hear. From the conversation above, it sounds like your regional selection from the HG003 bam should, in theory, run on HiFiCNV even without a smaller genome. If you have the result without the exclude file, or are able to share the bam file, we'd be able to take a look and help; just open a new issue in that case.