PopicLab / cue

Deep learning framework for SV calling and genotyping
MIT License

0 Selected intervals and empty VCF output when running Cue on PacBio CLR #14

Closed: LYC-vio closed this issue 1 year ago

LYC-vio commented 1 year ago

Hi,

Thank you for developing this excellent tool. I recently tried to use Cue to call SVs on a long-read BAM file (NA24385_Pacbio_CLR_SRX7668835, aligned to hg19 using minimap2), but got an empty VCF output with no error reported.

In the logging output there were lines saying that no intervals were selected:

...
INFO:root:Number of bins: 108262
INFO:root:Selected 0 intervals
INFO:root:Selected 0 interval pairs out of 0 pairs
INFO:root:Processed 238694 reads
INFO:root:Generating SV predictions for chr22
INFO:root:Number of target interval pairs: 0
INFO:root:Selected 0 intervals
INFO:root:Selected 0 interval pairs out of 0 pairs
INFO:root:Processed 310571 reads
INFO:root:Generating SV predictions for chr20
INFO:root:Number of target interval pairs: 0
...

However, I have no idea what might be causing this issue.
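
To see at a glance which chromosomes came back empty, the log lines quoted above can be tallied with a short script. This is only a minimal sketch: the function names `target_pairs_per_chrom` and `empty_chroms` are hypothetical helpers (not part of Cue), and it assumes that each "Number of target interval pairs" line follows the "Generating SV predictions for ..." line of the same chromosome, as in the excerpt. If log lines from several workers interleave, that pairing may not hold.

```python
import re

# Patterns copied from the log lines quoted in this issue.
CHROM_RE = re.compile(r"Generating SV predictions for (\S+)")
PAIRS_RE = re.compile(r"Number of target interval pairs: (\d+)")


def target_pairs_per_chrom(log_lines):
    """Map each chromosome to the target-interval-pair count logged after it.

    Assumption: the count line directly belongs to the most recent
    'Generating SV predictions for <chrom>' line.
    """
    counts = {}
    current = None
    for line in log_lines:
        chrom_match = CHROM_RE.search(line)
        if chrom_match:
            current = chrom_match.group(1)
            continue
        pairs_match = PAIRS_RE.search(line)
        if pairs_match and current is not None:
            counts[current] = int(pairs_match.group(1))
            current = None
    return counts


def empty_chroms(counts):
    """Return the chromosomes for which no target interval pairs were reported."""
    return [chrom for chrom, n in counts.items() if n == 0]


if __name__ == "__main__":
    excerpt = [
        "INFO:root:Selected 0 intervals",
        "INFO:root:Selected 0 interval pairs out of 0 pairs",
        "INFO:root:Processed 238694 reads",
        "INFO:root:Generating SV predictions for chr22",
        "INFO:root:Number of target interval pairs: 0",
    ]
    print(empty_chroms(target_pairs_per_chrom(excerpt)))
```

If every chromosome shows up in the empty list, the problem is more likely a global setting or an unsupported input type than anything specific to one region.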

I also noticed that you used Cue-long to run on the CLR data in your paper. Does that refer to another version of Cue, or are additional settings required in the YAML configuration for long reads?

Thank you

Best, Yichen

Here's the detailed configuration I used in my run:

*********************************
*  cue (v0.2.2): discovery mode *
*********************************
[INFO]  ========== Model config ==========
    model_path: Softwares/cue/data/models/cue.v2.pt
    gpu_ids: []
    n_jobs_per_gpu: 1
    n_cpus: 20
    report_interval: 100
    batch_size: 16
    logging_level: INFO
    signal_set: SV_SIGNAL_SET.SHORT
    class_set: SV_CLASS_SET.BASIC5ZYG
    num_keypoints: 1
    model_architecture: HG
    image_dim: 256
    sigma: 10
    stride: 4
    heatmap_peak_threshold: 0.4
    pretrained_refinenn_path: None
    config_file: call_model.yaml
    experiment_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus
    devices: [device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu')]
    device: cpu
    log_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/logs/
    report_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/reports/
    log_file: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/logs/main.log
    classes: ['NEG', 'DEL-HOM', 'INV-HOM', 'DUP-HOM', 'DEL-HET', 'INV-HET', 'DUP-HET', 'IDUP-HOM', 'IDUP-HET']
    num_classes: 9
    n_signals: 6
[INFO] ========== Data config =========
    bam: NA24385_Pacbio_CLR_SRX7668835/minimap2_NA24385_Pacbio_CLR_SRX7668835.bam
    fai: refdata-hg19-2.1.0/fasta/genome.fa.fai
    chr_names: None
    logging_level: ERROR
    n_cpus: 1
    min_refine_buffer: 2000
    refine_buffer_frac_size: 5
    refine_pair_dist_frac_size: 2
    refine_bp_kernels: [0, 50, 500]
    refine_min_support: 2
    refine_disable: False
    min_pair_support: 2
    min_pair_distance: 4000
    max_pair_distance: 1000000
    scan_target_intervals: True
    stream: True
    view_mode: False
    store_img: False
    empty_annotation: False
    bins_per_block: 8000
    min_sv_len: 4000
    min_qual_score: 50
    bam_type: BAM_TYPE.SHORT
    signal_set: SV_SIGNAL_SET.SHORT
    signal_set_origin: SHORT
    bed: None
    blacklist_bed: None
    signal_vmax: {'RD': 600, 'RD_LOW': 800, 'RD_CLIPPED': 600, 'SM': 200, 'SR_RP': 600, 'LR': 600, 'LLRR': 100, 'RL': 100, 'LLRR_VS_LR': 1}
    signal_mapq: {'RD': 20, 'RD_LOW': 0, 'RD_CLIPPED': 20, 'SM': 20, 'SR_RP': 0, 'LR': 0, 'LLRR': 1, 'RL': 1, 'LLRR_VS_LR': 1}
    bin_size: 750
    interval_size: 150000
    step_size: 50000
    shift_size: None
    heatmap_dim: 1000
    image_dim: 256
    class_set: SV_CLASS_SET.BASIC5ZYG
    num_keypoints: 1
    bbox_padding: 0
    config_file: call_data.yaml
    dataset_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus
    info_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/info/
    image_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/images/
    annotation_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/annotations/
    annotated_images_dir: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/annotated_images/
    classes: ['NEG', 'DEL-HOM', 'INV-HOM', 'DUP-HOM', 'DEL-HET', 'INV-HET', 'DUP-HET', 'IDUP-HOM', 'IDUP-HET']
    num_classes: 9
    num_signals: 6
    uid: 0000000000
    log_file: Cue/NA24385_Pacbio_CLR_SRX7668835_cpus/info/main.log
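
Note that the dump above reports `bam_type: BAM_TYPE.SHORT`, `signal_set: SV_SIGNAL_SET.SHORT`, and `signal_set_origin: SHORT`, even though the input is a PacBio CLR alignment. A minimal sketch for pulling those settings out of a saved copy of the printed config follows. The helper names `parse_config_dump` and `short_read_settings` are hypothetical and not part of Cue, and the sketch only reads the text printed at startup; it does not assume that any other value for these keys exists or is supported.

```python
def parse_config_dump(text):
    """Parse 'key: value' lines of the printed config into a dict of strings.

    Keys that appear in both the model and data sections (e.g. signal_set)
    are overwritten by the later occurrence.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, section headers, and the banner box.
        if not line or line.startswith("[INFO]") or line.startswith("*"):
            continue
        key, sep, value = line.partition(": ")
        if sep:
            settings[key] = value
    return settings


def short_read_settings(settings):
    """Return the input-type settings whose value names the short-read mode."""
    keys = ("bam_type", "signal_set", "signal_set_origin")
    return {
        key: settings[key]
        for key in keys
        if key in settings and settings[key].endswith("SHORT")
    }


if __name__ == "__main__":
    dump = """\
[INFO] ========== Data config =========
    bam_type: BAM_TYPE.SHORT
    signal_set: SV_SIGNAL_SET.SHORT
    signal_set_origin: SHORT
    bin_size: 750
"""
    for key, value in short_read_settings(parse_config_dump(dump)).items():
        print(f"{key} = {value}")
```

A non-empty result on a long-read BAM only signals a mismatch between the data and the configured input type; it does not by itself indicate how to fix it.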

The BAM file was generated with:

minimap2 -t 30 --MD -Y -L -a -H -x map-pb refdata-hg19-2.1.0/fasta/genome.fa PacBio_CLR_ncbi-SRX7668835/SRR11008518.fastq | samtools sort -o minimap2_NA24385_Pacbio_CLR_SRX7668835.bam

viq854 commented 1 year ago

Hi @LYC-vio, thank you for posting this question.

The Cue framework currently provides an extensively trained model and out-of-the-box support only for short reads. The Cue-long model is a separate, preliminary proof-of-concept model trained to demonstrate that the framework can be extended to another technology. We trained and evaluated it only on limited synthetic data, to show how the framework can be extended to achieve strong performance with different input types, as described in the “Extending Cue” and “Discussion” sections of the manuscript. More information about this benchmark, the model, and how to reproduce it is available in our cue-synth-datasets GCS bucket, and guidelines for extending the framework to custom technologies are available in the "extensions.ipynb" notebook. This model is not yet intended for use on real data: much more extensive training and evaluation is needed to deploy it on real genomes (similar to the short-read strategy described in the paper). We’re working on this now and will release new models and full support for more technologies soon! I’ll add further clarification to the README as well.

Thanks, V

LYC-vio commented 1 year ago

Hi @viq854 ,

Sorry to reopen this issue. Just a quick question: when will Cue support long-read data?

Thank you again for your time and efforts

Best regards Yichen

viq854 commented 1 year ago

Hi @LYC-vio,

Planning to release fully trained models for PacBio sometime by the end of the summer.

Best, V

quanc1989 commented 5 months ago

@viq854 Hi! Wondering how things are going with PacBio support?