PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
62 stars 5 forks source link

Is CpG status predicted at insertion variants? #40

Open elcortegano opened 1 year ago

elcortegano commented 1 year ago

Following a recent issue open in the IGV site (https://github.com/igvteam/igv/issues/1303), we wonder if aligned_bam_to_cpg_scores.py would call 5mC methylated CpG sites at insertions (i.e. at sites that are not in the reference).

I run primrose in my data on the CCS reads, and then 5mC sites were called from a pbmm2 aligned BAM similarly as:

python3 aligned_bam_to_cpg_scores.py \
    -b sample.5mc.pbmm2.bam \
    -f reference.fa \
    -o founders \
    -p model \
    -c 5 \
    -d pileup_calling_model/ \
    -t 16 
ctsa commented 1 year ago

Thanks @elcortegano , I just commented on the IGV ticket -- in summary the HiFi reads always contain methylation information in insertions if they've had primrose 5mCpG calling applied to them (this process occurs on unaligned reads). For the pileup tools, they do not provide an output for insertions today. In the default denovo mode for aligned_bam_to_cpg_tools, we could provide a summary for insertions in theory, but there are a few practical barriers, the first being how to communicate these results in a standardized file format. Neither bed nor bigwig output would support this. If you have an output format and corresponding viewer in mind, we could look into supporting it.

elcortegano commented 1 year ago

Thanks for all the clarifications @ctsa ! specially for the input on the 5mCpG calls in the BAMs.

Regarding the issue here with the pileup methods, I understand the difficulties in adding insertions data. The only standard format I could think of for adding that data would look like a VCF, with the different modification probabilities in the FORMAT field (?).

I think that could be useful. I'll admit that for my personal uses right now it would be fine having IGV visualization for 5mC sites. However, relying on IGV visualization is not scalable, so I think other users might benefit of a feature like this in the future.

ctsa commented 1 year ago

Okay that sounds good. I'll be in touch with Jim if we can help with the IGV visualization at the read level. For the pileup we'll look out for a way to do this, perhaps VCF or another format will make this doable in future.

leon945945 commented 1 year ago

same requirement as @elcortegano, HiFi data is beneficial for insertion identification and the 5mC methylation status may be highly related with these insertions, the support for 5mC of insertion sequence is desired.