epi2me-labs / wf-pore-c

Other
35 stars 9 forks source link

Question about Creating Haplotype-Specific Hi-C Links and Methylation Levels #52

Closed jjuhyunkim closed 6 months ago

jjuhyunkim commented 8 months ago

Hi developers,

Thank you for developing this wonderful algorithm.

I have a question regarding how to create haplotype-specific Hi-C links (or any relevant format). I am interested in calling haplotype specific methylation levels, as described in your paper, and haplotype-specific Hi-C link.

I believe that the ./bams/{{ alias }}.cs.bam file contains HP tags in the BAM format, but this information is not present in the ./paired_end/{{ alias }}.ns.bam file. I'm concerned that ./bams/{{ alias }}.cs.bam might differ from other normal Hi-C data, so I'm hesitant to use the poreC data with traditional tools.

I hope my question isn't too trivial.

Thanks.

jjuhyunkim commented 8 months ago

To elaborate further on my question, I aim to generate a plot similar to Extended Data Fig.4f, comparing the methylation and contact map between chrXi and chrXa. I have the following data available:

  1. reference data
  2. PoreC reads(unmapped bam)
  3. phased VCF file.

Could you please provide a more detailed techinical description of the "XCI analysis" part in method from your paper, particularly regarding the phrase "Using the phasing approach, concatemers were assigned to their constituent haplotypes. Hiigh-order contats were melted into pairwise contatcts using GxG."?

Thanks!

LynnLy commented 7 months ago

Hi Juhyun,

You can try to add haplotag information to the mock paired end bam yourself, using whatshap. whatshap haplotag -o {output} {input.vcf} ./paired_end/{{ alias }}.cs.bam --ignore-read-groups -r {input.refgenome} After running this, you should have a mock paired end bam with HP tags, which you can use to split the bam into a separate file for each haplotype.

If whatshap still requires unique identifiers, you can use something like the two scripts here to create and then delete unique tags for the ids.

For the paper, after we split the equivalent of ./bams/{{ alias }}.cs.bam into haplotype1.bam and haplotype2.bam, we used those to create pairs, .cooler, and other files for downstream analysis. Extended Data Fig. 4f was probably plotted in python using .cooler files. I don't think haplotyped pairs/cooler are implemented in the workflow yet, but you can see how to go from bam -> pairs -> cooler -> etc yourself by looking here.

When we wrote the paper, methylation information was not yet encoded in bam files; we used f5c/nanopolish call-methylation and meth-freq on the equivalent of ./bams/{{ alias }}.cs.bam to get a table of the methylation frequency at each coordinate. If you have already called methylation during basecalling, it may be easier to use modkit to extract the methylation information.

Please let me know if you have any other questions.

sarahjeeeze commented 6 months ago

Closing through lack of response but feel free to re-open if you require continuing this discussion.

jjuhyunkim commented 3 months ago

Sorry for the delayed response. I followed your suggestion, but after running pairtools parse2, I ended up with all NN (null-multi) tags. Here’s what I did:

whatshap haplotag -o $prefix \
$vcf outputs/bams/null.cs.bam \
--ignore-read-groups --output-haplotag-list haplotypes.tsv \
-r $ref

whatshap split --output-h1 h1.bam --output-h2 h2.bam $prefix haplotypes.tsv

prefix=h1
ref_fai=${ref}.fai
fragment='output/pairs/fragments.bed'

args="--drop-sam --drop-seq --expand --add-pair-index --add-columns mapq,pos5,pos3,cigar,read_len,matched_bp,algn_ref_span,algn_read_span,dist_to_5,dist_to_3,mismatches"

samtools sort -n ${prefix}.bam > ${prefix}.ns.bam

pairtools parse2  \
    --output-stats "${prefix}.stats.txt" \
    -c $ref_fai --single-end --readid-transform 'readID.split(":")[0]' \
    $args ${prefix}.ns.bam > ${prefix}.extract_pairs.tmp

I successfully obtained haplotype-specific BAM files from outputs/bams/null.cs.bam.

(poreC) [kimj75@cn4270 output]$ samtools flagstats -@ 50 h1.ns.bam 
20861000 + 0 in total (QC-passed reads + QC-failed reads)
20861000 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
20861000 + 0 mapped (100.00% : N/A)
20861000 + 0 primary mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

And the stats from pairtools parse2 looks like this:

total   20861000
total_unmapped  20861000
total_single_sided_mapped       0
total_mapped    0
total_dups      0
total_nodups    0
cis     0
trans   0
pair_types/NN   20861000

Could you please give me some advice?

LynnLy commented 3 months ago

The newest version of pairtools (1.1.0) gives me a similar result. Can you downgrade to the same version used in pore-c-snakemake and try again?

For example, you could install it using conda with

conda create --name pairtools_1.0.2 -c conda-forge -c bioconda pairtools=1.0.2 numpy=1.20 python=3.8
jjuhyunkim commented 3 months ago

It works now! Thank you so much!!!