PacificBiosciences / pb-CpG-tools

Collection of tools for the analysis of CpG data
BSD 3-Clause Clear License
73 stars 6 forks source link

how do I know If haplotype information is present? #12

Closed Raysun61 closed 2 years ago

Raysun61 commented 2 years ago

I am sorry for being a fresh learner. I just want know in what file haplotype information should be present.

Right now, I have fasta files below: AJ.hap1.fa AJ.hap1.fa.fai AJ.hap1.fa.stats AJ.hap2.fa AJ.hap2.fa.fai AJ.hap2.fa.stats AJ.pri.fa AJ.pri.fa.fai AJ.pri.fa.stats

So, how can I get haplotype information bed/bw like example? Or, just run it three times separately ?

ctsa commented 2 years ago

The model will make separate haplotype pileup tracks when haplotype tags are found in the input BAM file, based on the HP tag on different reads, where the value 1 or 2 are used to indicate the haplotype and 0 is used to indicate no haplotype. This formatting corresponds to the output of a read-backed phasing tool such as Whatshap.

The tag that the method looks for can be changed using the --hap_tag argument.

dportik commented 2 years ago

Hi @Raysun61 Just to clarify further, the input file for the pileup script is an aligned BAM file. To make use of haplotype information, the reads must have associated HP tags.

In your case, it looks like you are starting with different fasta files for each haplotype. There are a couple of options to get to haplotype-specific results:

1) Map all of your reads to the reference and assign HP tags per read. This may not be realistic given your current set of files.

2) You could potentially align each set of haplotype reads to the reference to get two BAM files (one per haplotype). If you do this, you would then run the script twice to analyze each haplotype BAM. There would not be any haplotype tags present, so the only output would be the combined results. In this case, the combined results would actually represent the results for a given haplotype. This is not the ideal way to use the workflow, but it is a possible solution.

Raysun61 commented 2 years ago

Thanks for your reply. In fact, the problem is I don't know how to get HP tags. In my case, I use the pipeline below:

1. get kinetics information

cd 1.ccs_hifi-kinetics ccs --hifi-kinetics --min-passes 1 ../0.rawdata/hb.c1.subreads.bam hb.c1.kinetics.ccs.bam

2. get 5mC tags

cd 2.primrose primrose ../1.ccs_hifi-kinetics/hb.c1.kinetics.ccs.bam hb.c1.5mc.ccs.bam

3. align hb.c1.5mc.ccs.bam to moso_genome.fa

cd 3.pbmm2 pbmm2 align ../0.rawdata/moso_genome.fa ../2.primrose/hb.c1.5mc.ccs.bam hb2mb.bam --sort -j 120 -J 8 samtools index hb2aj.bam -@ 140

4. analyze CpG data

cd 4.CpG conda activate cpg python pb-CpG-tools-1.0.0/aligned_bam_to_cpg_scores.py -b ../3.pbmm2/hb2mb.bam -f ../0.rawdata/moso_genome.fa -o hb2mb -s 500000 -t 48

additionally, get two haps using hifiasm from hifi reads:

HB.hap1.fa HB.hap2.fa HB.primary.fa

I am confused how to tag each read in a BAM file with HP tags. I read WhatsHap guide given @ctsa , it seems to need VCF file to complete:

image

However, hifiasm doesn't output VCF files. So, must I run WhatsHap to get VCF files first?(but my other analysis is based on hifiasm's out.) Is there any simpler resolution like some useful pacbio-tools?😭

dportik commented 2 years ago

Hi @Raysun61, You could try using this workflow to add the HP tags to the aligned BAM: https://github.com/williamrowell/hifi-deepvariant-snakemake

In general, you would need to: 1) add kinetics information (already completed) 2) add basemods information (already completed) 3) align to reference (already completed) 4) call variants 5) phase variants 6) apply phase information to BAM (gets you the HP tags)