chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
533 stars 87 forks source link

Phased diploid human genome with three copies of a ~28 Mbp region #199

Open glogsdon1 opened 3 years ago

glogsdon1 commented 3 years ago

Hi,

I have a human genome that has a ~28 Mbp inverted duplication (invdup) on one of the haplotypes. This invdup region exists 3 times in the genome: once on the paternal chromosome, twice on the maternal chromosome. When I assemble the genome using parental Illumina data and ~40X HiFi data, I get an assembly where there is only one copy of the invdup region on both the maternal and paternal haplotypes. The third copy is not in the final assembly and, I think, may have been purged prior to the generating the last graph. The command I use for the assembly is the following: hifiasm -o J3.hifiasm -t32 -1 J3.mat.yak -2 J3.pat.yak $(cat J3/fastq.fofn)

The log is here: https://eichlerlab.gs.washington.edu/help/glogsdon/Shared/J3_hifiasm_triobinned_assembly.log

Do you have a recommendation for how to generate the assembly that keeps the third copy of the invdup region?

chhylp123 commented 3 years ago

~28 Mbp is very long so that it is hard for hifiasm to totally lose it even there is a bug. Can you find 3 copies in the p_utg.gfa? If you can share the bin files with us, we are happy to check it on our side.

glogsdon1 commented 3 years ago

We're planning to check for the missing contig in the p_utg.gfa, but we haven't run the pipeline yet. In any case, I've uploaded the bin files here for you to check out: https://eichlerlab.gs.washington.edu/help/glogsdon/Shared/J3/

The contig should align to chr8:12,000,000-40,000,000 in T2T-CHM13 v1.1.

chhylp123 commented 3 years ago

Thanks a lot!

chhylp123 commented 2 years ago

@glogsdon1 Could you please also share the parental yak indexes?

glogsdon1 commented 2 years ago

Sure, they are here: https://eichlerlab.gs.washington.edu/help/glogsdon/Shared/J3/J3.mat.yak https://eichlerlab.gs.washington.edu/help/glogsdon/Shared/J3/J3.pat.yak

glogsdon1 commented 2 years ago

Hi Haoyu, were you able to find the missing contig?

chhylp123 commented 2 years ago

@glogsdon1 Sorry for the delay. I will reach out to you this evening.

chhylp123 commented 2 years ago

Hi @glogsdon1, sorry for the delay as it is tricky to debug on the graph. I wonder if this sample is totally diploid. Here is the example: J3

In this subgraph, the blue nodes are in the paternal contigs, the green nodes are in the maternal contigs, while the red nodes are missed in both haplotypes. Please note that these red nodes are also labeled as maternal by trio-binning. So there are two copies on the maternal haplotype and one copy on the paternal haplotype. If you look at this subgraph, it seems there are three haplotypes. So is this sample locally triploid on this region?

lh3 commented 2 years ago

So is this sample locally triploid on this region?

Glennis suggested there are three copies of a 28Mb region in total. It seems that hifiasm has preserved the local structure of the three copies for the most part but failed to resolve the global structure in the final contig assembly. Looking at the subgraph here, I guess automatically deriving the correct assembly will be hard.

@glogsdon1, you may consider to manually separate the two maternal copies. I don't know how hard it can be, though.

glogsdon1 commented 2 years ago

Yes, Heng is correct that there are three copies: two maternal and one paternal. Could UL ONT reads help resolve the graph? If not, is there a way to pull out the individual unitigs to see if I can manually reconstruct it?

lh3 commented 2 years ago

Could UL ONT reads help resolve the graph?

UL ONT should help in theory, but we don't have an automatic method to integrate that yet.

is there a way to pull out the individual unitigs to see if I can manually reconstruct it?

Perhaps you can map unitigs to one copy to extract unitigs from the region. In the hifiasm output GFA, you can see a HG:A tag on each A-line. This tag indicates the parental source: "p" for paternal, "m" for maternal and "a" for ambiguous. Then you can make a good guess about the parental source of unitigs.

chhylp123 commented 2 years ago

@lh3 @glogsdon1 This region is very abnormal and I haven't seen it in diploid samples before. There is a very long bubble chain of >10Mb in length, and every bubble on it has such a weird subgraph. That is, looks like there are 3 sides in each bubble. J3 e: For trio-hifiasm, it has a diploid assumption so that it will output 2 sides of each bubble. So the third side will be ignored on the final contigs. I'm very curious about is this the real biology or hifiasm collapses two copies in mother? Could you please get the IGV screen shot of read alignment around these regions?

glogsdon1 commented 2 years ago

@chhylp123 @lh3 This is real biology and not an artifact from the assembly. This genome has a known 8p inv/dup/del on the maternal chromosome that has been picked up with SNP microarray and FISH. The structure is similar to this:

Screen Shot 2021-11-09 at 1 43 37 PM

Here's an IGV screenshot from aligned HiFi reads to the maternal and paternal contigs that were output by hifiasm (i.e. the assembly is missing the third copy, so reads are mismapping to the maternal contig and showing twice the coverage).

Screen Shot 2021-11-09 at 1 44 35 PM
chhylp123 commented 2 years ago

Probably I can manually extract the missed maternal copy. However, I worry the assembled two maternal copies have switched issues, that is, one contig might have parts of copy 1 and parts of copy 2. This is because both trio-binning and HiFi cannot distinguish them globally.

glogsdon1 commented 2 years ago

@chhylp123 That is a really good point. I think this should be okay for now, since we are trying to define the breakpoints. If you could pull out the missing unitigs/contig, that would be really helpful!

chhylp123 commented 2 years ago

Hi @glogsdon1, I have pushed a version (https://github.com/chhylp123/hifiasm/tree/hifiasm_dev_debug) integrating a new option --kpt-rate. It is used to output the remaining contigs which are missed in both haplotypes. The output file is at *dip.kdp.p_ctg.gfa. For example, --kpt-rate 0.8 means it will output contigs that have less than 80% of reads contained in *dip.hap1.p_ctg.gfa or *dip.hap2.p_ctg.gfa. I have checked J3 sample and *dip.kdp.p_ctg.gfa successfully recovered a 17Mb contig.

glogsdon1 commented 2 years ago

Hi @chhylp123, thanks for updating hifiasm! I'll try out the new version on our cluster and let you know if I have any issues. Do you happen to know if the 17 Mbp contig is from chr8? Hopefully, I can recover a few contigs that make up the ~28 Mbp duplication.

chhylp123 commented 2 years ago

Yes, I can make sure the 17Mb contig comes from that repetitive region of chr8. But I haven't checked where are the remaining ~11Mb regions carefully.

glogsdon1 commented 2 years ago

Hi @chhylp123, I was wondering what the default --kpt-rate value is. I didn't see in in the help message in this version.

chhylp123 commented 2 years ago

Default is 0, so it won't output anything. Set values > 0 for --kpt-rate making hifiasm output something in *dip.kdp.p_ctg.gfa. For 17Mb contig, I remember 50% of reads have been already contained in hap1 and hap2, so that you need to set at least 0.6 for --kpt-rate.

glogsdon1 commented 2 years ago

Thanks for the quick reply! Good to know!