chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Polyploidy graph binning of hifiasm #312

Closed zhangyixing3 closed 2 years ago

zhangyixing3 commented 2 years ago

Hello, thank you for producing hifiasm. It is very powerful for simple genomes. I have a large autopolyploid plant genome, and in order to get all chromosomes I have to use p_utg.gfa. I will use the figure to illustrate my question. In the figure, I get unitigs 1-7. However, the actual genome contains unitig 1 * 2, unitig 2, unitig 3 * 2, unitig 4, unitig 5 * 2, unitig 6, and unitig 7. This means that p_utg.fa is smaller than the actual genome, and I have to find a way to duplicate the contigs that are shared between haplotypes. I'm not sure if my idea is correct. Can you give me some advice? Thank you! [Image attached to the original issue]
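To make the size discrepancy concrete, here is a small Python sketch using the copy numbers from the question. The unitig lengths are made-up placeholder values, not taken from any real assembly.

```python
# Copy number of each unitig in the real genome, as described in the question:
# unitigs 1, 3 and 5 are collapsed in the graph and occur twice.
copies = {1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 1}

# Hypothetical unitig lengths in bp (placeholder values for illustration only).
lengths = {1: 500_000, 2: 300_000, 3: 400_000, 4: 200_000,
           5: 600_000, 6: 250_000, 7: 350_000}

# Size of p_utg: every unitig is counted once.
assembled = sum(lengths.values())

# Size of the real genome: each unitig is counted once per copy.
actual = sum(lengths[u] * copies[u] for u in lengths)

# Sequence that is present in the genome but collapsed in the unitig graph.
missing = actual - assembled

print(f"assembled={assembled} actual={actual} missing={missing}")
```

With these placeholder lengths, the collapsed unitigs account for the whole gap between the unitig assembly and the real genome.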

chhylp123 commented 2 years ago

Sorry for the late reply. Here is a pretty nice example if you would like to work on a polyploid genome: https://github.com/baozg/Potato_C88.

zhangyixing3 commented 2 years ago

Thank you for your help. It seems really difficult to answer this question.

zhangyixing3 commented 2 years ago

Sorry to bother you again. I can't understand the polyploidy graph binning of hifiasm. In the autotetraploid potato paper, they use the command below, but they give no additional explanation.

hifiasm -t 64 -o C88 -5 C88.hifiasm.binutg.reads.list --n-hap 4 --hom-cov 120 C88.HiFi.fa.gz
# -5 

In the Hifiasm Parameter Reference, I can't find a description of this parameter. Can you give me some help? Thank you!

baozg commented 2 years ago

I think we have included a brief introduction to -5, which should be enough for others to run this version with their own data (actually, it did work in other species!). I also uploaded the result that I used in our tetraploid potato project. The first column is the haplotype group (we also added the linkage group, but basically hifiasm only uses the four groups _1/_2/_3/_4), and the read names come from the utg GFA (non-contained reads: we binned the unitigs first, then used all non-contained reads from one unitig as a haplotype group). Would you mind giving more detail about your question? Since -5 is a hidden parameter that exists only in the development branch of hifiasm, its description has not been added.

-5 provides the phase information (format below); collapsed regions will use their reads more than once. --n-hap 4 indicates the ploidy, and --hom-cov is the homozygous coverage peak of the assembly.

Group / non-contained HiFi reads:
LG1_1 m64053_200110_120759/100206539/ccs
LG1_1 m64053_200110_120759/100270139/ccs
LG1_1 m64053_200110_120759/100272825/ccs
LG1_1 m64053_200110_120759/100402742/ccs
LG1_1 m64053_200110_120759/100467929/ccs
LG1_1 m64053_200110_120759/100468612/ccs
LG1_1 m64053_200110_120759/100534820/ccs

zhangyixing3 commented 2 years ago

Your work is excellent. I'm a beginner, so I need a bit more detail. I am really interested in the -5 parameter.
I'm curious: what are the non-contained reads in the GFA? I guess they are the CCS reads that make up each unitig in the GFA, and hifiasm then uses this phase information to reassemble. Is that right? Thank you for your kind help.

baozg commented 2 years ago

The non-contained reads are taken directly from the hifiasm output *.p_utg.noseq.gfa. Basically, they are the representative reads selected based on all-to-all overlaps.
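As a rough Python sketch of how such a read list could be built: it assumes that each `A` line in *.p_utg.noseq.gfa is tab-separated with the unitig name in the second column and the read name in the fifth, and that you already have your own unitig-to-group binning. The `unitig_to_groups` mapping, the function names, the toy GFA lines, and the single-space output delimiter are all assumptions for illustration; check them against your own files and the Potato_C88 example before use.

```python
def reads_per_unitig(gfa_lines):
    """Collect read names from 'A' lines, keyed by unitig name."""
    reads = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: A <unitig> <start> <strand> <read> <read start> <read end> ...
        if fields[0] == "A" and len(fields) >= 5:
            reads.setdefault(fields[1], []).append(fields[4])
    return reads

def build_read_list(gfa_lines, unitig_to_groups):
    """Emit '<group> <read>' lines; a unitig binned into several groups
    (a collapsed region) contributes its reads to each of them."""
    out = []
    for unitig, reads in reads_per_unitig(gfa_lines).items():
        for group in unitig_to_groups.get(unitig, []):
            out.extend(f"{group} {read}" for read in reads)
    return out

# Toy GFA content (hypothetical unitigs and coordinates).
gfa = [
    "S\tutg000001l\t*\tLN:i:30000\trd:i:30",
    "A\tutg000001l\t0\t+\tm64053_200110_120759/100206539/ccs\t0\t15000\tid:i:1\tHG:A:a",
    "A\tutg000001l\t9000\t-\tm64053_200110_120759/100270139/ccs\t0\t21000\tid:i:2\tHG:A:a",
    "S\tutg000002l\t*\tLN:i:18000\trd:i:60",
    "A\tutg000002l\t0\t+\tm64053_200110_120759/100272825/ccs\t0\t18000\tid:i:3\tHG:A:a",
]

# Hypothetical binning result: utg000002l is collapsed between two haplotypes.
unitig_to_groups = {
    "utg000001l": ["LG1_1"],
    "utg000002l": ["LG1_1", "LG1_2"],
}

read_list = build_read_list(gfa, unitig_to_groups)
print("\n".join(read_list))
```

The binning step itself (deciding which unitig belongs to which haplotype group) is not covered here; the sketch only shows the bookkeeping once that assignment exists.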