aaranyue / quarTeT

A telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification
http://atcgn.com:8080/quarTeT/home.html

[Usage and inputs for Quartet] What assembly should be used as input? #31

Closed Isoris closed 3 months ago

Isoris commented 5 months ago

Hello,

Thank you for making quarTeT available.

My question is about the choice of input assembly. From my understanding, AssemblyMapper scaffolds based on homology. What if the scaffolds after AssemblyMapper are too short (1/3 of the genome size)?

In particular, after Hi-C scaffolding (for instance with the 3d-dna assembly pipeline) we get an assembly of the expected genome size. In RagTag.py scaffold, the contigs are oriented based on the reference. How does quarTeT deal with the structure of the assembly? I can see that quarTeT is conservative and filters out the scaffolds NOT present in the reference.

In short, do you think we can start directly at the gap-filling step when using the Hi-C assembly?

Also, after AssemblyMapper, when we concatenate the two haplotypes into a single file (say hap0 + hap1 > hap0_1), the results for hap0_1 are similar to hap0 but dissimilar to hap1. Is it because minimap2 maps the contigs of hap0 first and, since there is only a single hit for each alignment, hap1, which comes later in the file, is ignored?

Thank you for the answers. Quen

Echoring commented 5 months ago

  1. AssemblyMapper is not very different from RagTag. Both of them rely on homology to the reference. If the results are so different, it may be due to inappropriate parameters, for example a required alignment length or similarity that is too high.
  2. Proceeding with the assembly from Hi-C scaffolding is often a better choice.
  3. If the input includes multiple haplotypes, AssemblyMapper may erroneously connect them and create duplications. In your case, though, it looks like the two haplotypes differ a lot and the current parameters do not suit haplotype 1.

Isoris commented 5 months ago

Hello,

I understand points 1 and 2, but for point 3, here is what I found when running AssemblyMapper (-c 25000 -l 5000):

hap0_hic_asm5_c25000_l5000_QUARTET-CMA_HIC_UL_GCA_030347435 1_Clarias_fuscus draftgenome

hap1_hic_asm5_c25000_l5000_QUARTET-CMA_HIC_UL_GCA_030347435 1_Clarias_fuscus draftgenome

hap10_hic_asm5_c25000_l5000_QUARTET-CMA_HIC_UL_GCA_030347435 1_Clarias_fuscus draftgenome

Still, I find it strange that all of the contigs in the combined haplotypes are from hap1. What do you think?

Isoris commented 5 months ago

So I followed your advice. Everything looks good now. Thank you so much.

However, I am just curious why, when we combine haplotypes, only the second set of contigs aligns to the reference and the output looks like this. Does that mean that this set of chromosomes is closer to the reference than the other set? Or is it just that the best hits of the earlier contigs (the first lines of the concatenated combined.fasta) are overwritten by the second set of contigs, which is at the end of combined.fasta? Is it due to some minimap2 settings? Or is it because they have the same names, so that during parsing in AssemblyMapper the dictionary has its values replaced by the second set of contigs?

Anyway.

I will continue with the next steps of the pipeline and skip combining the haplotypes, because it is indeed not necessary for the following steps.

Thank you so much !! 🙏🏻

Echoring commented 5 months ago

Oh, do the contigs of your two haplotypes have the same IDs? In that case, the one that appears later will overwrite the earlier one. This may be the reason.
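The overwrite described above can be illustrated with a minimal Python sketch. It is not AssemblyMapper's actual parsing code; it only assumes that sequences end up in a dictionary keyed by contig ID. The helper names `read_fasta` and `combine_haplotypes`, the `hap0`/`hap1` labels, and the toy sequences are all hypothetical. The second half shows one possible workaround: prefixing every ID with a haplotype label before concatenating.

```python
def read_fasta(lines):
    """Yield (contig_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first word of the header as the ID.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def combine_haplotypes(haplotypes):
    """Return (new_id, sequence) pairs with each ID prefixed by its haplotype label.

    `haplotypes` maps a label (hypothetical, e.g. "hap0") to FASTA lines.
    """
    combined = []
    for label, lines in haplotypes.items():
        for seq_id, seq in read_fasta(lines):
            combined.append((f"{label}_{seq_id}", seq))
    return combined


# Toy data: both haplotypes use the same contig IDs.
hap0 = [">contig1", "ACGT", ">contig2", "GGCC"]
hap1 = [">contig1", "TTTT", ">contig2", "AAAA"]

# Naive concatenation into a dict keyed by ID: later records replace earlier ones.
naive = {}
for seq_id, seq in list(read_fasta(hap0)) + list(read_fasta(hap1)):
    naive[seq_id] = seq
print(naive)  # only the hap1 sequences survive

# Prefixing IDs keeps all four records distinct.
combined = combine_haplotypes({"hap0": hap0, "hap1": hap1})
for new_id, seq in combined:
    print(f">{new_id}\n{seq}")
```

With real files, the same renaming could be applied to each haplotype FASTA before running `cat`, so that every contig ID in the combined file is unique.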

Isoris commented 5 months ago

Ah OK, that makes sense!

In my case I got approximately 28 chromosomes plus small contigs, but GapFiller throws an error (mentioning -f) saying that the contigs are too short and that gaps should not be too close to each other.

I would like to know what to do with the remaining assembled scaffolds, i.e. all the small chunks < 0.5 Mb but > 100 kb. I have read that GapFiller join is still buggy. What do you suggest for dealing with all the small unassembled segments? In some publications I have seen people refer to them as microchromosomes, but from karyotype imaging I didn't see any microchromosomes.

And so, after Hi-C scaffolding with 3d-dna, do we have to start from the centromere step if we already have gap-free, full-length scaffolds?

Thank you for your answer.

Echoring commented 5 months ago

GapFiller throws this error when 2 gaps are too close, indicating that a very small contig (<5000 bp by default) is placed between them. Contigs of this size are usually unreliable, and their position is hard to determine. Including them in gap filling may lead to more erroneous fills, so it is recommended to split off these small segments and leave them as unplaced contigs without gap filling.
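As a rough illustration of that recommendation, the sketch below scans a scaffold for runs of `N`, flags any segment between two consecutive gaps that is shorter than a threshold, and splits those segments off as unplaced contigs. This is not GapFiller's own check: the function names are hypothetical, the 5000 bp default is taken from the reply above, and treating every run of `N` as a gap is an assumption.

```python
import re


def find_gaps(seq):
    """Return half-open (start, end) coordinates of every run of N/n in seq."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]


def short_segments_between_gaps(seq, min_len=5000):
    """Return (start, end) of segments lying between two consecutive gaps
    that are shorter than min_len."""
    gaps = find_gaps(seq)
    short = []
    for (_, prev_end), (next_start, _) in zip(gaps, gaps[1:]):
        if next_start - prev_end < min_len:
            short.append((prev_end, next_start))
    return short


def split_off_short_segments(seq, min_len=5000):
    """Remove short inter-gap segments from seq.

    Returns (cleaned_scaffold, unplaced_segments). The gaps flanking a
    removed segment end up merged into one longer gap.
    """
    short = short_segments_between_gaps(seq, min_len)
    unplaced = [seq[s:e] for s, e in short]
    kept, pos = [], 0
    for s, e in short:
        kept.append(seq[pos:s])
        pos = e
    kept.append(seq[pos:])
    return "".join(kept), unplaced


# Toy scaffold with a tiny 3 bp segment between two gaps (threshold lowered to 10).
scaffold = "A" * 20 + "N" * 5 + "CCC" + "N" * 5 + "G" * 20
cleaned, unplaced = split_off_short_segments(scaffold, min_len=10)
print(short_segments_between_gaps(scaffold, min_len=10))
print(cleaned)
print(unplaced)
```

Segments at the very start or end of a scaffold are not examined here, since they are flanked by only one gap. The unplaced sequences would then be written to a separate FASTA and excluded from gap filling.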