AntonBankevich / LJA


Purge dups ? #15

Open Johnsonzcode opened 2 years ago

Johnsonzcode commented 2 years ago

Hi @AntonBankevich,

Does the output assembly need to be purged of haplotype duplications? As far as I know, Canu and hifiasm assemblies need an additional purge_dups run.

Thanks in advance!

Sincerely, Johnsonz

AntonBankevich commented 2 years ago

Hi! Thank you for your interest in LJA. The current version of LJA treats a diploid genome as two separate genomes that just happen to be similar. Thus no collapsing of similar sequences is performed, which results in shorter contigs and duplicated sequences. We are currently working on producing both a consensus assembly fully purged of duplications and phased assemblies with much longer blocks (using combinations with other technologies). We plan to present this in the next big release paper.

Johnsonzcode commented 2 years ago

Does that mean LJA produces two haploid genomes of similar size?

Johnsonzcode commented 2 years ago

My assembly is highly fragmented (contigs=8067), and its total size is very large, ~1.7G (expected size ~1.1G). How can I tune parameters such as k and K?

AntonBankevich commented 2 years ago

Hi! Sorry for taking so long to reply. LJA tries to produce two haploid genomes, but this is often not possible because the reads are shorter than the conserved regions with no divergence between paired chromosomes. In the current version we intentionally do not perform any duplication purging, to retain as much information as possible. So, for example, if you use hifiasm, then our output corresponds to their contigs in the r_utg.gfa file. That is why you have many contigs in the output and their total length is high (it should be closer to double the length of the genome). These contigs may be shorter, but they are more "honest". We are working on producing all kinds of contigs, including consensus contigs like hifiasm does, but this is still in progress.
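To put rough numbers on the sizes reported in this thread, here is a minimal sketch in Python. It is not part of LJA. The function names are hypothetical, and the model is a deliberate simplification: it assumes regions where the haplotypes were separated appear twice in the output, collapsed regions appear once, and nothing else (repeats, contamination, missing sequence) affects the total.

```python
def duplication_ratio(assembly_bp: int, haploid_bp: int) -> float:
    """Ratio of total contig length to the expected haploid genome size."""
    if haploid_bp <= 0:
        raise ValueError("haploid_bp must be positive")
    return assembly_bp / haploid_bp


def fraction_in_two_copies(assembly_bp: int, haploid_bp: int) -> float:
    """Estimate the fraction f of the genome present as two haplotype copies.

    Assumed model (an illustration only): total = haploid * (1 + f),
    so f = ratio - 1, clamped to the range [0, 1].
    """
    f = duplication_ratio(assembly_bp, haploid_bp) - 1.0
    return min(1.0, max(0.0, f))


if __name__ == "__main__":
    # Figures quoted earlier in the thread: ~1.7G assembled, ~1.1G expected.
    assembly_bp = 1_700_000_000
    haploid_bp = 1_100_000_000
    print(f"ratio: {duplication_ratio(assembly_bp, haploid_bp):.2f}")
    print(f"two-copy fraction: {fraction_in_two_copies(assembly_bp, haploid_bp):.2f}")
```

A ratio between 1 and 2, as here, fits the explanation above: part of the genome was separated into two haplotypes and part was not.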

Johnsonzcode commented 2 years ago

Thanks a lot. I am trying to assemble a sex chromosome in chicken. It is a single haplotype. But I get:

WARNING: no reads passed the length filter 2500

Here are my read statistics:

(asm_practise) $ seqkit stats hifi_silkie_wmap_zw_unmap_non-supp-secd.fa
file                                        format  type     num_seqs        sum_len  min_len   avg_len  max_len
hifi_silkie_wmap_zw_unmap_non-supp-secd.fa  FASTA   Unlimit   178,129  2,072,830,671        0  11,636.7   27,733

Johnsonzcode commented 2 years ago

Found the problem: if there is an empty read in the FASTA file, LJA shows this warning.

AntonBankevich commented 2 years ago

I see. Personally, I think that blank lines and empty sequences should not be allowed in FASTA files, but I did not find any indication of that in the FASTA file specification. So in the next release blank lines and empty sequences will be allowed in input files. For now, you can use the LJA version from the "bug_fix" branch to access this feature.
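An alternative to switching branches is to drop the empty records before running LJA. The following is a minimal sketch in Python, not part of LJA. The function names are hypothetical, and it assumes plain (uncompressed) FASTA input. It streams the file, skips blank lines, and writes each surviving sequence on a single line.

```python
import sys
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, ignoring blank lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank line: skip it
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(src: TextIO, dst: TextIO) -> int:
    """Copy records with a non-empty sequence; return how many were dropped."""
    dropped = 0
    for header, seq in read_fasta(src):
        if seq:
            dst.write(f">{header}\n{seq}\n")
        else:
            dropped += 1
    return dropped


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python filter_fasta.py input.fa output.fa
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        n = filter_fasta(src, dst)
    print(f"dropped {n} empty record(s)", file=sys.stderr)
```

After filtering, the `min_len` column reported by `seqkit stats` should no longer be 0.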

christinawu2008 commented 2 years ago

My assembly is highly fragmented (contigs=8067), and its total size is very large, ~1.7G (expected size ~1.1G). How can I tune parameters such as k and K?

Hi @AntonBankevich, I have the same question: how should I set appropriate 'k' and 'K'? Also, how can I phase the resulting assembly? Should I purge haplotigs for partial phasing, or use Hi-C reads for a haplotype-resolved assembly?

Thanks! Chen