Genome gets smaller after correction

Bank-tidy commented 5 months ago

Hello author, I used your software to correct two genomes, and both genomes became smaller. What is the reason? In theory, shouldn't it only correct SNPs and indels?

These are my commands:

minimap2 -ax map-hifi -t 60 LC_filled_N0.fasta /public1/home/yinhang/projects/two_genomes/01_data/HiFi_fastq/LC.ccs.fq | samtools sort -o LC.sort.bam -
samtools index LC.sort.bam

yak count -o k21_ngs.yak -k 21 -b 37 NGS_correct_1.fq.gz NGS_correct_2.fq.gz
yak count -o k31_ngs.yak -k 31 -b 37 NGS_correct_1.fq.gz NGS_correct_2.fq.gz

nextPolish2 -t 60 LC.sort.bam LC_filled_N0.fasta k21_ngs.yak k31_ngs.yak > LC_corrected_N0.fa

Marison-1 commented 5 months ago

I ran into the same problem. I polished 5 assemblies with this tool, and each of them came out 1~4 Mb smaller than both the original assembly and the reference. Maybe this is reasonable, but I would appreciate a professional explanation from the author.

moold commented 4 months ago

It feels like something is wrong; the genome size shouldn't change much. Does the corrected genome have the same number of sequences as the original reference? Could you share the detailed statistics here?

Bank-tidy commented 4 months ago

It did not change much: there are still 12 sequences, and the total size went from 1096.96 Mb to 1096.71 Mb, a change of about 250 Kb. Is that reasonable?

Bank-tidy commented 4 months ago

The detailed changes are as follows (sequence length in bp, before -> after):

group1   132785395 -> 132718027
group2   114892731 -> 114887727
group3   103262822 -> 103224677
group4   100419264 -> 100413931
group5    99907961 ->  99907504
group6    93758231 ->  93727397
group7    93135190 ->  93092808
group8    88913125 ->  88895220
group9    87091357 ->  87077464
group10   85308775 ->  85298291
group11   76449378 ->  76436882
group12   74317084 ->  74304218
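A per-sequence comparison like the one above can be produced directly from the two FASTA files. The following is only a minimal sketch using awk: the function name `length_diff` is hypothetical and is not part of NextPolish2, and it assumes both files are uncompressed, non-empty FASTA files that use the same sequence names.

```shell
# length_diff BEFORE.fa AFTER.fa
# Hypothetical helper (not part of NextPolish2): prints one tab-separated line
# per sequence in AFTER.fa: name, length before, length after, after - before.
# Assumes both files are plain (uncompressed), non-empty FASTA files.
length_diff() {
    awk '
        BEGIN    { OFS = "\t" }
                 { sub(/\r$/, "") }        # tolerate Windows line endings
        FNR == 1 { file++ }                # 1 = BEFORE.fa, 2 = AFTER.fa
        /^>/ {
            name = substr($1, 2)           # sequence name without ">" or description
            if (file == 2 && !(name in seen)) { seen[name] = 1; order[++n] = name }
            next
        }
        file == 1 { before[name] += length($0) }
        file == 2 { after[name]  += length($0) }
        END {
            for (i = 1; i <= n; i++) {
                k = order[i]
                print k, before[k] + 0, after[k] + 0, after[k] - before[k]
            }
        }
    ' "$1" "$2"
}
```

With the file names used in this thread, it would be called as `length_diff LC_filled_N0.fasta LC_corrected_N0.fa`. A negative value in the last column means the sequence became shorter after polishing.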

Looking forward to your reply

moold commented 4 months ago

I ran some tests but could not reproduce a large change in genome size. Could you extract the sequence and BAM file of group7, together with k21_ngs.yak and k31_ngs.yak, and share these files with me? BTW, this is probably caused by some long heterozygous insertions or deletions.

Bank-tidy commented 4 months ago

Of course! But is there a good way to share it with you (about 15G in total)? I'm in China.

Bank-tidy commented 4 months ago

The heterozygosity rate of my sample is indeed somewhat high, and the assembled genome may be larger than expected. In this case, are the corrected results credible? Thank you!

moold commented 4 months ago

You can map the corrected genome to the original reference genome to see where the long changes introduced by NextPolish2 are, and then check the BAM file at those regions.
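One way to do this check is sketched below. It is only an illustration, not an official NextPolish2 workflow: it assumes minimap2 is run with `-c` so that each PAF record carries a `cg:Z:` CIGAR tag, and the function name `long_indels`, the output file name, and the 50 bp threshold are all hypothetical choices.

```shell
# Align the corrected assembly (query) to the original assembly (target).
# -c writes the CIGAR into the cg:Z: tag of each PAF record:
#   minimap2 -cx asm5 -t 60 LC_filled_N0.fasta LC_corrected_N0.fa > corrected_vs_original.paf
#
# long_indels MIN_LEN < alignments.paf
# Hypothetical helper: prints target name, 0-based target position, operation
# (D = bases present in the original but missing from the corrected sequence,
#  I = bases added in the corrected sequence), event length, and query name.
long_indels() {
    awk -v min_len="$1" '
        BEGIN { FS = OFS = "\t" }
        {
            cigar = ""
            for (i = 13; i <= NF; i++)
                if ($i ~ /^cg:Z:/) cigar = substr($i, 6)
            if (cigar == "") next
            cnt = split(cigar, lens, /[MIDN=X]/)   # operation lengths
            split(cigar, ops, /[0-9]+/)            # operation letters, ops[1] is empty
            pos = $8 + 0                           # alignment start on the target
            for (j = 1; j < cnt; j++) {
                n  = lens[j] + 0
                op = ops[j + 1]
                if ((op == "I" || op == "D") && n >= min_len + 0)
                    print $6, pos, op, n, $1
                if (op != "I") pos += n            # insertions do not consume the target
            }
        }'
}
```

For example, `long_indels 50 < corrected_vs_original.paf` lists every insertion or deletion of at least 50 bp. Each reported position can then be inspected in the read alignments, for instance with `samtools view LC.sort.bam group7:START-END`, where START and END are placeholders for the region of interest.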

15G is too big to share, so I will find more datasets to debug this problem, which will take more time.

moold commented 3 weeks ago

We have released a new version with a new option, --max_indel_len. Please try it and see whether it solves this problem.

Marison-1 commented 3 weeks ago

Thank you very much for maintaining this nice tool!

Nextomics / NextPolish2

Genome gets smaller after correction #16