KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0
223 stars 24 forks source link

Genomescope and smudgeplot for genome with several whole-genome duplications #97

Closed VictorHerp closed 2 years ago

VictorHerp commented 2 years ago

Hi Kamil, I would like to ask for your advice with a fairly problematic genome. I used Genomescope to estimate the genome length of the Adriatic sturgeon (Acipenser naccarii, 2n=239 +/- 7 chromosomes, I attached the karyotype) considered to be functionally tetraploid, but the results are confusing. Although I have “good” nucleotide coverage (89x, that is, ~22x per haploid) there is a first peak that seems informative but Genomescope, Tetmer and Smudgeplot indicate that it is probably a peak of errors (although it has a kcov of ~ 12, I attached the genomescope plot). One thing to consider is the evolutionary history of this species which could create a bias in the analysis. It is hypothesized that this species had reached a level of octaploidy (with at least two whole-genome duplications) before undergoing a process of diploidization, therefore it is currently considered to be tetraploid even if some loci could still be octaploid or even diploid. One interesting thing that Hannes Becher told me is that the k-mer spectrum might look like an octaploid (if the multiplicity 12 peak is a data peak and isn’t due to contamination). I was wondering if maybe what Genomescope does is detect the maximum level of polyploidy in the genome even though in theory most of the loci in this genome are tetraploid. So, to corroborate the haplotype structure of the genome I used smudgeplot, which left me with more doubts than answers. According to smudgeplot the genome is pentaploid (something maybe not very likely), but it gives me this warning:

"detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to 35 … the same message up to bins to 14
!! Warning, the two types of estimates of 1n coverage differ a lot (16.5 and 40.0648121650789)
i.e. at least of one of the smudgeplot methods to estimate the haploid coverage got it wrong
Using subset estimate instead of highest peak estimate (less precise but also less often completely wrong)
Does the smudgeplot look sane? Is at least one of the 1n estimates close a GenomeScope estimate?
You can help us improve this software by sharing this strange smudgeplot on https://github.com/KamilSJaron/smudgeplot/issues."

This is the code that I used (L and U based on the genomescope plot), I also tried with “smudgeplot.py cutoff, result: L10, U980”:

####For k-mer spectrum:
kmc -k21 -m100G -t20 -ci1 -cs15000 $DIR1 kmcm8-all $WD (I also tried -cs500000000)
kmc_tools transform kmcm8-all histogram kmcm8-all.hist -cx500000000 (I removed lines with zero)
####For smudgeplot
L=7
U=2000
kmc_tools transform kmcm8-all -ci"$L" -cx"$U" reduce "$NAME"_L"$L"_U"$U"
smudge_pairs "$NAME"_L"$L"_U"$U" "$NAME"_L"$L"_U"$U"_coverages.tsv "$NAME"_L"$L"_U"$U"_pairs.tsv > "$NAME"_L"$L"_U"$U"_familysizes.tsv
smudgeplot.py plot -o Ac_naccarii -t "Acipenser naccarii" "$NAME"_L"$L"_U"$U"_coverages.tsv

I know that the coverage I have is low compared to the requirements for smudgeplot, so I would like to ask you if in your opinion the fact that there are no octaploid haplotype structures and that there are pentaploid structures instead of tetraploids is a problem related to the low kcov for haploid or is there something else that I am missing?

Waiting for your answer, thank you! Víctor Muñoz

Karyotype A  naccarri_2n=239 chromosomes M8_genomescope_KMC-21-mer Ac_naccarii_all_genome_smudgeplot

KamilSJaron commented 2 years ago

Hi VIctor, sorry for the delay, this was a long post and I wanted to find more than few minutes so I could actually give it a more proper thought.

Anyways, this is an interesting genome and exactly the type of a genome we developed smudgeplot for (it was mostly polyploid root-knot nematodes we tried to understand). I agree the smudgeplot has some labeling issues. And the genomescope model also looks a bit funky, but I got to say, this is one of those datasets where I would prefer to try to fit a few models before I will attempt to guess what is what. Would you be willing to share me the kmer pairs file and the k-mer histogram?

Finally, you probably know the expected 1C value / genome size, right? That can be really useful when trying to fit genome models.

VictorHerp commented 2 years ago

Hi Kamil, Thank you so much for the help, I sent you a message to your email to be able to send you the files that you asked for. Greetings.

KamilSJaron commented 2 years ago

Cool (I did see the email - that is one big k-mer pair file :-D )

KamilSJaron commented 2 years ago

I consider this issue resolved, but if anythimg, don't hesitate to reopen the issue please.