KamilSJaron / smudgeplot

Inference of ploidy and heterozygosity structure using whole genome sequencing data
Apache License 2.0
227 stars 24 forks source link

half genome length from tutorial-strawberry #84

Open fancyge opened 3 years ago

fancyge commented 3 years ago

This seems a silly question, but I really confused with why I get a half estimated compared to the tutorial with the same commands.

So I have followed https://github.com/KamilSJaron/smudgeplot/wiki/tutorial-strawberry to get started with smudgeplot and genomescope. The commands I used:

mkdir -p strawberry_iinumae && cd strawberry_iinumae
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR013/DRR013884/DRR013884_2.fastq.gz

mkdir tmp 
ls DRR013884_1.fastq.gz DRR013884_2.fastq.gz > FILES 
kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmer_counts tmp 

kmc_dump -ci100 -cx3000 kmer_counts kmer_k21.dump 
smudgeplot.py hetkmers -o kmer_pairs < kmer_k21.dump 
smudgeplot.py plot -o f_iinumae -t "Fragaria iinumae" -q 0.99 kmer_pairs_coverages.tsv

## genomescope
kmc_tools transform kmer_counts histogram kmer_k21.hist -cx10000
Rscript genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Fiinumae_genomescope

Actually, smudgeplot gave the same results as shown in the tutorial and estimated as tetraploid. However, genomescope showed 100Mbp in length with -p 2 and the heterozygosity rate is much higher (13.8%). Is the number following "len:" the estimated genome size or should multiply with p? Can you please give me insights on this? I appreciate you help. Fiinumae_genomescope_linear_plot k21_smudgeplot_log10

KamilSJaron commented 3 years ago

The thing with genomescope is that it's all based on guessing right the 1n coverage. The model you posted got it wrong, instead of 146x it estimated twice as much. As a consequence, the real diploid peak is the haploid peak in the model, and real tetraploid peak is consideed the diploid one. All that leads to an unrealistically high heterozygosity estimate (for a strawberry) and about half of the expected genome size.

The genome size usually means the haploid genome size (i.e. counting each chromosome only once) - this is the value reported by genomescope and also the value you find in genome browsers etc. Flow cytometry and polyploidy folks sometimes also talk about the total genomic content in a cell, and that is ploidy * haploid genome size (what you measure using fc).

So, the answer is the model you show here is wrong and I am nor sure why. It could be just a freak convergence, but I should check if i can reconstruct the tutorial with the latest version of genomescope. Would you mind zipping the kmer histogram and posting it here? I don't think I currently have the data by my hand :-)

fancyge commented 3 years ago

Thank you very much for the quick response. I see the kcov 287 is consistent with the coverage in the x-axis in genomescope. Is the 1n coverage inferred from the highest peak in the genomescope plot? I used genomescope v2.0.

kmer_k21.hist.zip

fancyge commented 3 years ago

Hi Kamil,

I just tried with genomescope v1.0 using the same hist file and this time it gave the same as in the tutorial. This seems strange and I hope you can help me if I missed something in v2. Thank you. plot

KamilSJaron commented 3 years ago

Indeed, the older versions of genomescope certainly converged on the expected coverage (~140x) right away, but the latest version really does not - the default run estimates 1n = 280.

If you specify the coverage prior (-l 140) the model converges as expected (pic attached), but that's not completely satisfying.

I am out of my depth @tbenavi1 ... I recalled this tweet, and tried reducing number of rounds - https://twitter.com/t_rhyker/status/1288863398374014979?s=20, but that changes nothing (also the situation is quite different, here there is soooo much coverage)

Fiinumae_genomescope_linear_plot