how to incorporate information from smudgeplot and genomescope and understand coverage

fancyge commented 3 years ago

Hi Kamil,

I'm using smudgeplot and genomescope2 to understand my data. I have paired reads from 10X genomics and I have used following command.

kmc -k21 -t16 -m64 -ci1 -cs10000 @FILES kmcdb tmp  kmc_tools transform kmcdb histogram kmcdb_k21.hist -cx10000 L=$(smudgeplot.py cutoff kmcdb_k21.hist L)  U=$(smudgeplot.py cutoff kmcdb_k21.hist U)  echo $L $U (L=168, U=980)

kmc_tools transform kmcdb -ci"$L" -cx"$U" dump -s kmcdb_L"$L"_U"$U".dump  smudgeplot.py hetkmers -o kmcdb_L"$L"_U"$U" < kmcdb_L"$L"_U"$U".dump  smudgeplot.py plot kmcdb_L"$L"_U"$U"_coverages.tsv -k 21 -o k21

and it look like this: k21_smudgeplot

The result showed proposed diploid with another smaller smudge AAB. This species is rarely studied and I have few information on the ploidy level or anything of genome structure. I was told the current estimates (800mb) from flow cytometry in our lab cannot be taken seriously either. Also, the initial de novo assembly from different assembly tools gave different sizes as well, ranging from 300-700mb (we also combined reads from illumina and ont for assembly). I'm using smudgeplot and genomescope2 for information of the genome size and structure to re-examine my assembly and investigate further.

My question is how should I incorporate these two tools together when there is no other information for the species? I first tried to feed genomescope with diploid and got "len: Infbp" and quite high duplication rate. Obviously something is wrong but I don't know. The coverage seems not correct as well since I used lots of data.

Rscript genomescope.R -i kmcdb_k21.hist -o . -p 2 -k 21 -n test n5_gs_m100000_p2_linear_plot

I then tried with different ploidy numbers (-p 3; -p 4) which gave me a different picture (linear plots below).

n5_gs_m100000_p3_linear_plot n5_gs_m100000_p4_linear_plot

The dup rates are all very high. Is this signal consistent with the smudge of AAB and does this indicate really high duplication level in the genome? Alternatively, I used a full hist by setting (-cs5e8) to include high repetitive kmers and this gave the same results as above for each ploidy number settings.

Can you please help me with interpreting these results? Can I trust this result based on these information that the estimated genome size is around 200mb? Are the 1n coverage in smudgeplot supposed to be matching the "kcov" in genomescope? I also want to know if it is appropriate to use the ploidy level inferred from smudgeplot to feed genomescope. I appreciate your help!

KamilSJaron commented 3 years ago

Hi @fancyge, sorry for the delay, I just noticed this issue (I must have accidentally clicked on "read").

Your 10x dataset looks really messy. Which is the reason why both smudgeplot and genomescope struggle to fit any model to the data. I suppose, given the flow cytometry, the real 1n coverage will be somewhere in 30 - 100x ballpark, which means that it's impossible to efficiently separate error kmers and real genomic haploid kmers, the two distributions are overlapping too much.

Unlike smudgeplot, genomescope has an ambition to give you at least some idea about the genome, but you might need to give it a more infomation. I would try to fit a diploid model with priorts on the haploid coverage (I think it's -l agument in genomescope). Just to see if it will converge somewhere.

However, that's as much you would be able to get from profiling this dataset, you can try to assemble it, and look at the contamination (as one the possible sources of these messed signals) and also consider getting better data (sorry if it's not a very helpful advice, but there is always just so much you can do when the data go sideways)

KamilSJaron commented 2 years ago

closing due to inactivity, please reopen if you would have more questions.

KamilSJaron / smudgeplot

how to incorporate information from smudgeplot and genomescope and understand coverage #85