arangrhie / merfin

Evaluate variant calls and its combination with k-mer multiplicity
Apache License 2.0

Input for cartesian plot #6

Closed ASLeonard closed 3 years ago

ASLeonard commented 3 years ago

I haven't run merfin on my Illumina data yet, and I wasn't entirely clear on the usage of the cartesian plot scripts. As I understand it, the input for cartesian_plot.R is the output of simplify_dump.sh, and the input for that should be $1=illumina.dump and $2=hifi.dump. Is that right?

ASLeonard commented 3 years ago

I tried it anyway with cut ... $illum.dump | paste $hifi.dump - | ..., so the axes may be flipped from the labels.

This was using the merged hap1 + hap2 fasta file with HiFi and short reads, but the short reads had fairly low coverage (~16x).

[Figure: merged correlation]

There is an approximate R of -0.03, but the three most frequent values below account for ~61% of all points, so they probably bias that heavily.

3185301570      0.00    0.00
348440924       0.00    -1.00
143715045       0.00    1.00
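
For anyone wanting to check how much the dominant rows drive R, here is a minimal Python sketch of a count-weighted Pearson correlation. It assumes each row is (count, value 1, value 2), which is how I read the three rows above; `weighted_pearson` is a hypothetical helper, not part of merfin or of cartesian_plot.R, and this may not be how the plot script computes its R.

```python
import math

def weighted_pearson(rows):
    """Pearson R for points given as (count, a, b) rows.

    Assumption: each row stands for `count` identical points at (a, b).
    Returns NaN when either coordinate has zero variance.
    """
    rows = list(rows)
    total = sum(c for c, _, _ in rows)
    mean_a = sum(c * a for c, a, _ in rows) / total
    mean_b = sum(c * b for c, _, b in rows) / total
    # Count-weighted sums of squares and cross products around the means.
    cov = sum(c * (a - mean_a) * (b - mean_b) for c, a, b in rows)
    var_a = sum(c * (a - mean_a) ** 2 for c, a, _ in rows)
    var_b = sum(c * (b - mean_b) ** 2 for c, _, b in rows)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)

# The three rows quoted above, as (count, a, b).
top3 = [
    (3185301570, 0.00, 0.00),
    (348440924, 0.00, -1.00),
    (143715045, 0.00, 1.00),
]
r_top3 = weighted_pearson(top3)
```

On these three rows alone the first coordinate is constant, so R is undefined (NaN). Running the same function on the full table, with and without these rows, would be one way to see how strongly they pull the overall value.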

It is interesting that the two axes are pretty heavily populated, but not the diagonal. I guess this may demonstrate that k-mer bias for HiFi is pretty independent of k-mer bias for short reads?

arangrhie commented 3 years ago

Hi @ASLeonard, just saw this now. Sorry for the silence!

Yes, as far as I can tell, the k-mer bias was independent, so to speak. The different error modes in HiFi and Illumina seem to be the cause of this; we found homopolymer and microsatellite contraction in HiFi reads and the long-known GC biases in Illumina reads as shown here in T2T-CHM13.