arq5x / poretools

a toolkit for working with Oxford nanopore data
MIT License

yield_plot numbers and axis not making sense... #148

Open nickschurch opened 7 years ago

nickschurch commented 7 years ago

I just ran the following command on some recent Nanopore DRS data:

poretools yield_plot --plot-type reads --saveas read_yield.png mydata/fast5

The resulting plot is: [image: read_yield]

The problem is clear: there are ~1.5e5 reads and ~1e8 bp, but the y-axis suggests there are 1e10(!) reads...

This is poretools 0.5.1, running on Python 2.7.12, on CentOS 6.

nickloman commented 7 years ago

Can you check with the latest poretools installed from this repo?

nickschurch commented 7 years ago

Sorry, I got the wrong version number:

> poretools -v
poretools 0.6.0

I'm running it on our cluster here, though, and I couldn't persuade some of the deps to install locally, so I've had to get this installed by the cluster admin guys, which means fast version checking is gonna be tricky.

donutbrew commented 6 years ago

I have also wondered what the y-axis means in these plots, and I thought I was missing something. It is wacky, and always shows very high numbers. I just replicated this issue with a fresh build of poretools. I also notice that the number of reads reported in the title of the reads plot sometimes differs from the number of reads reported by poretools stats. (Python 2.7.5, matplotlib 2.0.0)

jessMaia commented 6 years ago

There is indeed a bug in this function (poretools yield_plot --plot-type reads), and this appears to be the explanation (I figured it out by inspecting the data frame saved with the --savedf option).

Poretools should plot the first column (not labeled) vs. 'start' (time). Instead, it's plotting 'cumul' vs. 'start'. The first column corresponds to the number of reads. The 'cumul' column, for some reason, is being computed as:

cumul[i] = first_column[i] + cumul[i-1], where i = 1, ..., (number of reads - 1)
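A minimal sketch of that recurrence, assuming the unlabeled first column is simply the 0-based row index (which is what the saved data frame suggests). The function names are hypothetical and this is not poretools' actual code; it only contrasts the running sum of the index with the running read count one would expect on the y-axis.

```python
from itertools import accumulate


def buggy_cumul(n_reads):
    """Running sum of the 0-based row index, following the recurrence
    cumul[i] = first_column[i] + cumul[i-1] with cumul[0] = 0."""
    return list(accumulate(range(n_reads)))


def expected_read_count(n_reads):
    """Running number of reads seen so far: 1, 2, ..., n_reads."""
    return list(range(1, n_reads + 1))


print(buggy_cumul(6))          # [0, 1, 3, 6, 10, 15]
print(expected_read_count(6))  # [1, 2, 3, 4, 5, 6]
```

The first list grows quadratically with the number of reads, while the second grows linearly, which would explain a y-axis that is orders of magnitude too large.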

I'm attaching my reads yield plot: [image: ont12 reads]

I have 2,401,754 reads, which corresponds to the first column. Instead of plotting that number, poretools is plotting the 'cumul' column, whose last value is 2884209937381.

Here's a slice of my data frame:

[data frame beginning]
         cumul          lengths  start
0        0              407      1e-08
1        1              489      1e-08
2        3              525      1e-08
3        6              599      1e-08
4        10             612      1e-08
5        15             654      1e-08
....
[data frame end]
2401751  2884205133876  439      23.9969444544
2401752  2884207535628  468      23.9969444544
2401753  2884209937381  530      23.9969444544
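As a quick arithmetic check on the slice above: if 'cumul' really is the running sum of the 0-based row index, its value at row k should be the closed form k(k+1)/2. The sketch below uses a hypothetical helper, `index_sum`, and only the numbers quoted in this comment.

```python
def index_sum(last_index):
    """Sum 0 + 1 + ... + last_index, using the closed form k(k+1)/2."""
    return last_index * (last_index + 1) // 2


n_reads = 2401754  # number of reads reported above

print(index_sum(5))            # 15, matches row 5 of the slice
print(index_sum(n_reads - 1))  # 2884209937381, matches the last 'cumul' value
```

Both values agree with the data frame, which is consistent with the y-axis showing a sum of row indices and not a read count.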