LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0
167 stars · 31 forks

Meaning of output in log file #239

Closed: pwh124 closed this issue 1 year ago

pwh124 commented 1 year ago

Hello!

I am trying to make sense of some testing we are doing with different configurations of reference genomes for readfish.

I am taking the log files and processing them in R. The processing takes the log file, which looks like this:

2023-05-19 13:00:52,961 ru.ru_gen 96R/0.19188s
2023-05-19 13:00:53,490 ru.ru_gen 303R/0.31916s
2023-05-19 13:00:53,948 ru.ru_gen 370R/0.37610s
2023-05-19 13:00:54,340 ru.ru_gen 301R/0.36585s
2023-05-19 13:00:54,733 ru.ru_gen 359R/0.35646s
2023-05-19 13:00:55,094 ru.ru_gen 295R/0.31505s
2023-05-19 13:00:55,570 ru.ru_gen 347R/0.38963s

And processes it to look something like this (note: not showing the same piece of the data as above): [screenshot of the processed log table, 2023-05-25]
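For anyone doing similar processing, a line like the ones above can be pulled apart with a short regular expression. This is a hypothetical sketch in Python (the original processing was done in R, and the field names here are my own, not from readfish):

```python
import re

# Matches readfish log lines such as:
#   2023-05-19 13:00:53,490 ru.ru_gen 303R/0.31916s
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<logger>\S+) (?P<reads>\d+)R/(?P<secs>[\d.]+)s$"
)

def parse_line(line):
    """Return (timestamp, read_count, batch_seconds), or None if the line doesn't match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m["timestamp"], int(m["reads"]), float(m["secs"])

ts, reads, secs = parse_line("2023-05-19 13:00:53,490 ru.ru_gen 303R/0.31916s")
print(ts, reads, secs)  # 2023-05-19 13:00:53,490 303 0.31916
```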

I am sure there is still some processing to work out, but what is meant by 303R/0.31916s? Is that the number of full-length reads processed in that time, or does the "303R" mean the number of read chunks processed?

Thanks! Paul

mattloose commented 1 year ago

This is slightly complex because of the way that readfish works. Signal is concatenated over time from the individual chunks of each read. So 303R means that 303 read starts were processed in 0.31916 seconds. Each of those read starts could consist of one or more chunks of read data, depending on how many times the read has been seen before a decision was made.
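As a back-of-envelope check (my arithmetic, not from the thread), that line works out to roughly 949 read starts per second, or about 1.05 ms per read start:

```python
# Figures taken from the example log line "303R/0.31916s".
reads, secs = 303, 0.31916

throughput = reads / secs          # read starts processed per second
per_read_ms = secs / reads * 1000  # mean milliseconds per read start

print(f"{throughput:.0f} reads/s, {per_read_ms:.2f} ms/read")  # 949 reads/s, 1.05 ms/read
```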

I hope that helps.

pwh124 commented 1 year ago

Ahhhh ok, I think I understand.

So if I am processing read 1 and read 2, the initial chunk of both of them would be shown as something like: 2R/0.01s

And then, let's say, a decision is made on read 1 with the initial chunk, but another chunk is needed for read 2. So read 2 would be reported as: 1R/0.02s

So the whole output would be:

2R/0.01s
1R/0.02s

Is that right?

mattloose commented 1 year ago

No not quite.

The time measurement is the total time it has taken to process that number of read starts. But it isn't necessarily true that all read starts are just one chunk in length. Some may be longer than that if they have been seen before.

So you could have 10 reads that consisted of two chunks' worth of data and 25 reads that consisted of one chunk's worth of data. Those 35 reads together took x seconds to process.
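That example batch can be sketched like this (a toy illustration with made-up numbers; the batch time is assumed, not taken from any real log):

```python
# Hypothetical batch: each entry is the number of chunks accumulated for one read start.
batch = [2] * 10 + [1] * 25  # 10 reads seen twice, 25 reads seen once
batch_seconds = 0.04         # assumed total processing time for this batch

reads = len(batch)   # 35 read starts -> the "R" figure in the log line
chunks = sum(batch)  # 45 chunks of signal actually processed

print(f"{reads}R/{batch_seconds}s covers {chunks} chunks")  # 35R/0.04s covers 45 chunks
```

So the logged count is per read start, while the amount of signal behind each one varies.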

Is that clearer?

pwh124 commented 1 year ago

Oh ok, yes this makes sense now. Thanks!