arq5x / poretools

a toolkit for working with Oxford nanopore data
MIT License

When I invoke poretools times command the result is not what is expected. #189

Open PBGLMichaelHall opened 2 years ago

PBGLMichaelHall commented 2 years ago

issue

gringer commented 2 years ago

Poretools was created and developed at a time when fast5 files only had one read per file. Based on the file names, I'd guess you're looking at recent multi-fast5 files (probably from a Flongle), which have multiple reads per file. ONT does provide utilities in their GitHub repository to convert from one to another, but I expect you'll get a better outcome for what you want by looking directly at the read summary output from basecalling.

PBGLMichaelHall commented 2 years ago

OK....

git clone https://github.com/nanoporetech/ont_fast5_api
pip install ./ont_fast5_api

python multi_to_single_fast5.py -i path/to-multi-fast5/directory -s some/output/directory

poretools times /some/output/directory

WARNING:poretools:No start time for fast5.fast5!
WARNING:poretools:No start time for fast5.fast5!
WARNING:poretools:No start time for fast5.fast5!
WARNING:poretools:No start time for fast5.fast5!
. . . .

It can find keyinfo now, but not start times, after converting from multi to single!

PBGLMichaelHall commented 2 years ago

I need specific columns of data generated by poretools times that are not in the sequencing summary text file produced by a MinION run. A Python script reads these columns by name. The following columns are the ones that are needed but not currently generated. Is there a way to derive them from the sequencing summary without using poretools times?

exp_starttime unix_timestamp unix_timestamp_end iso_timestamp read_length day hour minute

PBGLMichaelHall commented 2 years ago

The columns that the sequencing summary text file from a MinION run does contain:

filename read_id run_id batch_id channel mux start_time duration num_events passes_filtering template_start num_events_template template_duration sequence_length_template mean_qscore_template strand_score_template median_template mad_template scaling_median_template scaling_mad_template

gringer commented 2 years ago

I'll repeat that it's really not a great idea to use this old software for processing new data. It seems odd to need UNIX timestamp values (and derived values) for every single read.

ONT changed their time representation between different versions, and may have altered other things with FAST5 files. I think they changed from absolute time to relative time, so adding UNIX timestamp values would require fetching the experiment start time from the sequencing logs.

Or you could add a constant timestamp value of 1st January 2000 to everything, to make it really obvious that the timestamps are incorrect.
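
A minimal sketch of the first option (relative start time plus an experiment start time), in case it helps. It assumes that `start_time` and `duration` in the sequencing summary are seconds relative to the start of the experiment, that the experiment start has already been looked up by hand as an ISO 8601 string, and that `day`, `hour` and `minute` are meant to be taken from the read's UTC start time. The function name `add_absolute_times` is made up for this example; none of this is poretools code, and the output has not been compared against real `poretools times` output.

```python
from datetime import datetime, timedelta, timezone


def add_absolute_times(summary_rows, exp_start_iso):
    """Yield one dict per read with absolute time columns.

    summary_rows: iterable of dicts keyed by sequencing summary column names
                  (read_id, start_time, duration, sequence_length_template).
    exp_start_iso: experiment start as ISO 8601, e.g. "2021-06-01T12:00:00+00:00"
                   (assumed to have been found in the sequencing logs).
    """
    exp_start = datetime.fromisoformat(exp_start_iso)
    if exp_start.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        exp_start = exp_start.replace(tzinfo=timezone.utc)
    exp_start = exp_start.astimezone(timezone.utc)

    for row in summary_rows:
        # Assumption: start_time is seconds since the experiment started.
        start = exp_start + timedelta(seconds=float(row["start_time"]))
        end = start + timedelta(seconds=float(row["duration"]))
        yield {
            "read_id": row["read_id"],
            "exp_starttime": int(exp_start.timestamp()),
            "unix_timestamp": int(start.timestamp()),
            "unix_timestamp_end": int(end.timestamp()),
            "iso_timestamp": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "read_length": int(row["sequence_length_template"]),
            "day": start.day,
            "hour": start.hour,
            "minute": start.minute,
        }
```

The rows can come from `csv.DictReader(handle, delimiter="\t")` opened on the sequencing summary file. If the experiment start time is wrong, every derived value is wrong by the same offset, so it is worth checking one read against the run logs.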

arq5x commented 2 years ago

Completely agree that this is no longer the toolset to use here. I need to update the README and make it obvious that poretools is deprecated owing to all of the ONT changes.

PBGLMichaelHall commented 2 years ago

Which version of poretools has the correct time representation (UNIX)?