PengNi / deepsignal2

GNU General Public License v3.0
27 stars 4 forks

tombo preprocess error #13

Open · ZabalaAitor opened this issue 1 year ago

ZabalaAitor commented 1 year ago

Hi,

I followed the README and got an error after generating the .fastq file with guppy_basecaller.

These are the commands I used:

.../guppy_basecaller -i .../fast5_files/fast5_pass/barcode01/ -r -s fast5s_guppy --config .../guppy/ont-guppy-cpu/data/dna_r9.4.1_450bps_hac_prom.cfg
cat fast5s_guppy/pass/*.fastq > fast5s_guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir .../fast5_files/fast5_pass/barcode01/ --fastq-filenames fast5s_guppy.fastq --sequencing-summary-filenames fast5s_guppy/sequencing_summary.txt --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10

And this is the error:

[10:14:11] Getting read filenames.
[10:14:11] Parsing sequencing summary files.
******************** WARNING ********************
    Some FASTQ records from sequencing summaries do not appear to have a matching file.
[10:14:11] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.
0it [00:07, ?it/s]
[10:14:19] Added sequences to a total of 0 reads.

Thanks.

Aitor Zabala

PengNi commented 1 year ago

Hi @ZabalaAitor ,

Maybe this is related to the VBZ compression issue. Please try to add the VBZ plugin to your environment to see if it works.

# download ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz (or newer version) and set HDF5_PLUGIN_PATH
# https://github.com/nanoporetech/vbz_compression/releases
wget https://github.com/nanoporetech/vbz_compression/releases/download/v1.0.1/ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz
tar zxvf ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz
export HDF5_PLUGIN_PATH=/absolute/path/to/ont-vbz-hdf-plugin-1.0.1-Linux/usr/local/hdf5/lib/plugin
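As a quick sanity check (a sketch assuming hdf5-tools is installed; read.fast5 is a placeholder for one of your files), dumping dataset values forces HDF5 to run the VBZ filter, so it fails if the plugin is still not found:

# with HDF5_PLUGIN_PATH exported, this should print the raw signal values;
# without the plugin it errors on the VBZ-compressed Signal dataset
h5ls -r -d read.fast5 | head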

Best, Peng

ZabalaAitor commented 1 year ago

Hi @PengNi ,

Thank you very much for your quick response. It is still not working, though...

Best,

Aitor Zabala

PengNi commented 1 year ago

@ZabalaAitor , are your fast5s in single-read format? Maybe it is a multi-read format issue. Please check the Usage section of the README.
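If it helps, here is a minimal sketch for checking and converting (assuming hdf5-tools and ont-fast5-api are installed; the paths are placeholders):

# single-read fast5s show Raw/ and UniqueGlobalKey/ at the top level,
# while multi-read fast5s show one read_<uuid> group per read
h5ls /path/to/one_of_your.fast5

# if they are multi-read, convert with multi_to_single_fast5 from
# ont-fast5-api (pip install ont-fast5-api)
multi_to_single_fast5 --input_path /path/to/multi_read_fast5s --save_path /path/to/single_read_fast5s --threads 10 --recursive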

Best, Peng

ZabalaAitor commented 1 year ago

It could be... I will let you know if it was a multi-read format issue.

Best, Aitor Zabala

ZabalaAitor commented 1 year ago

It works!

Thanks. Aitor Zabala

sagnikbanerjee15 commented 1 year ago

I am facing a similar error.

[18:46:58] Getting read filenames.
[18:46:59] Parsing sequencing summary files.
******************** WARNING ********************
    Some FASTQ records from sequencing summaries do not appear to have a matching file.
[18:47:07] Annotating FAST5s with sequence from FASTQs.
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tombo/_preprocess.py", line 148, in _feed_seq_records_worker
    fastq_rec = list(islice(fastq_fp, 4))
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I have used multi_to_single_fast5 to convert the fast5 files from multi-read to single-read format, but it still does not work. Could you please take a look?

I am executing the following command:

tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5_pass_barcode77 --fastq-filenames fastq_pass/barcode77/fastq_pass_barcode77.fastq --sequencing-summary-filenames ../sequencing_summary_FAT23762_13f74adb.txt --overwrite --processes 8

Thank you.

PengNi commented 1 year ago

@sagnikbanerjee15 , I am not sure, as I have not encountered this before. Maybe it is a Python-version issue? You can also check the tombo repo to see whether the same problem has been reported there, or ask about it there.
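One more thing that might be worth checking: byte 0x8b at position 1 matches the gzip magic number (1f 8b), so the traceback is consistent with a gzip-compressed FASTQ being opened as plain text. A minimal check, reusing the path from your command:

# "gzip compressed data" in the output means the file must be
# decompressed before tombo can read it as text
file fastq_pass/barcode77/fastq_pass_barcode77.fastq

# if it is gzipped, rename and decompress, then rerun tombo preprocess
mv fastq_pass/barcode77/fastq_pass_barcode77.fastq fastq_pass/barcode77/fastq_pass_barcode77.fastq.gz
gunzip fastq_pass/barcode77/fastq_pass_barcode77.fastq.gz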

Best, Peng