read does not exist in HDF5 file

hamid89 commented 11 months ago

Hello,

I used the nanodisco preprocess with the following command:

nanodisco preprocess -p 10 -f original_DNA_fast5/ -s native_samples -r assembly/meta_assembly.fasta -o .bam

I got an error: task 1 failed with the following message:

Object '/read_001939f8-f602-4a6a-b610-06b1f2166001/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file
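The error above says that the read's group has no `Analyses/Basecall_1D_000/BaseCalled_template` entry, i.e. the fast5 file holds raw signal but no basecalls. A minimal sketch for checking this before running the preprocess step, assuming a multi-read fast5 layout with top-level `read_<id>` groups (as the path in the error suggests) and the third-party `h5py` package; the helper names are hypothetical and not part of nanodisco:

```python
def list_hdf5_paths(fast5_path):
    """Return every object path stored in an HDF5/fast5 file."""
    import h5py  # third-party; imported lazily so the helper below works without it

    names = []
    with h5py.File(fast5_path, "r") as handle:
        # visit() calls the function with each object name, relative to the root
        handle.visit(names.append)
    return names


def missing_basecalls(paths, group="Analyses/Basecall_1D_000/BaseCalled_template"):
    """Return the read groups (read_<id>) that lack the basecall group."""
    known = set(paths)
    reads = sorted({p.split("/", 1)[0] for p in paths if p.startswith("read_")})
    return [read for read in reads if f"{read}/{group}" not in known]
```

Calling `missing_basecalls(list_hdf5_paths("some_file.fast5"))` on one of the input files would list the reads that trigger this error; an empty list means every read carries the basecall group.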

Can you please guide me through this issue and tell me what I am doing wrong?

Thank you.

Best,

Hamid

replikation commented 11 months ago

Same issue. I used the fast5 files that MinKNOW produces. I also don't see any fast5 output anymore in recent guppy versions, so I don't have a clue how to get the fastq "base info" into the fast5 data. Is it possible to run the preprocess with fast5 and fastq as input?

hamid89 commented 11 months ago

@replikation I believe I sorted out the problem (still waiting for the nanodisco preprocess results). You need to run guppy on the raw-signal fast5 data with the option '--fast5_out'. The resulting fast5 files will be in /your_basecalled_directory/workspace. These are the fast5 files you give to the 'nanodisco preprocess' command.
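The two steps described above can be sketched as follows. This is only an illustration under assumptions: the directory names and the config file name are placeholders, the helper names are hypothetical, and '--fast5_out' is only accepted by guppy versions that still ship that option.

```python
from pathlib import Path


def guppy_fast5_command(input_dir, save_dir, config):
    """Build the argument list for a guppy run that writes basecalled fast5 files."""
    return [
        "guppy_basecaller",
        "--input_path", str(input_dir),  # raw-signal fast5 files
        "--save_path", str(save_dir),    # output directory
        "--config", config,              # placeholder; choose the config matching your flow cell
        "--fast5_out",                   # only available in older guppy releases
    ]


def workspace_dir(save_dir):
    """Directory where the basecalled fast5 files end up, per the comment above."""
    return Path(save_dir) / "workspace"
```

The list returned by `guppy_fast5_command("raw_fast5", "basecalled", "dna_r9.4.1_450bps_hac.cfg")` could be passed to `subprocess.run`, and `workspace_dir("basecalled")` would then be the directory given to `-f` in the `nanodisco preprocess` command from the first post.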

replikation commented 10 months ago

@hamid89

I think this flag is gone now?

guppy_basecaller --fast5_out
Unexpected token '--fast5_out' on command-line.

This is from guppy version:

guppy Basecalling Software, (C) Oxford Nanopore Technologies plc. Version 6.5.7+ca6d6af
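Rather than hitting the "Unexpected token" message mid-run, one could check up front whether the installed basecaller still lists a given option. A rough sketch, assuming the executable prints its options when called with `--help`; the function names are hypothetical:

```python
import re
import subprocess


def flag_is_listed(help_text, flag):
    """True if `flag` appears as a whole option name in the help text."""
    pattern = rf"(?<![\w-]){re.escape(flag)}(?![\w-])"
    return re.search(pattern, help_text) is not None


def basecaller_help(executable="guppy_basecaller"):
    """Capture the help output of the basecaller (stdout and stderr combined)."""
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

With this, `flag_is_listed(basecaller_help(), "--fast5_out")` would report whether the installed version still offers the option.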

hamid89 commented 10 months ago

I am not familiar with the newer versions of guppy, since it is going to be replaced by dorado anyway. May I ask which flow cell your data come from? Nanodisco supports R9.4 flow cells, and you need both whole-genome-amplified and native DNA sequencing of the same samples.

replikation commented 10 months ago

Ah, we are using R10.4.1, and yes, both read sets are available. Dorado basecaller help:

Positional arguments:
  model                         the basecaller model to run. 
  data                          the data directory. 

Optional arguments:
  -h, --help                    shows help message and exits 
  -v, --verbose                 
  -x, --device                  device string in format "cuda:0,...,N", "cuda:all", "metal", "cpu" etc.. [default: "cuda:all"]
  -l, --read-ids                A file with a newline-delimited list of reads to basecall. If not provided, all reads will be basecalled [default: ""]
  --resume-from                 Resume basecalling from the given HTS file. Fully written read records are not processed again. [default: ""]
  -n, --max-reads               [default: 0]
  --min-qscore                  [default: 0]
  -b, --batchsize               if 0 an optimal batchsize will be selected. batchsizes are rounded to the closest multiple of 64. [default: 0]
  -c, --chunksize               [default: 10000]
  -o, --overlap                 [default: 500]
  -r, --recursive               Recursively scan through directories to load FAST5 and POD5 files 
  --modified-bases              [nargs: 1 or more] 
  --modified-bases-models       a comma separated list of modified base models [default: ""]
  --modified-bases-threshold    the minimum predicted methylation probability for a modified base to be emitted in an all-context model, [0, 1] [default: 0.05]
  --emit-fastq                  Output in fastq format. 
  --emit-sam                    Output in SAM format. 
  --emit-moves                  
  --reference                   Path to reference for alignment. [default: ""]
  -k                            k-mer size for alignment with minimap2 (maximum 28). [default: 15]
  -w                            minimizer window size for alignment with minimap2. [default: 10]
  -I                            minimap2 index batch size. [default: "16G"]

hamid89 commented 10 months ago

But nanodisco doesn't support data generated on R10 flow cells.

replikation commented 10 months ago

Okay, that's kind of an issue for this workflow then. But thanks for responding.

fanglab / nanodisco

read does not exist in HDF5 file #77