Open carlferr opened 1 year ago
Hi @carlferr,
The difference in file structure is due to the various types of fast5 existing, single vs multiple reads per file. But this should not be a problem as nanodisco
can handle both types. To push the diagnosis further, I would use h5dump
or HDFView to look at the exact structure in your fast5
, especially at the /read_000272ba-22a8-4660-bdcd-546de7a4c075/Analyses/Basecall_1D_000/BaseCalled_template
path.
Alan
Hi @carlferr ,
I don't know if you have resolved the issue but I had the same error message and then I realised I used a raw fast5 file with singals only. I think you need to use the basecalled fast5 files.
Tongzhou
Just wanted to follow up on Tongzhou's comment - do the basecalled fast5 files need to be used?
Hi @starkeynight,
Yes, you need to use the basecalled fast5 files generated with the --fast5_out
option.
Alan
HI,
I seem to be running into similar issues with this.
What program is to be used in which the --fast5_out option can be used? Is that guppy? Is that possible to install on a local server, or does this need to be set on the nanopore machine? Is it possible to basecall fast5 files that already have been generated?
Thanks! Arkadiy
Hi @Arkadiy-Garber,
Yes, --fast5_out
is an option for guppy
up to version v6.3.8
. You can install guppy on any machine but would need access to a GPU for the best performance. Yes, all fast5 can be rebasecall at will as the raw sequencing signal is conserved (i.e. variation in current across pore).
Alan
Excellent, thanks Alan! Unfortunately, we do not have a GPU, but have plenty of CPUs. Do you think this will be possible, or will it take too long to do without a GPU?
Thanks again, Arkadiy
I believe this is doable but I have no experience with CPU basecalling of nanopore data. If you have a lot of CPU ressources, maybe try to parallelize as much as possible, maybe even by splitting fast5 into subsets... Let me know how it goes.
Alan
I've experienced this same issue as well, on fast5 files that were basecalled with Guppy v6.3.8. When I use h5dump to look at the problematic read, I see that the "/read_00035d06-15d4-46c6-8538-38e8a2645fa2/Analyses/Basecall_1D_000/BaseCalled_template" field exists and contains fastq-formatted data. I've attached the h5dump output to help with troubleshooting.
Hello @awilson0,
Thank you for reaching out and providing a file for troubleshooting. Could you take a look at this example of analysis with Guppy 6.3.8
to see if a step can fix your issue?
Best,
Alan
@touala, I ran the same basecalling/compression commands in that example and nanodisco preprocess still gave the same error.
Error in { :
task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_00035d06-15d4-46c6-8538-38e8a2645fa2/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
@awilson0, do you mind sending me some of the fast5 with issue (alan.tourancheau@bio.ens.psl.eu)? I don't see a reason right now, and it would help me diagnose the problem.
Best,
Alan
Hello @awilson0,
Thanks again for sharing some of the problematic data.
In your subset of reads, only the WGA dataset triggered an error from nanodisco preprocess
.
[2023-04-13 12:40:49] Localize all fast5 files.
[2023-04-13 12:40:49] Found 1 fast5 files.
[2023-04-13 12:40:49] Extract sequences from fast5.
Processed fast5 [-------------------------] 0% eta: ?s (elapsed: 00:00:00)Error in { :
task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_000922a6-2388-4602-a2b2-7ed45d4ef43a/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.
Looking at the fast5 with HDFView, I found that the underlying issue is that one of the read in the fast5 was not basecalled... I've sanitized the fast5 and rebasecalled it which fixed the error from nanodisco
.
compress_fast5 -t 9 --recursive --sanitize -c gzip -i WGAfast5_basecalled -s WGAfast5_sanitized
/path/to/guppy/6.3.8/bin/guppy_basecaller --disable_pings --gpu_runners_per_device 8 --chunks_per_runner 1024 --chunks_per_caller 10000 --num_callers 5 --cpu_threads_per_caller 1 --recursive -i WGAfast5_sanitized --fast5_out -s WGAfast5_basecalled2 -c dna_r9.4.1_450bps_sup.cfg --device cuda:0
singularity exec nanodisco.sif nanodisco preprocess -p 10 -f WGAfast5_basecalled2 -s wga -o results/preprocessed -r svc1m_polished_assembly.fasta
[2023-04-13 14:34:58] Localize all fast5 files.
[2023-04-13 14:34:58] Found 1 fast5 files.
[2023-04-13 14:34:58] Extract sequences from fast5.
[2023-04-13 14:37:06] All fast5 files processed.
[2023-04-13 14:37:07] Map reads.
[2023-04-13 14:37:31] Index alignment.
[2023-04-13 14:37:31] Done.
But it looks like you would've catch it given this comment. So there might be other problems I've not seen. Would you be able to sanitize and basecall your dataset again?
Best,
Alan
Okay, I figured out what was happening. Somehow, I ended up with the same read in two files. When I basecalled the files individually, I didn't have any issues, but when I basecalled the entire dataset, I would get:
Error loading read "read_id" from file "file": Duplicate read ID detected, skipping second instance of this read ID
I couldn't find a guppy option to exclude skipped reads from the --fast5_out
output, but I wrote a quick Python script to remove them.
Great to hear. If the problem persist for this set of files you can also use fast5_subset
from ont-fast5-api
to subset the fast5 and filter out the problematic reads.
Thank you for the feedback.
Hi, I'm using nanodisco to analyse methylation profile from bacteria isolates. I installed singularity (version 3.5.3) and nanodisco (v1.0.3), then I tried to import my fast5 files in nanodisco workflow.
here the CL:
It seems that everything was fine with the software. Then I tried to run preprocess command but there was an error:
When I run h5ls command on my data (outside nanodisco ), I get following output:
[...] and so on for each read stored in fast5 file.
If I run the same command on fast5 files of nanodisco tutorial (obtained via
get_data_bacteria
inhome/nanodisco/dataset/EC_NAT/
folder ) I obtain:It seems that I have different file structure based on h5ls command. How can I obtain same file structure? Could you help me to solve this issue? Thanks in advance.