Object does not exist in this HDF5 file

alexanderwg-ornl commented 2 years ago

Greetings, I'm currently running into an error in Nanodisco that I can't seem to figure out. I have Nanodisco's Singularity container installed on my Linux workstation, and I'm using this command to begin the workflow on some Geobacillus nanopore data I generated:

nanodisco preprocess -p 30 -f dataset/ -o analysis/preprocessed_subset -r /home/nanodisco/test_Geo.fasta -s Geo

The ETA and the progress bar never change, and the error is thrown after about five minutes.

[2021-09-07 09:15:29] Localize all fast5 files.
[2021-09-07 09:15:29] Found 351 fast5 files.
[2021-09-07 09:15:29] Extract sequences from fast5.
Processed fast5 [-------------------------] 0% eta: ?s (elapsed: 00:00:00)
Error in { : task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_004f4ce5-b3a5-4583-aa81-6e7acf854000/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.

Any assistance would be greatly appreciated.

alexanderwg-ornl commented 2 years ago

I did some digging, and my .fast5 files that came off the MinION are structured as such:

/read_ffed8534-2624-4730-ba0c-bc20e047753c Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/Raw Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/Raw/Signal Dataset {9174/Inf}
/read_ffed8534-2624-4730-ba0c-bc20e047753c/channel_id Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/context_tags Group, same as /read_001addd8-e28b-481c-8982-9994e7efb8fc/context_tags
/read_ffed8534-2624-4730-ba0c-bc20e047753c/tracking_id Group, same as /read_001addd8-e28b-481c-8982-9994e7efb8fc/tracking_id

So there's no Analyses group in my files. Should I rerun Guppy and have it output a fast5 file with the necessary information?
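A check like the one described above can be sketched in a few lines. This is only an illustration, assuming the listing comes from `h5ls -r` on a multi-read fast5 file; the function name `reads_missing_basecalls` is hypothetical, and the required path is the one quoted in the error message.

```python
import re

# Path (relative to each read group) that the error message reports as missing.
REQUIRED_SUFFIX = "/Analyses/Basecall_1D_000/BaseCalled_template"


def reads_missing_basecalls(h5ls_lines):
    """Return the read groups that lack the required basecall path.

    `h5ls_lines` is an iterable of text lines as printed by `h5ls -r`.
    Hypothetical helper for illustration; not part of nanodisco.
    """
    reads = set()
    basecalled = set()
    for line in h5ls_lines:
        parts = line.split()
        if not parts:
            continue
        # The first whitespace-separated token is the HDF5 object path.
        match = re.match(r"^/(read_[0-9a-f-]+)(/.*)?$", parts[0])
        if not match:
            continue
        read_id, rest = match.group(1), match.group(2) or ""
        reads.add(read_id)
        if rest == REQUIRED_SUFFIX:
            basecalled.add(read_id)
    return sorted(reads - basecalled)


# Excerpt of the listing posted above.
listing = """\
/read_ffed8534-2624-4730-ba0c-bc20e047753c Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/Raw Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/Raw/Signal Dataset {9174/Inf}
/read_ffed8534-2624-4730-ba0c-bc20e047753c/channel_id Group
""".splitlines()

print(reads_missing_basecalls(listing))
```

Any read reported by this sketch has raw signal only, which matches the situation where basecalls were never written back into the fast5 files.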

touala commented 2 years ago

Hello @alexanderwg-ornl,

Thank you for providing this additional bit of information. Yes, you're right: fast5 files that include the basecalling information are written when Guppy is run with --fast5_out. This will generate the Analyses group for each read. Please let me know if this fixes the issue.

Best,

Alan

alexanderwg-ornl commented 2 years ago

Ah, I was literally typing up a response to myself saying I just figured this out! Alright, I'm going to close this since I'm 99% sure this will solve my problem.

touala commented 2 years ago

Great. Please feel free to re-open it if this does not fix it, or create a new issue if you have any other problems.

wentski commented 2 years ago

Hello, I am getting the same error message when attempting to pre-process a new dataset and was hoping someone could clarify how to fix this. I basecalled the data after sequencing using Guppy and included the --fast5_out flag. Data that was basecalled in MinKNOW during sequencing using (as far as I can tell) the same basecalling model (dna_r9.4.1_450bps_hac.cfg) seems to process fine. Is there some output format difference between models, or some specific basecalling version that needs to be used for nanodisco? It is quite possible that this is something simple that I was just not aware of.

For context, this is the command used for basecalling:

guppy_basecaller --input_path wga_temp --save_path wga_temp_repeat --flowcell FLO-MIN106 --kit SQK-LSK109 --device cuda:0 --fast5_out

And the command used with nanodisco preprocess and output text:

nanodisco preprocess -p 12 -f dataset/wga -s wga -o analysis/preprocessed_wga -r reference/DH5_reference_genome.fasta

[2022-01-10 09:31:25] Localize all fast5 files.
[2022-01-10 09:31:25] Found 20 fast5 files.
[2022-01-10 09:31:25] Extract sequences from fast5.
Processed fast5 [-------------------------] 0% eta: ?s (elapsed: 00:00:00)
Error in { : task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_00133eac-82f9-4f04-b9c5-9f8271de8220/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.

Any help or advice appreciated.

wentski commented 2 years ago

Further to this, I ran h5ls on my fast5 files and got the structure

/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_001 Group
/Analyses/Basecall_1D_001/BaseCalled_template Group
/Analyses/Basecall_1D_001/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_001/BaseCalled_template/Move Dataset {21255}
/Analyses/Basecall_1D_001/Summary Group
/Analyses/Basecall_1D_001/Summary/basecall_1d_template Group
/Analyses/Segmentation_000 Group
/Analyses/Segmentation_000/Summary Group
/Analyses/Segmentation_001 Group
/Analyses/Segmentation_001/Summary Group
/Analyses/Segmentation_001/Summary/segmentation Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_142597 Group
/Raw/Reads/Read_142597/Signal Dataset {106366/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group

This seems to suggest that the BaseCalled_template group that nanodisco preprocess is looking for is under Basecall_1D_001 rather than Basecall_1D_000? Fixing this is beyond the edge of my expertise, so any suggestions on doing so would be greatly appreciated.
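The observation above can be checked mechanically. The following is a minimal sketch, assuming the input is text from `h5ls -r`; the helper name `basecall_groups_with_fastq` is hypothetical. It reports which Basecall_1D_NNN groups actually hold a BaseCalled_template/Fastq dataset.

```python
import re

# Matches the Fastq dataset path and captures the Basecall_1D_NNN group name.
FASTQ_PATH = re.compile(r"/Analyses/(Basecall_1D_\d+)/BaseCalled_template/Fastq$")


def basecall_groups_with_fastq(h5ls_lines):
    """Return the sorted Basecall_1D_NNN groups that contain a Fastq dataset.

    Hypothetical helper for illustration; not part of nanodisco.
    """
    found = set()
    for line in h5ls_lines:
        parts = line.split()
        if not parts:
            continue
        # The first whitespace-separated token is the HDF5 object path.
        match = FASTQ_PATH.search(parts[0])
        if match:
            found.add(match.group(1))
    return sorted(found)


# Excerpt of the listing posted above.
listing = """\
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_001 Group
/Analyses/Basecall_1D_001/BaseCalled_template Group
/Analyses/Basecall_1D_001/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_001/BaseCalled_template/Move Dataset {21255}
""".splitlines()

print(basecall_groups_with_fastq(listing))
```

If Basecall_1D_000 is absent from the result, the path named in the error message cannot exist in that file, which is consistent with the listing.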

wentski commented 2 years ago

Apologies for the mass of comments but I am posting as I troubleshoot. I think I have found a workaround for the above, by going into extract.R and changing the default basecall group. However, I am now getting the following error message at the nanodisco difference stage:

local:2/0/100%/0.0s
Error in .Call2("new_output_filexp", filepath, append, compress, compression_level, : cannot open file '/home/nanodisco/analysis/difference_BREX-p507_test/tmp.3_4.preprocessed_p507_test/p507_test.fasta'
Calls: prepare.index ... writeXStringSet -> -> new_output_filexp -> .Call2
Execution halted

This issue seems to persist for both the previously basecalled samples which were preprocessing fine and for the post-basecalled samples for which I altered the extract.R file, so I am not sure my modification to the script is the issue.
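The thread does not establish what causes this "cannot open file" error. One generic first check, sketched below under the assumption that a missing or unwritable output directory is a possible culprit, is to see how much of the target path exists and whether it is writable. The function names are hypothetical and nothing here is nanodisco-specific.

```python
import os


def first_existing_ancestor(path):
    """Return the deepest directory that exists on the way to `path`."""
    current = os.path.dirname(os.path.abspath(path))
    while not os.path.isdir(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def diagnose_output_path(path):
    """Summarize whether the file at `path` could plausibly be created.

    Hypothetical diagnostic; it only inspects the filesystem.
    """
    parent = os.path.dirname(os.path.abspath(path))
    ancestor = first_existing_ancestor(path)
    return {
        "parent_exists": os.path.isdir(parent),
        "deepest_existing_dir": ancestor,
        "deepest_dir_writable": os.access(ancestor, os.W_OK),
    }


# Path taken from the error message above.
target = ("/home/nanodisco/analysis/difference_BREX-p507_test/"
          "tmp.3_4.preprocessed_p507_test/p507_test.fasta")
print(diagnose_output_path(target))
```

Run inside the container, this shows at which directory level the path stops existing and whether the current user can write there.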

Again, any help appreciated!

touala commented 2 years ago

Hi @wentski,

It's been a while since you posted on this thread, but did you find a solution for your issue? Your troubleshooting was correct. The cleanest solution would be to clear the basecalling information from the fast5 files and restart the analysis. This can be done with compress_fast5 --sanitize from the ONT API.

Feel free to open a new ticket if you still have an issue.

Alan

fanglab / nanodisco

Object does not exist in this HDF5 file #27