Closed alexanderwg-ornl closed 2 years ago
I did some digging, and my .fast5 files that came off the MinION are structured as such:
/read_ffed8534-2624-4730-ba0c-bc20e047753c Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/Raw Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/Raw/Signal Dataset {9174/Inf}
/read_ffed8534-2624-4730-ba0c-bc20e047753c/channel_id Group
/read_ffed8534-2624-4730-ba0c-bc20e047753c/context_tags Group, same as /read_001addd8-e28b-481c-8982-9994e7efb8fc/context_tags
/read_ffed8534-2624-4730-ba0c-bc20e047753c/tracking_id Group, same as /read_001addd8-e28b-481c-8982-9994e7efb8fc/tracking_id
So there's no Analyses group in my files. Should I rerun Guppy and have it output a fast5 file with the necessary information?
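For anyone who wants to run the same check across many reads, here is a minimal sketch that scans the text of an `h5ls -r` listing like the one above and reports reads without an Analyses group. It assumes the multi-read layout shown above (`/read_<id>/...`); the function name `reads_missing_analyses` is made up for illustration and is not part of nanodisco or any ONT tool.

```python
import re

# Matches "/read_<id>" optionally followed by a sub-path, then whitespace.
READ_RE = re.compile(r"^/(read_[0-9a-f-]+)(/\S*)?\s")

def reads_missing_analyses(h5ls_lines):
    """Return the read IDs that have no /read_<id>/Analyses group.

    h5ls_lines: lines of text as printed by `h5ls -r <file>.fast5`.
    """
    has_analyses = {}
    for line in h5ls_lines:
        match = READ_RE.match(line)
        if not match:
            continue
        read_id, sub_path = match.group(1), match.group(2) or ""
        has_analyses.setdefault(read_id, False)
        if sub_path == "/Analyses" or sub_path.startswith("/Analyses/"):
            has_analyses[read_id] = True
    return sorted(r for r, present in has_analyses.items() if not present)

# Hypothetical listing: the first read was never basecalled into the fast5.
listing = [
    "/read_aa Group",
    "/read_aa/Raw Group",
    "/read_aa/Raw/Signal Dataset {9174/Inf}",
    "/read_bb Group",
    "/read_bb/Analyses Group",
    "/read_bb/Analyses/Basecall_1D_000 Group",
]
print(reads_missing_analyses(listing))
```

A non-empty result means those reads carry only raw signal, which matches the situation described above.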
Hello @alexanderwg-ornl,
Thank you for providing this additional information. Yes, you're right: fast5 files that include the basecalling results are written when Guppy is run with --fast5_out. This will generate the Analyses groups for each read. Please let me know if this fixes the issue.
Best,
Alan
Ah, I was literally typing up a response to myself saying I just figured this out! Alright, I'm going to close this since I'm 99% sure this will solve my problem.
Great. Please feel free to re-open it if this does not fix it, or create a new issue if you have any other problems.
Hello, I am getting the same error message when attempting to preprocess a new dataset and was hoping someone could clarify how to fix this. I basecalled the data after sequencing using Guppy and included the --fast5_out flag. Data that was basecalled in MinKNOW during sequencing using (as far as I can tell) the same basecalling model (dna_r9.4.1_450bps_hac.cfg) seems to process fine. Is there some difference in output format between models, or a specific basecaller version that needs to be used for nanodisco? It is quite possible that this is something simple that I was just not aware of.
For context, these are the commands used for basecalling:
guppy_basecaller --input_path wga_temp --save_path wga_temp_repeat --flowcell FLO-MIN106 --kit SQK-LSK109 --device cuda:0 --fast5_out
And the command used with nanodisco preprocess and output text:
nanodisco preprocess -p 12 -f dataset/wga -s wga -o analysis/preprocessed_wga -r reference/DH5_reference_genome.fasta
[2022-01-10 09:31:25] Localize all fast5 files.
[2022-01-10 09:31:25] Found 20 fast5 files.
[2022-01-10 09:31:25] Extract sequences from fast5.
Processed fast5 [-------------------------] 0% eta: ?s (elapsed: 00:00:00)
Error in { : task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_00133eac-82f9-4f04-b9c5-9f8271de8220/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.
Any help or advice appreciated.
Further to this, I ran h5ls on my fast5 files and got the structure
/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_001 Group
/Analyses/Basecall_1D_001/BaseCalled_template Group
/Analyses/Basecall_1D_001/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_001/BaseCalled_template/Move Dataset {21255}
/Analyses/Basecall_1D_001/Summary Group
/Analyses/Basecall_1D_001/Summary/basecall_1d_template Group
/Analyses/Segmentation_000 Group
/Analyses/Segmentation_000/Summary Group
/Analyses/Segmentation_001 Group
/Analyses/Segmentation_001/Summary Group
/Analyses/Segmentation_001/Summary/segmentation Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_142597 Group
/Raw/Reads/Read_142597/Signal Dataset {106366/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group
This seems to suggest that the BaseCalled_template group which nanodisco preprocess is looking for sits under Basecall_1D_001 rather than Basecall_1D_000. Fixing this is beyond my expertise, so any suggestions on how to do so would be greatly appreciated.
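To confirm which basecall group actually holds the sequence, the listing above can be scanned mechanically. The following is a minimal sketch that works on the text printed by `h5ls -r`; the helper name `basecall_groups_with_fastq` is hypothetical, and it only assumes the `Analyses/Basecall_1D_NNN/BaseCalled_template/Fastq` naming visible in the listing.

```python
import re

# A basecall group is usable only if it contains BaseCalled_template/Fastq.
FASTQ_RE = re.compile(r"/Analyses/(Basecall_1D_\d{3})/BaseCalled_template/Fastq\b")

def basecall_groups_with_fastq(h5ls_lines):
    """Return the sorted names of Basecall_1D_NNN groups that hold a Fastq dataset."""
    found = set()
    for line in h5ls_lines:
        match = FASTQ_RE.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)

# Relevant lines copied from the h5ls output above.
listing = [
    "/Analyses/Basecall_1D_000 Group",
    "/Analyses/Basecall_1D_000/Summary Group",
    "/Analyses/Basecall_1D_001 Group",
    "/Analyses/Basecall_1D_001/BaseCalled_template Group",
    "/Analyses/Basecall_1D_001/BaseCalled_template/Fastq Dataset {SCALAR}",
    "/Analyses/Basecall_1D_001/BaseCalled_template/Move Dataset {21255}",
]
print(basecall_groups_with_fastq(listing))  # ['Basecall_1D_001']
```

For this file only Basecall_1D_001 qualifies, which is consistent with the "does not exist" error reported for Basecall_1D_000.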
Apologies for the mass of comments, but I am posting as I troubleshoot. I think I have found a workaround for the above by going into extract.R and changing the default basecall group. However, I am now getting the following error message at the nanodisco difference stage:
local:2/0/100%/0.0s Error in .Call2("new_output_filexp", filepath, append, compress, compression_level, :
cannot open file '/home/nanodisco/analysis/difference_BREX-p507_test/tmp.3_4.preprocessed_p507_test/p507_test.fasta'
Calls: prepare.index ... writeXStringSet ->
This issue seems to persist for both the previously basecalled samples which were preprocessing fine and for the post-basecalled samples for which I altered the extract.R file, so I am not sure my modification to the script is the issue.
Again, any help appreciated!
Hi @wentski,
It's been a while since you posted on this thread, but did you find a solution for your issue? Your troubleshooting was correct. The cleanest solution would be to clear the basecalling information from the fast5 files and restart the analysis. This can be done with compress_fast5 --sanitize from the ONT fast5 API.
Feel free to open a new ticket if you still have an issue.
Alan
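As a rough sketch of the sanitizing step suggested above, the snippet below only assembles the command line and prints it for review; it does not run anything. The long option names (`--input_path`, `--save_path`, `--compression`, `--recursive`) are my recollection of the ont_fast5_api command-line help and should be treated as assumptions to check against `compress_fast5 --help`. The output directory name is hypothetical.

```python
import shlex

def build_sanitize_command(input_dir, output_dir):
    """Assemble a compress_fast5 call that strips optional groups such as basecalls.

    Option names are assumed from the ont_fast5_api help text; verify them
    with `compress_fast5 --help` before running the printed command.
    """
    return [
        "compress_fast5",
        "--input_path", str(input_dir),
        "--save_path", str(output_dir),
        "--compression", "gzip",
        "--sanitize",
        "--recursive",
    ]

# "dataset/wga_sanitized" is a made-up output location for illustration.
cmd = build_sanitize_command("dataset/wga", "dataset/wga_sanitized")
print(shlex.join(cmd))
```

After sanitizing, the files would need to be basecalled again with --fast5_out so that the results land in a fresh Basecall_1D_000 group.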
Greetings, I'm currently running into an error in Nanodisco that I can't seem to figure out. I have Nanodisco's Singularity container installed on my Linux workstation, and I'm using this command to begin the workflow on some Geobacillus nanopore data I generated:
nanodisco preprocess -p 30 -f dataset/ -o analysis/preprocessed_subset -r /home/nanodisco/test_Geo.fasta -s Geo
The ETA and the progress bar never change, and the error is thrown after about five minutes.
[2021-09-07 09:15:29] Localize all fast5 files.
[2021-09-07 09:15:29] Found 351 fast5 files.
[2021-09-07 09:15:29] Extract sequences from fast5.
Processed fast5 [-------------------------] 0% eta: ?s (elapsed: 00:00:00)
Error in { : task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_004f4ce5-b3a5-4583-aa81-6e7acf854000/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.
Any assistance would be greatly appreciated.