fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.
Other
66 stars 7 forks source link

nanodisco preprocess ERROR #49

Open carlferr opened 1 year ago

carlferr commented 1 year ago

Hi, I'm using nanodisco to analyse methylation profile from bacteria isolates. I installed singularity (version 3.5.3) and nanodisco (v1.0.3), then I tried to import my fast5 files in nanodisco workflow.

here the CL:

singularity pull --name nanodisco.sif library://fanglab/default/nanodisco
singularity verify nanodisco.sif
Verifying partition: FS:
45AD365F84BDF7402CC3CA83F93AC2888FC02443
[REMOTE]  Alan Tourancheau <alan.tourancheau@gmail.com>
[OK]      Data integrity verified
INFO:    Container verified: nanodisco.sif
singularity build --sandbox nd_env nanodisco.sif

It seems that everything was fine with the software. Then I tried to run preprocess command but there was an error:

root@nanodisco:~$ nanodisco preprocess -p 4 -f dataset/test/ -s A_BAU_01 -o analysis/preprocessed_subset -r reference/GCF_008632635.1_ASM863263v1_genomic.fna
[2022-08-05 15:23:02] Localize all fast5 files.
[2022-08-05 15:23:02]     Found 1 fast5 files.
[2022-08-05 15:23:02] Extract sequences from fast5.
 Processed fast5 [-------------------------]   0% eta:  ?s (elapsed: 00:00:00)Error in { : 
  task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_000272ba-22a8-4660-bdcd-546de7a4c075/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.

When I run h5ls command on my data (outside nanodisco ), I get following output:

(base) [root@localhost EC_NAT]# h5ls /mnt/NFS_SHARE_17/A.baumannii_FRANCI/fast5_pass/barcode01/FAQ13937_pass_barcode01_9525cb6c_0.fast5 
read_000272ba-22a8-4660-bdcd-546de7a4c075 Group
read_001c24e0-a0a6-4893-80b0-b99a8315f045 Group
read_001f353e-ad41-4873-8110-21d5efe4d469 Group
read_0024dd71-2d5c-401a-8f76-8740cb5d9b04 Group
read_002edada-f555-4be4-9c80-e412549d8f3e Group
read_00389471-67f0-4142-9f72-a03a0cc33529 Group
read_004598b6-2d1d-40f7-9361-291926673138 Group

[...] and so on for each read stored in fast5 file.

If I run the same command on fast5 files of nanodisco tutorial (obtained via get_data_bacteria in home/nanodisco/dataset/EC_NAT/ folder ) I obtain:

(base) [root@localhost EC_NAT]# h5ls EC_NAT.read1.fast5 
Analyses                 Group
Raw                      Group
UniqueGlobalKey          Group

It seems that I have different file structure based on h5ls command. How can I obtain same file structure? Could you help me to solve this issue? Thanks in advance.

touala commented 1 year ago

Hi @carlferr,

The difference in file structure is due to the various types of fast5 existing, single vs multiple reads per file. But this should not be a problem as nanodisco can handle both types. To push the diagnosis further, I would use h5dump or HDFView to look at the exact structure in your fast5, especially at the /read_000272ba-22a8-4660-bdcd-546de7a4c075/Analyses/Basecall_1D_000/BaseCalled_template path.

Alan

tongzhouxu commented 1 year ago

Hi @carlferr ,

I don't know if you have resolved the issue but I had the same error message and then I realised I used a raw fast5 file with singals only. I think you need to use the basecalled fast5 files.

Tongzhou

starkeynight commented 1 year ago

Just wanted to follow up on Tongzhou's comment - do the basecalled fast5 files need to be used?

touala commented 1 year ago

Hi @starkeynight,

Yes, you need to use the basecalled fast5 files generated with the --fast5_out option.

Alan

Arkadiy-Garber commented 1 year ago

HI,

I seem to be running into similar issues with this.

What program is to be used in which the --fast5_out option can be used? Is that guppy? Is that possible to install on a local server, or does this need to be set on the nanopore machine? Is it possible to basecall fast5 files that already have been generated?

Thanks! Arkadiy

touala commented 1 year ago

Hi @Arkadiy-Garber,

Yes, --fast5_out is an option for guppy up to version v6.3.8. You can install guppy on any machine but would need access to a GPU for the best performance. Yes, all fast5 can be rebasecall at will as the raw sequencing signal is conserved (i.e. variation in current across pore).

Alan

Arkadiy-Garber commented 1 year ago

Excellent, thanks Alan! Unfortunately, we do not have a GPU, but have plenty of CPUs. Do you think this will be possible, or will it take too long to do without a GPU?

Thanks again, Arkadiy

touala commented 1 year ago

I believe this is doable but I have no experience with CPU basecalling of nanopore data. If you have a lot of CPU ressources, maybe try to parallelize as much as possible, maybe even by splitting fast5 into subsets... Let me know how it goes.

Alan

awilson0 commented 1 year ago

I've experienced this same issue as well, on fast5 files that were basecalled with Guppy v6.3.8. When I use h5dump to look at the problematic read, I see that the "/read_00035d06-15d4-46c6-8538-38e8a2645fa2/Analyses/Basecall_1D_000/BaseCalled_template" field exists and contains fastq-formatted data. I've attached the h5dump output to help with troubleshooting.

read_00035d06-15d4-46c6-8538-38e8a2645fa2.txt

touala commented 1 year ago

Hello @awilson0,

Thank you for reaching out and providing a file for troubleshooting. Could you take a look at this example of analysis with Guppy 6.3.8 to see if a step can fix your issue?

Best,

Alan

awilson0 commented 1 year ago

@touala, I ran the same basecalling/compression commands in that example and nanodisco preprocess still gave the same error.

Error in { : 
  task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_00035d06-15d4-46c6-8538-38e8a2645fa2/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
touala commented 1 year ago

@awilson0, do you mind sending me some of the fast5 with issue (alan.tourancheau@bio.ens.psl.eu)? I don't see a reason right now, and it would help me diagnose the problem.

Best,

Alan

touala commented 1 year ago

Hello @awilson0,

Thanks again for sharing some of the problematic data.

In your subset of reads, only the WGA dataset triggered an error from nanodisco preprocess.

[2023-04-13 12:40:49] Localize all fast5 files.
[2023-04-13 12:40:49]     Found 1 fast5 files.
[2023-04-13 12:40:49] Extract sequences from fast5.
 Processed fast5 [-------------------------]   0% eta:  ?s (elapsed: 00:00:00)Error in { :
  task 1 failed - "task 1 failed - "task 1 failed - "Object '/read_000922a6-2388-4602-a2b2-7ed45d4ef43a/Analyses/Basecall_1D_000/BaseCalled_template' does not exist in this HDF5 file."""
Calls: extract.sequence -> %dopar% -> <Anonymous>
Execution halted
Unexpected error during read extraction process.

Looking at the fast5 with HDFView, I found that the underlying issue is that one of the read in the fast5 was not basecalled... I've sanitized the fast5 and rebasecalled it which fixed the error from nanodisco.

compress_fast5 -t 9 --recursive --sanitize -c gzip -i WGAfast5_basecalled -s WGAfast5_sanitized
/path/to/guppy/6.3.8/bin/guppy_basecaller --disable_pings --gpu_runners_per_device 8 --chunks_per_runner 1024 --chunks_per_caller 10000 --num_callers 5 --cpu_threads_per_caller 1 --recursive -i WGAfast5_sanitized --fast5_out -s WGAfast5_basecalled2 -c dna_r9.4.1_450bps_sup.cfg --device cuda:0
singularity exec nanodisco.sif nanodisco preprocess -p 10 -f WGAfast5_basecalled2 -s wga -o results/preprocessed -r svc1m_polished_assembly.fasta
[2023-04-13 14:34:58] Localize all fast5 files.
[2023-04-13 14:34:58]     Found 1 fast5 files.
[2023-04-13 14:34:58] Extract sequences from fast5.
[2023-04-13 14:37:06] All fast5 files processed.
[2023-04-13 14:37:07] Map reads.
[2023-04-13 14:37:31] Index alignment.
[2023-04-13 14:37:31] Done.

But it looks like you would've catch it given this comment. So there might be other problems I've not seen. Would you be able to sanitize and basecall your dataset again?

Best,

Alan

awilson0 commented 1 year ago

Okay, I figured out what was happening. Somehow, I ended up with the same read in two files. When I basecalled the files individually, I didn't have any issues, but when I basecalled the entire dataset, I would get:

Error loading read "read_id" from file "file": Duplicate read ID detected, skipping second instance of this read ID

I couldn't find a guppy option to exclude skipped reads from the --fast5_out output, but I wrote a quick Python script to remove them.

touala commented 1 year ago

Great to hear. If the problem persist for this set of files you can also use fast5_subset from ont-fast5-api to subset the fast5 and filter out the problematic reads.

Thank you for the feedback.