fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.
Other
67 stars 7 forks source link

nanodisco difference can't find the proper basecalling version #5

Closed Nevotii closed 3 years ago

Nevotii commented 4 years ago

Hi!

Thanks for nanodisco, very nice tool!

Running the program in your example dataset works perfectly in my computer. However, when I try it in my data I am getting the following error:

Command

nanodisco difference -nj 5 -nc 1 -p 5 -f 1 -l 513 -i nanodisco/analysis/preprocessed_subset -o nanodisco/analysis/difference_subset -w barcode01 -n barcode07 -r nanodisco/reference/KPA.fasta

Output

local:5/56/100%/2.2s Error in if (grepl("Albacore", f5_data$name)) { :                                                    
argument is of length zero                                                                                            
Calls: check.basecall.version -> extract.basecall.version
Execution halted

The output happens over and over again for all the jobs.

My fast5 were basecalled with guppy 4.0.11+f1071ce and the reads' sequence is extracted without problem with nanodisco preprocess

Any idea what might be happening?

touala commented 4 years ago

Hello @Nevotii,

Thank you for your kind words about this tool.

As you deduced, it seems that nanodisco didn't automatically identify the base calling version from your fast5. I'm suspecting an incompatibility with the newer Guppy version but I will need to investigate further.

I'll try to reproduce it on my side ASAP but, in case I can't, could you report what nanodisco preprocess commands were used for processing both samples (i.e. barcode01 and barcode07)?

Also, do you mind testing the following commands and reporting the outputs (<5 min)?

# Launch nanodisco container in interactive mode
# singularity build --sandbox nd_example nanodisco.sif # Execute only if you don't have the "nd_example" container from the "Tool showcase" anymore 
singularity run --no-home -w nd_example
R # Start R from container
# From within R environment
library(rhdf5)
path_first_fast5 <- "<path_fast5_file>" # Pick one fast5 at random
f5_content <- h5ls(path_first_fast5)

if(any(grepl("^/read_.*/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE)){
    # This is a multi-read fast5 file
    path_first_read <- f5_content$group[grepl("^/read_.*/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)][1]
    path_basecalling <- gsub("/BaseCalled_template","",path_first_read)
}else if(any(grepl("^/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE)){
    # This is NOT a multi-read fast5 file
    path_basecalling <- "/Analyses/Basecall_1D_000"
}

f5_data <- h5readAttributes(path_first_fast5, path_basecalling) # Extract read data (fastq, move, and trace)

# Write outputs
print(subset(f5_content, grepl(gsub("/(.*)/Analyses.*","\\1",path_first_read),group)))
print(any(grepl("^/read_.*/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE))
print(any(grepl("^/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE))
print(path_first_read)
print(path_basecalling)
print(as.data.frame(f5_data))

Alan

Nevotii commented 4 years ago

Thanks for your response!

The nanodisco preprocess commands used were:

nanodisco preprocess -p 4 -f fast5_single2/barcode01_subset -s barcode01 -o nanodisco/analysis/preprocessed_subset -r /scratch/lab_mguell/projects/shared_data/RefGenomes/fastas/Pacnes_KPA_RefGenome.fasta
nanodisco preprocess -p 4 -f fast5_single2/barcode07 -s barcode07 -o nanodisco/analysis/preprocessed_subset -r /scratch/lab_mguell/projects/shared_data/RefGenomes/fastas/Pacnes_KPA_RefGenome.fasta

Running your piece of code on a random fast5 I get the following (Note: all my reads are single fast5)

> library(rhdf5)
> '/gpfs42/projects/lab_mguell/shared_data/methylation/nanopore_full/fast5_single2/barcode01_subset/0ff0261e-72c8-4d58-b
659-f59fe6f8ddcd.fast5'
> path_first_fast5 <- '/gpfs42/projects/lab_mguell/shared_data/methylation/nanopore_full/fast5_single2/barcode01_subset/0ff0261e-72c8-4d58-b659-f59fe6f8ddcd.fast5'
> f5_content <- h5ls(path_first_fast5)
> if(any(grepl("^/read_.*/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE)){
+ # This is a multi-read fast5 file
+ path_first_read <- f5_content$group[grepl("^/read_.*/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)][1]
+ path_basecalling <- gsub("/BaseCalled_template","",path_first_read)
+ }else if(any(grepl("^/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE)){
+ # This is NOT a multi-read fast5 file
+ path_basecalling <- "/Analyses/Basecall_1D_000"
+ }
>
> f5_data <- h5readAttributes(path_first_fast5, path_basecalling) # Extract read data (fastq, move, and trace)
>
> # Write outputs
> print(subset(f5_content, grepl(gsub("/(.*)/Analyses.*","\\1",path_first_read),group)))
Error in gsub("/(.*)/Analyses.*", "\\1", path_first_read) :
  object 'path_first_read' not found
> print(any(grepl("^/read_.*/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE))
[1] FALSE
> print(any(grepl("^/Analyses/Basecall_1D_000/BaseCalled_template",f5_content$group, perl=TRUE)==TRUE))
[1] TRUE
> print(path_first_read)
Error in print(path_first_read) : object 'path_first_read' not found                                                                                                        > print(path_basecalling)                                                                                                                                                   [1] "/Analyses/Basecall_1D_000"                                                                                                                                             > print(as.data.frame(f5_data))                                                                                                                                             data frame with 0 columns and 0 rows  

Attach you can find 5 of my fast5 files just in case they are useful for you to reproduce the error. fast5.zip

Please let me know if I can help with anything else

touala commented 4 years ago

Thank you for sending some of your fast5. I think I found the issue which seems to be due to the base calling specificity (please see below).

Looking at the fast5, I see that the base caller information is not where it is usually at (i.e. not in /Analyses/Basecall_1D_000 for single read). It also seems that those fast5s only contain the information from the live base calling done during the sequencing run by MinKNOW and not from an independent Guppy base calling. Also the Guppy version seems to be 3.2.10+d9445b2 and not 4.0.11+f1071ce.

One way to support this in general is: I will improve the function to handle fast5s with live base calling. Also, if you decide to rerun Guppy with a more recent version, you would also need to use a development option (--basecall_version <basecaller:version>, e.g. Guppy:4.0.11; not available in the main release) to select which of the two base called sequences you want to use in nanodisco preprocess.

Please bear with me while I update the function. I will get back to you with the new code ASAP, and let you know.

Alan

touala commented 4 years ago

I have implemented the fix which I think can address the issue you raised. Three files need to be replaced: extract.R, preprocess.sh, and difference_functions.R. You can find them here and integrate them to the container by doing:

wget https://github.com/fanglab/nanodisco/files/5024806/nanodisco_feature.zip # Download .zip mentioned above
unzip nanodisco_feature.zip

# Create a writable temporary container (directory) named nd_tmp, ~5 min
singularity build --sandbox nd_tmp nanodisco.sif 

mv extract.R preprocess.sh difference_functions.R nd_tmp/home/nanodisco/code # Replace function with new feature
chmod 755 nd_tmp/home/nanodisco/code/* # Set proper permission

# Create a new container with the additional feature
singularity build nd_env nd_tmp

I would recommend to run nanodisco preprocess again because, given the likely issue, it should have outputted an error message too.

If you decide to rebasecall the fast5, you can now provide the --basecall_version option (<basecaller:version>) to specify which basecalling version you want to use (e.g. Guppy:4.0.14 or Albacore:2.3.4). nanodisco preprocess can be executed as follow:

nanodisco preprocess -p <nb_threads> -f <path_fast5> -s <name_sample> -o <path_output> -r <path_reference_genome> --basecall_version <basecaller:version>

I hope to test the commands more extensively, but I currently have limited access to fast5 with the live basecalling file structure (except yours). I am very happy to help you test further if you meet any additional challenges, or have any questions. So please feel free to let me know.

Alan

Nevotii commented 4 years ago

Hey Alan,

Sorry for the delay in the response but your modifications work just fine now on my fast5 and I could perform the entire analysis.

Thank you very much for your help.