fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.
Other
66 stars 7 forks source link

Problem reading the basecaller information #23

Closed mtisza1 closed 1 year ago

mtisza1 commented 3 years ago

Hi Alan,

I'm having issues with Nanodisco reading the basecaller version on my reads. I'm using nanodisco version 1.0.2 on the singularity container. If I run the each step of the pipeline without specifying --basecall_version, it will run through preprocess, difference, merge and motif without error (and the motif results seem correct based on other data), but then, when I run characterize I get this error

[2021-06-21 14:58:04] Load supplied current differences.
[2021-06-21 14:58:04] Check current differences file version.
Error in if (!basecaller %in% c("Albacore", "Guppy")) { : 
  argument is of length zero
Calls: check.model.version
Execution halted

On the other hand, if I run from the beginning using the flag --basecall_version Guppy:4.0.11 on preprocess I get:

Parameter --basecall_version don't match available basecalling. Please provide the basecaller name and version (e.g. Guppy:3.2.4).
Available basecalling are displayed below.
  basecaller version  basecall_group
1    Unknown   4.0.5 Basecall_1D_000

Specifying --basecall_version Unknown:4.0.5 does not resolve the issue.

Essentially, it seems the nanodisco scripts are reading the "MinKNOW-Live-Basecalling" information in the fast5 file (which is near the top of the file) and ignoring the guppy information (which is near the bottom of the file). I don't code R at all, so I can't parse through your scripts to figure out what nanodisco is actually doing, however.

I've attached the first read from a set of fast5 files for a run that this error occurred. These reads were live basecalled on the gridion, and the fastq files all look good from a length and quality standpoint. I hope it will help get to the bottom of this, but I am happy to provide more information.

I'd really love to get this resolved so I can use your tool! Thanks in advance.

0009d4a2-1f1f-4b8d-a7eb-a75d6fa1c401.fast5.zip

mtisza1 commented 3 years ago

OK, so I am looking at your code a little bit and in the extract.R script, lines 16-40, I can see that it is looking for the strings "Guppy" or "Albacore". For whatever reason, my version of the gridION software or minknow is only specifying the guppy version with the attribute "guppy_version". I wonder if this could be solved by changing this part of the code.

touala commented 3 years ago

Hi Mike,

Thank you for providing a detailed description of the issue.

You're definitely on the right track. I've made changes related to live base calling before (in #5) but the fast5 output format might have evolved since.

f5_data <- h5readAttributes(path_first_fast5, path_basecalling) # Extract read data (fastq, move, and trace)
# Is not length(f5_data)==0 now

I'll investigate and fix it, but meanwhile you should be able to bypass this sanity check by modifying the current differences output from nanodisco merge. This won't affect the results. Start R in interactive mode and execute the following commands with proper paths.

nanodisco_res <- readRDS("<path_to_current.RDS>")
attr(nanodisco_res, "basecaller") # Could you also report the output from this command?
attr(nanodisco_res, "version") # Could you also report the output from this command?
attr(nanodisco_res, "basecaller") <- "Guppy"
attr(nanodisco_res, "version") <- "4.0.11"
saveRDS(nanodisco_res,  "<path_to_current_fixed.RDS>")

Please let me know if this is working or if you cannot use R altogether.

Alan

mtisza1 commented 3 years ago

Hi Alan,

Thanks so much for the speedy reply! It is working now, and I got expected outputs from characterize

Here are the outputs you asked for (this is from analyses not using --basecall_version flag)...

Before changing anything:

> attr(nanodisco_res, "basecaller")
NULL
> attr(nanodisco_res, "version")
NULL

After changing the .RDS file:

> attr(nanodisco_res, "basecaller")
[1] "Guppy"
> attr(nanodisco_res, "version")
[1] "4.0.11"

I suspect that ONT has slightly different fast5 formats between minION, gridION, and promethION, as well as slightly different formats for different versions of the software. Personally, I don't envy that you work extensively with this file format.

Cheers,

Mike

touala commented 3 years ago

Hi Mike,

Thank you for reporting the results. I made new fixes that should better handle those live-basecalled reads. You can give it a try with:

singularity pull nanodisco.v1.0.3_dev.sif docker://fanglab/nanodisco:dev

Do you mind running nanodisco difference on one chunk and report again the attr values? I tried on the fast5 you send earlier and it reported Guppy_live and 4.0.11+f1071ce as expected but I'd like to be sure it's ok for all reads.

Thanks.

Alan

P.S.: I'll get back to you about PCR Barcoding later, but thank you very much for the report.

mtisza1 commented 2 years ago

OK, Alan, thanks for the update. I did what you requested, and it appears to be working:

> nanodisco_res <- readRDS("chunk.102.difference.rds")
> attr(nanodisco_res, "basecaller")
[1] "Guppy_live"
> attr(nanodisco_res, "version")
[1] "4.0.11+f1071ce"

I'll let you know if anything seems to go wrong downstream of this.

P.S. Do you foresee any issue in the compatibility of Nanodisco and Guppy version 5?

touala commented 2 years ago

Thank you very much.

I completed a couple of analysis of 5.0.7 datasets (differed base calling) with success. Looking at live-basecalling file structure, I don't see differences so this nanodisco dev version (v1.0.3_dev) should work. However, please note that it's not fully compatible with singularity due to the conversion from docker. I'll try to push the complete version (v1.0.3) today.

However, the new fast5 compression solution (i.e. vbz) broke the compatibility with nanopolish's version I'm currently using. I'm doing some testing with the latest nanopolish version which handle it, as I need to check that the motif typing and fine mapping is not affected (it shouldn't). In the meantime, you can use ont-fast5-api to convert the fast5 encoding to gzip with:

compress_fast5 -t 20 --recursive -i <input_vbz.fast5> -s <input_gzip.fast5> -c gzip

This is not ideal but hopefully I can get v1.0.4 out soon enough.

Alan