fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.
Other
66 stars 7 forks source link

nanodisco difference: object 'path_basecalling' not found #7

Closed mloleary-0 closed 2 years ago

mloleary-0 commented 3 years ago

Hi,

This looks like a great tool and I'm excited to use it on my data. I was encountering similar issues to #4 and #5, made the suggested modifications (replacing extract.R, preprocess.sh, and difference_functions.R), and successfully preprocessed my data using --basecall_version with reads called by guppy versions 3.6.1.

However, when I run:

nanodisco difference -nj 2 -nc 1 -p 2 -f 100 -l 110 -i analysis/preprocessed_subset -o analysis/difference_subset -w SL_WGA -n SL_WGS -r reference/SL.fasta

I get this error for each chunk:

Error in h5readAttributes(path_first_fast5, path_basecalling) :
   object 'path_basecalling' not found
Calls: check.basecall.version ... extract.basecall.version -> h5readAttributes -> H5Lexists
Execution halted

Any insight into this error? I may have something configured incorrectly, but I can't figure it out. I re-called my reads with guppy version 4.0.15 and repeated the process ( specifying --basecall_version Guppy:4.0.15 ), but that doesn't seem to make a difference.

Thanks!

touala commented 3 years ago

Hello @mloleary-0,

Thank you for your interests in using our tool and finding it helpful.

The issue appended in a sanity check function intended to warn users when different basecalling versions are used on the native and the WGA reads. Thus, this is not a critical function if nanodisco preprocess was run on the same fast5 basecalling version. It appears that the issue is due to the basecalling information being in an unexpected location within the fast5 files.

To help you resolve the issue ASAP, I've made a temporary fix that bypasses it and assigns placeholder values as basecalling information (i.e. <basecaller:version> as Temporary:v0.0.0). Four files need to be replaced (including #4 and #5 fixes): extract.R, preprocess.sh, difference_chunks.R, and difference_functions.R. You can find them here and integrate them to the container by doing:

wget https://github.com/fanglab/nanodisco/files/5126169/nanodisco_quick_solution.zip # Download .zip mentioned above
unzip nanodisco_quick_solution.zip

# Create a writable temporary container (directory) named nd_tmp, ~5 min
singularity build --sandbox nd_tmp nanodisco.sif 

mv extract.R preprocess.sh difference_functions.R difference_chunks.R nd_tmp/home/nanodisco/code # Replace function with new feature
chmod 755 nd_tmp/home/nanodisco/code/* # Set proper permission

# Create a new container with the additional feature
singularity build nd_env nd_tmp

I hope to test the commands more extensively but I don't know what was the exact fast5 file structure that caused the issue. If you would like to share one of your fast5, I am very happy to help you test further.

In any case, if you meet any additional challenges, or have any questions. So please feel free to let me know.

More generally, I will also try to make the aforementioned function more robust for similar use cases.

Alan

mloleary-0 commented 3 years ago

Hi Alan,

Thanks for your speedy reply. I integrated the new files into the container and took my data to the end of the nanodisco motif (which was my goal). While it did identify five motifs (including one with wet-lab support!) and generate two plots (RefineMotifs and Refine_Motifsscores ) per motif, it did not produce a .csv file, and gave the following error:

[2020-08-26 22:02:26] Discovered motifs: GACNNNNNGTG GCCAGG TTCGAA CACNNNNNGTC GGANNNNNNNAATC.
[1] "Unexpected error."
<simpleError in automated && is.null(potential_motifs): invalid 'x' type in 'x && y'>
Final results:
NA
[2020-08-26 22:02:26] Done.

I've attached one fast5 file from each run, to help with troubleshooting.

fast5s_08272020.zip

Regarding nanodisco motif, is there a way to adjust the number of motifs that are returned? I am unclear how the script determines the number of motifs to look at.

Thanks!

Mike

touala commented 3 years ago

Hi Mike,

We are happy that you already found some of the methylation motifs. Thank you again for providing this useful feedback as it helps us make our tool more robust and generally applicable.

The nanodisco motif command will cycle through multiple rounds of motif detection until no significant motif can be found. Each round, up to 5 motifs can be detected from the top candidate regions. During following rounds, the signal coming from all previously discovered motifs is removed so that new candidate regions are examined and new motifs can be discovered.

Regarding the remaining issue you described, we have a solution in mind and have already started implementation. Please note that it is likely that the motif detection you performed didn't run to completion due to this issue. We will let you know as soon as it is updated.

Regarding your previous issue, it appears that fast5 files from the WGS run contain two base calling directories, of which one is empty. I'm not sure how it got there, maybe due to an aborted base calling... The fixed nanodisco difference command shared in my previous response shouldn't be affected by this. I will reinforce the check.basecall.version function to handle this specific case in another update and will also let you know.

Thank you again for your interest in nanodisco,

Alan

touala commented 3 years ago

Hi Mike,

I have implemented the fix which I think can address the issue you raised. Two files need to be replaced: motif.R, and analysis_function.R. You can find them here and integrate them to the container like previously by doing:

wget https://github.com/fanglab/nanodisco/files/5146346/nanodisco_motif_fix.zip # Download .zip mentioned above
unzip nanodisco_motif_fix.zip

# Create a writable temporary container (directory) named nd_tmp, ~5 min
singularity build --sandbox nd_tmp nanodisco.sif 

mv motif.R analysis_function.R nd_tmp/home/nanodisco/code # Replace function with new feature
chmod 755 nd_tmp/home/nanodisco/code/* # Set proper permission

# Create a new container with the additional feature
singularity build nd_env nd_tmp

Please let me know if this fixes the issue you encounter.

I've also reworked the temporary fix and reinforced the sanity check function to better handle rare fast5 structures. The seven files to be replaced can be found here. This fix contains all previous ones (including the one related to motif detection). You can integrate it as described previously (see below).

wget https://github.com/fanglab/nanodisco/files/5146359/nanodisco_clean_solution.zip # Download .zip mentioned above
unzip nanodisco_clean_solution.zip

# Create a writable temporary container (directory) named nd_tmp, ~5 min
singularity build --sandbox nd_tmp nanodisco.sif 

mv analysis_functions.R  difference.sh  difference_chunks.R  difference_functions.R  extract.R  motif.R  preprocess.sh nd_tmp/home/nanodisco/code # Replace function with new feature
chmod 755 nd_tmp/home/nanodisco/code/* # Set proper permission

# Create a new container with the additional feature
singularity build nd_env nd_tmp

I tried to test the commands extensively, although there might be some rare edge cases I overlooked. In any case, I am very happy to help you test further if you meet any additional challenges, or have any questions. So please feel free to let me know.

Alan

mloleary-0 commented 3 years ago

Hi Alan,

Thanks for the update. I tried both fixes and they let nanodisco motif finish without error.

One possible issue: While refinement is asked for and plots are made for the first set of five motifs, the only motifs listed as identified in a second and third round are the same as the first set, whereas the actual output of meme (in folders meme_2 and meme_3 ) is different from the first set. No prompt is generated to refine any motifs (new or old), and no new plots are made in folders meme_2 or meme_3.

Is this expected behavior? The evalue and number of sites for the motifs in the second round are much lower (less than 1e-30 and 80 respectively), so I would understand if they are being discarded. However, one is clearly a reverse complement of an asymmetric motif for the first round, so it would be useful to see refinement plots. I realize this could be a limit of the data I'm using, rather than an issue with nanodisco. In either event, I am generating another set of fast5s to test and compare.

Mike

touala commented 3 years ago

It's great knowing the fixes were successful.

From your description, I guess you are already using the "manual" mode for the motif detection procedure (without the –a option). However, if the -a option was used, please note that we opted for a conservative approach when considering the motifs to be "discovered" and we recommend using the manual approach for in-depth motif discovery. Therefore, the behaviors observed (i.e. no plots) for round 2 & 3 of motif detection is not unexpected when none of metrics computed for the putative motifs are above the thresholds.

In your situation, it indeed makes sense to check the refine plot for the matching reverse complement motifs. More generally, we observed in our study that methylation motifs with low number of occurrences as well as bipartite motifs (i.e. motifs with two specificity regions, like A6mACNNNNNNGTGC in E. coli K-12 MG1655) are more difficult to detect (discussed in our preprint). In practice, generating refine plots from R in interactive mode is helpful but it’s a function currently not documented. We will provide additional instructions of using this refinement module in future update (see nanodisco refine below).

Alternatively, the nanodisco characterize command can be used with the inclusion of all discovered motifs, plus the additional motifs that a user is interested in or hypothesized. The command will generate refine plots and motif centers in its first step and it can be canceled when the message "Classify motif(s) with model" is displayed to avoid the unnecessary typing and fine mapping of the motifs. Please see the example below or the more detailed entry in our tutorial.

nanodisco characterize -p 4 -b <path_difference> -o <path_output> -m <motif1,motif2,..> -t knn -r <path_reference_genome>

Please note that if some motifs were known a priori or were found outside of the nanodisco motif command, it could be useful to run the command once again using the development option, -m <motif1,motif2,..>, directly removing signals explained by those motifs.

Considering that these solutions may not be easiest to use, we will include in the future a command similar to nanodisco characterize so that refine plots can directly be produced with a single command (e.g. nanodisco refine -p <nb_threads> -b <name_output> -d <path_difference> -o <path_output> -M <motif_to_refine> -m <discovered_motifs> -r <path_genome>).

You mentioned planning to generate more fast5s, if it's for the same sample, it will likely help methylation discovery. We often see major improvements when coverage increases over ~100x and it continues to improve as the coverage goes to ~200x (the maximum we tested so far).

Thank you again for the feedback. Please do not hesitate to ask if you have any additional questions and I will try my best to help.

Alan

mloleary-0 commented 3 years ago

Hi Alan,

Thanks for the information and the update. The additional fast5s were for other samples, but still worked great (with v1.0.0, haven't tried v1.0.1 yet). Since you mentioned bipartite motifs, you may be happy to hear nanodisco picked up the expected number of distinct bipartite motifs (2-3) in each of the six strains I analyzed. I'd certainly like to run nanodisco characterize on this data set, but do not have albacore basecalls.

Regarding sequencing depth in your preprint, I am understanding correctly that each level you checked uses the same depth for both native and amplified DNA (e.g, 100x is 100x for both native and amplified samples)? For my data, the native DNA samples range from ~100-300x mean depth, whereas the amplified samples are ~60-80x. I'm not clear if generating additional amplified sequence data would provide a substantial benefit in this case; what do you think?

Thanks, Mike

touala commented 3 years ago

Hi Mike,

We are happy to hear that all the expected bipartite motifs were detected. For motif characterization, although the current classification model was trained on Albacore data, we would like to note that great empirical results were obtained from our recent testing on multiple datasets basecalled with Guppy. So, we would suggest you to give the classification a try, while being critical about the results. You may also want to consolidate the characterization results from the three models (i.e. nn, knn, and rf).

Regarding your question about coverage, your understanding is correct and both native and WGA were used at the same overall coverage in our method evaluation. Although the impact of coverage increase from 60x to >100x is not easy to estimate for a particular genome, adding some more WGA data will generally improve the current difference signal and increase power.

In addition, you may find this helpful: In v1.0.1, we implemented the refine plot command mentioned earlier in our discussion. Now you can generate the plots for all discovered motifs by doing: nanodisco refine -p <nb_threads> -b <name_output> -M all -m <motif1,motif2,...> -d <path_difference> -o <path_output> -r <path_genome>.

Thank you again for your interests in our tool and your helpful feedback. Please feel free to let me know if you have any other questions.

Regards,

Alan