fanglab / nanodisco

nanodisco: a toolbox for discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiomes using nanopore sequencing.

Nanodisco difference incomplete chunk analysis #32

Closed wentski closed 1 year ago

wentski commented 2 years ago

Hi,

When running nanodisco difference on my files, it fails to complete processing for some of the chunks. The stdout file shows the analysis proceeding to the "removing outliers" phase, but it doesn't seem to progress beyond this and no difference.rds file is generated. Because the large temporary files produced are never removed, they build up and eventually fill all available storage on the hard drive, halting the analysis. If I run a single chunk that previously failed on its own, it produces all of the temporary files and treats the job as done, then the command ends without the actual output .rds file. The command I am running is as follows:

nanodisco difference -nj 4 -nc 1 -p 12 -i analysis/preprocessed_subset -o analysis/difference_subset -w DH5 -n DH5_Sal_BREX -r reference/DH5_reference_genome.fasta

As I say, it seems to process some chunks but not others (maybe around 2% fail). I originally thought this might be a coverage issue, as some of the regions involved seemed to be around the end of the reference genome, but other implicated sites seem to have good coverage.

The analysis itself seems to run fine and correctly picks out modified motifs, so this isn't a game-breaking issue as such. The main problem is that the temporary files take up many GB of storage, to the point that the analysis halts. I then have to go through and manually delete the temporary files and restart from where it stopped.

I'm not sure if there is something that I am doing wrong or if there is anything that can be done to change this. Any advice or help appreciated.

Thanks

touala commented 2 years ago

Hi @wentski,

Thank you for this useful feedback, and I'm glad you can get modified motifs out of nanodisco anyway.

I'm puzzled by your main issue. I must have analyzed close to 100 samples, but I've never seen the chunk differences fail to be generated. The most common issue is running out of memory when the coverage is too high for the default setup I use to run jobs (I'm using IBM's LSF on our HPC rather than the -nj 1 option). From the output, you see that the analysis freezes at the "removing outliers" phase and the next chunk is never processed, right? Could you share a subset of the log files generated by the commands (stdout and stderr)?
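For context, my submissions look roughly like this (a sketch only, not a recommendation for your setup; the memory request, log paths, and chunk index 42 are placeholders, and -M units depend on your cluster's LSF configuration):

    mkdir -p logs
    # One chunk per LSF job, rather than relying on nanodisco's -nj parallelism.
    bsub -n 12 -M 16000 \
      -o logs/difference_chunk42.out -e logs/difference_chunk42.err \
      "nanodisco difference -nj 1 -nc 1 -p 12 -f 42 -l 42 \
        -i analysis/preprocessed_subset -o analysis/difference_subset \
        -w DH5 -n DH5_Sal_BREX -r reference/DH5_reference_genome.fasta"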

Do I understand correctly that the problematic chunks cannot be processed with -nj 1 -f <chunk_idx> -l <chunk_idx> either? Sometimes an empty .rds file is generated (< 50 KB) when no data is left after the various steps. This can happen when no reads are mapped in the native or WGA sample, but a failed step could produce one too. Is this something you observed? Would it also be possible to visualize the read mapping for one of the problematic chunks with IGV, looking for uneven or high coverage?
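For the single-chunk test, something like this should work (a sketch; chunk index 42 is arbitrary, and the exact per-chunk .rds naming may differ from what find reports on your setup):

    # Re-run a single problematic chunk in isolation.
    nanodisco difference -nj 1 -nc 1 -p 12 -f 42 -l 42 \
      -i analysis/preprocessed_subset -o analysis/difference_subset \
      -w DH5 -n DH5_Sal_BREX -r reference/DH5_reference_genome.fasta

    # List any chunk .rds files that look empty (< 50 KB), i.e. no data left after filtering.
    find analysis/difference_subset -name '*.rds' -size -50k -ls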

Regarding the temporary file accumulation issue, after a run with an out-of-memory failure I usually run something like find ./nanodisco_output_dir/ -name 'tmp*' to find and remove them.
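Concretely (a sketch; the path is whatever you passed to -o, and please double-check the listing before deleting anything):

    # List leftover temporary files from failed or killed chunk jobs...
    find analysis/difference_subset -name 'tmp*' -ls
    # ...then remove them once nothing is still running.
    find analysis/difference_subset -name 'tmp*' -delete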

Regards,

Alan

wentski commented 2 years ago

Hi Alan,

Thanks for the reply. I can't see anything consistent in the coverage on IGV; the chunks that fail incorporate low-coverage, high-coverage, and mixed-coverage regions. I suspect it isn't a memory issue either, as running one of the problematic chunks on its own encounters the same problem in the same way as running jobs in parallel. And even then, the job still seems to "finish" as long as it doesn't run out of storage space. I have attached an example of one of the stdout files. I have noticed that some chunks stop on "removing outliers" while others stop on "Normalising". The stderr files are completely empty in all cases.
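If a numeric summary would help, I can pull one with something like this (just a sketch; the BAM paths and the region are placeholders, since I'm not sure of the exact chunk boundaries nanodisco uses):

    # Mean depth over an example region for the native and WGA alignments.
    samtools depth -a -r contig_1:1000000-1005000 native.bam wga.bam \
      | awk '{n++; s3+=$3; s4+=$4} END {print "native:", s3/n, "WGA:", s4/n}'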

As I say, for the chunks that do process, nanodisco seems to pull out the correct motif quite well. I would like to be able to run this analysis overnight, but as it is I need to run ten or twenty chunks at a time and then repeat the process of removing failed temp files and restarting every few hours.
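My current manual workaround looks roughly like this (a sketch only; the batch size, paths, and the total of 400 chunks are placeholders I'd need to adjust):

    # Run chunks in small batches, cleaning up temporary files between batches.
    for first in $(seq 1 20 400); do
      last=$((first + 19))
      nanodisco difference -nj 4 -nc 1 -p 12 -f "$first" -l "$last" \
        -i analysis/preprocessed_subset -o analysis/difference_subset \
        -w DH5 -n DH5_Sal_BREX -r reference/DH5_reference_genome.fasta
      # Remove leftovers from any chunks that failed in this batch.
      find analysis/difference_subset -name 'tmp*' -delete
    done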

Hope this helps,

Sam

stdout.txt

touala commented 2 years ago

Thank you Sam. I don't see an obvious explanation or solution, but I want to sort this out. Do you think you could share the fast5 files and reference for a dataset showing these issues at alan.tourancheau [at] bio.ens.psl.eu?

Alan