adnaniazi / tailfindr

An R package for estimating poly(A)-tail lengths in Oxford Nanopore RNA and DNA reads.
https://www.cbu.uib.no/valen/
GNU General Public License v3.0

Tailfinder failure: Analyzing a single Fast5 file to assess if your data is in an acceptable format... #25

Closed Malabady closed 2 years ago

Malabady commented 2 years ago

Hi,

I am running the mop_tail workflow, which uses tailfindr. The run failed with the error posted below. I used mop_preprocessing, which uses Guppy to rebasecall, so the fast5 files given to tailfindr have been processed by standalone Guppy, yet I am still getting the following error.

Pipeline BIOCORE@CRG Master of Pore 2 modification module's execution summary
        ---------------------------
        Completed at: 2022-03-23T11:28:10.329494-04:00
        Duration    : 1d 17h 53m 52s
        Success     : false
        workDir     : /scratch/malabady/PitcherGenome/ONT-RNA/MOP2/mop_tail/work
        exit status : 1
        Error report: Error executing process > 'TAILFINDR_ESTIMATE_TAIL:estimateTailSize (fast5_pass)'

Caused by:
  Process `TAILFINDR_ESTIMATE_TAIL:estimateTailSize (fast5_pass)` terminated with an error exit status (1)

Command executed:

  R --vanilla --slave -e "library(tailfindr); find_tails(fast5_dir = './' , save_dir = './', , csv_filename = 'fast5_pass_findr.csv', num_cores = 1)"
  gzip *_findr.csv

Command exit status:
  1

Command output:
  ────────────────────────────────────────────────────────────────────────────────
  ── Started tailfindr (version 0.1.0) ───────────────────────────────────────────
  ────────────────────────────────────────────────────────────────────────────────
  ☰ You have configured tailfindr as following:
  ❯ fast5_dir:         ./
  ❯ save_dir:          ./
  ❯ csv_filename:      fast5_pass_findr.csv
  ❯ num_cores:         1
  ❯ basecall_group:    Basecall_1D_000
  ❯ save_plots:        FALSE
  ❯ plot_debug_traces: FALSE
  ❯ plotting_library:  rbokeh
  ── Processing started at 2022-03-23 11:27:18 ───────────────────────────────────
  • Searching for all Fast5 files...
    Done! Found 1 Fast5 files.
  • Analyzing a single Fast5 file to assess if your data
    is in an acceptable format...
    ✖ Fatal error! Your data has been basecalled with MinKNOW
      live basecalling which currently does not save the
      Events/Move table in the Analyses/Basecall_1D_000 section of
      the FAST5 file. You should rebasecall your FAST5 files using
      standalone Guppy or Albacore, and then use tailfindr on the
      rebasecalled files. Please adjust the value of basecall_group
      parameter in such a case, so that tailfindr can find the
      Events/Move table in the specified basecall_group. You can
      check which basecall_group the Event/Move is residing by
      opening your FAST5 file in HDFView.

      If the Events/Move is present in the data and you have
      specified the correct basecall_group, but you still
      get this error then please open an issue on GitHub:
      https://github.com/adnaniazi/tailfindr/issues
      Remember to attach a few (around 5) of your FAST5 files
      to help us understand the issue.
  ── Processing ended at 2022-03-23 11:28:09 ─────────────────────────────────────
  ✖ tailfindr finished unsuccessfully!
  [1] 0

Any suggestions as to what went wrong?

Much appreciated.

adnaniazi commented 2 years ago

Hi,

It seems like you live-basecalled your data during sequencing. Tailfindr cannot work on live-basecalled data.

Please basecall your FAST5 files again with Guppy to produce a new set of basecalled FAST5 files. Then run tailfindr on these newly basecalled files, and remember to specify basecall_group = 'Basecall_1D_001' in the tailfindr command.

Best, Adnan

Malabady commented 2 years ago

Hi Adnan,

I did rebasecall using Guppy in the mop_preprocess workflow (https://biocorecrg.github.io/MOP2/docs/mop_preprocess.html). Is there a way to check the fast5 files?

Thanks, Magdy

adnaniazi commented 2 years ago

Yes, you can check your rebasecalled files in the HDFView software (https://www.hdfgroup.org/downloads/hdfview/). In HDFView, you can check whether these rebasecalled files have a Basecall_1D_001 group.
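For a scripted check instead of HDFView, the following is a minimal sketch in R. It assumes the Bioconductor package rhdf5 is installed and uses its `h5ls()` function; the helper names `basecall_groups_from_listing` and `list_basecall_groups`, and the example path, are hypothetical and not part of tailfindr.

```r
# Hypothetical helper: pull Basecall_1D_* group names out of an h5ls()-style
# listing (a data frame with `group`, `name`, and `otype` columns).
basecall_groups_from_listing <- function(listing) {
  group_path  <- as.character(listing$group)
  entry_name  <- as.character(listing$name)
  is_group    <- as.character(listing$otype) == "H5I_GROUP"
  in_analyses <- grepl("/Analyses$", group_path)
  is_basecall <- grepl("^Basecall_1D_[0-9]+$", entry_name)
  sort(unique(entry_name[is_group & in_analyses & is_basecall]))
}

# Hypothetical wrapper: list the basecall groups present in one FAST5 file.
# Assumes rhdf5 is installed. A listing depth of 3 reaches
# /read_<id>/Analyses/<group> in multi-fast5 files and
# /Analyses/<group> in single-fast5 files.
list_basecall_groups <- function(fast5_file) {
  listing <- rhdf5::h5ls(fast5_file, recursive = 3)
  basecall_groups_from_listing(listing)
}

# Example (placeholder path):
# list_basecall_groups("./fast5_pass/example.fast5")
```

If the result includes Basecall_1D_001, that is the value to pass as basecall_group. This only confirms that the group exists; it does not check that the Events/Move table is inside it.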

Malabady commented 2 years ago

Hi Adnan, they seem to have the Basecall_1D_001 group; see the attached image.

[image]

adnaniazi commented 2 years ago

tailfindr should have worked then. Can you email me (adnaniazi[AT]gmail.com) one of these basecalled files? I need to check it and run tailfindr on my end to debug the issue.

Malabady commented 2 years ago

Sure. These files are large and won't go through regular emails. Can you use Globus?

adnaniazi commented 2 years ago

I don't know what Globus is, but you can also use wetransfer.com to send the large file for free. I need only one of these big files.

Malabady commented 2 years ago

OK. I found that we have access to SENDFILE, which allows large files. I emailed you a ~400 MB fast5 file; you should receive the email shortly. I had to change the extension from fast5 to h5 to view the file in HDFView before sending, so you can change it back to fast5 if needed. Thanks for the help.

adnaniazi commented 2 years ago

Your data seems to be working fine on my end. Can you please send me the latest error that you get after running tailfindr with basecall_group = 'Basecall_1D_001'?

Malabady commented 2 years ago

Hi Adnan, I also got tailfindr to work on my data here. The run has been going since earlier today and is working on all 576 fast5 files; see below. Could you take a look at my command and tell me if it is sufficient or if I need to add any parameters?

  df <- find_tails(fast5_dir = './fast5_pass/',
                   save_dir = './out-tailfinder/',
                   csv_filename = 'rna_tails.csv',
                   num_cores = 24,
                   basecall_group = 'Basecall_1D_001',
                   save_plots = TRUE,
                   plotting_library = 'rbokeh')
  ────────────────────────────────────────────────────────────────────────────────
  ── Started tailfindr (version 1.3) ─────────────────────────────────────────────
  ────────────────────────────────────────────────────────────────────────────────
  ☰ You have configured tailfindr as following:
    fast5_dir:         ./fast5_pass/
    save_dir:          ./out-tailfinder/
    csv_filename:      rna_tails.csv
    num_cores:         24
    basecall_group:    Basecall_1D_001
    save_plots:        TRUE
    plot_debug_traces: FALSE
    plotting_library:  rbokeh
  ── Processing started at 2022-03-24 08:21:20 ───────────────────────────────────
  • Creating a sub-directory to save the plots in.
    Done! All plots will be saved in the following direcotry:
    ./out-tailfinder//plots
  • Searching for all Fast5 files...
    Done! Found 576 Fast5 files.
  • Analyzing a single Fast5 file to assess if your data
    is in an acceptable format...
    ✓ The data has been basecalled using Guppy.
    ✓ Flipflop model was used during basecalling.
    ✓ The reads are packed in multi-fast5 file(s).
    ✓ The experiment type is RNA, so we will search for poly(A) tails.
    ✓ The reads are 1D reads.
  • Starting a parallel compute cluster...
    Done!
  • Discovering reads in the 576 multifast5 files...

adnaniazi commented 2 years ago

Seems fine, but you have set save_plots to TRUE. For 576 multifast5 files that is going to take a lot of time, and all 576*4000 plots will be saved in a single folder. Your OS will hang when you attempt to open this folder. It is therefore recommended to run tailfindr with the save_plots option only on a small subset of the data, just for debugging purposes.
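One way to follow this advice is to copy a few files into a separate directory and enable plotting only there. The sketch below uses base R only; the helper name `copy_fast5_subset` and all directory and file names are hypothetical placeholders. The `find_tails()` arguments in the commented example are the ones already used in this thread, and the figure of 4000 reads per file is taken from the estimate above.

```r
# Hypothetical helper: copy the first `n` FAST5 files from `src_dir` into
# `dest_dir` so that plotting can be tried on a small subset only.
copy_fast5_subset <- function(src_dir, dest_dir, n = 1) {
  files <- sort(list.files(src_dir, pattern = "\\.fast5$", full.names = TRUE))
  subset_files <- head(files, n)
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  copied <- file.copy(subset_files, dest_dir)
  # Return the paths of the copies that were actually made.
  file.path(dest_dir, basename(subset_files[copied]))
}

# Rough plot count for the full run, using the figures quoted above.
n_files <- 576
reads_per_file <- 4000
n_files * reads_per_file  # 2304000

# Example usage (placeholder paths; requires tailfindr):
# copy_fast5_subset("./fast5_pass", "./fast5_debug", n = 1)
# library(tailfindr)
# df <- find_tails(fast5_dir = "./fast5_debug/",
#                  save_dir = "./out-tailfinder-debug/",
#                  csv_filename = "rna_tails_debug.csv",
#                  num_cores = 2,
#                  basecall_group = "Basecall_1D_001",
#                  save_plots = TRUE,
#                  plotting_library = "rbokeh")
```

The full run over all files can then be done separately with save_plots left at its default of FALSE.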