How to run find_tails function with multiplexed libraries

adnaniazi / tailfindr

An R package for estimating poly(A)-tail lengths in Oxford Nanopore RNA and DNA reads.

https://www.cbu.uib.no/valen/

GNU General Public License v3.0

How to run find_tails function with multiplexed libraries #50

Open KevinXu264 opened 1 year ago

KevinXu264 commented 1 year ago

Hi,

I was trying to follow the instructions to run this chunk of code

library(tailfindr)
df <- find_tails(fast5_dir = '/path/to/basecalled_data', save_dir = '/path/to/save/directory/',
                 csv_filename = 'rna_tails.csv', num_cores = 2)

But how do I run this after calling guppy with demultiplexing on? The FAST5 files aren't split by barcode in the workspace directory generated by guppy, and I want to be able to distinguish my reads by barcode before calculating tails with the find_tails() function.

KevinXu264 commented 1 year ago

So instead of splitting reads before finding tails, I decided to just run find_tails() on all the FAST5 files and then, in the annotation step, annotate against each barcode's alignment file. However, the initial find_tails() call gives me an output CSV file that is blank in every column except read_id and file_path, and I'm not sure why that is the case.
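One way to attach barcodes after the fact is to join the find_tails() CSV to Guppy's per-read summary on read_id. The following is a minimal sketch, not part of tailfindr: the helper name annotate_barcodes is hypothetical, and it assumes that Guppy wrote a tab-separated summary file with read_id and barcode_arrangement columns and that the CSV has no quoted fields containing commas.

```shell
# Hypothetical helper: append a "barcode" column to a find_tails() CSV.
#   $1: tab-separated Guppy summary with read_id and barcode_arrangement columns (assumed)
#   $2: CSV written by find_tails(), containing a read_id column
annotate_barcodes() {
  awk '
    # Pass 1 (summary file): find both columns by header name, then map read_id -> barcode.
    FNR == NR {
      if (FNR == 1) {
        for (i = 1; i <= NF; i++) {
          if ($i == "read_id") id_col = i
          if ($i == "barcode_arrangement") bc_col = i
        }
        next
      }
      barcode[$id_col] = $bc_col
      next
    }
    # Pass 2 (tails CSV): find read_id in the header and extend the header line.
    FNR == 1 {
      for (i = 1; i <= NF; i++) {
        h = $i
        gsub(/"/, "", h)
        if (h == "read_id") csv_id = i
      }
      print $0 ",barcode"
      next
    }
    # Data rows: append the barcode, or "unknown" if the read is not in the summary.
    {
      id = $csv_id
      gsub(/"/, "", id)
      print $0 "," ((id in barcode) ? barcode[id] : "unknown")
    }
  ' FS='\t' "$1" FS=',' "$2"
}
```

Usage would look like `annotate_barcodes summary.txt rna_tails.csv > rna_tails_barcoded.csv`, where both file names are placeholders. The annotated CSV can then be filtered by barcode before the alignment-based annotation step.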

adnaniazi commented 1 year ago

Hi,

Are you running it on RNA or cDNA data? What is the command that you are using?

Can you run tailfindr on the package data itself to see whether it works there? Like this:

library(tailfindr)
df <- find_tails(fast5_dir = system.file('extdata', 'rna', package = 'tailfindr'),
                 save_dir = '~/Downloads',
                 csv_filename = 'rna_tails.csv',
                 num_cores = 2)

Best, Adnan

KevinXu264 commented 1 year ago

cDNA data.

The find_tails() command is pretty generic. I've run through your workflow with your test data sets, and they've worked before.

I thought it might be an issue with my guppy basecalling step, but I made sure that the output FAST5 files had a Basecall_1D_000 folder with the move/event table inside it.

This was my guppy command

/home/guillaume-chanfreau/Downloads/ont-guppy-gpu-6.2.1/bin/guppy_basecaller \
--config dna_r9.4.1_450bps_hac.cfg \
--input_path ~/Desktop/kxtailfindr/fast5 \
--recursive \
--save_path ~/Desktop/kxtailfindr/basecalled_data \
--fast5_out \
--trim_strategy none \
--barcode_kits SQK-PCB111-24 \
--device auto \
--gpu_runners_per_device 1 \
2>&1 | tee logfile.txt

I tested find_tails() with a single fast5 and it still gives me the same error.

I've attached a fast5 file along with the rna_tails.csv output that I get

FAV21706_f11495bd_27cb0b98_960.zip rna_tails.csv

adnaniazi commented 1 year ago

Hi,

Sorry for the late reply. I just analyzed your FAST5 file and tailfindr seems to work on it. It seems the VBZ plugin is not installed correctly on your system. Which operating system are you running tailfindr on?

KevinXu264 commented 1 year ago

I've run tailfindr on both Mac and Linux systems and they work with the sample RNA fast5 data.

Were you able to get the start, end, and length calculations with my FAST5 data? I reran it and hit the same issue as https://github.com/adnaniazi/tailfindr/issues/51#issue-1716641949: the reads come back with an invalid read_type. The kit used was SQK-PCB111.24.

adnaniazi commented 1 year ago

Hi,

Yes, I was able to get start, end, and length calculations for your FAST5 file. SQK-PCB111.24 seems to be a barcoded version of the SQK-PCS111 kit, so I think tailfindr should work on it -- and it does in fact work.

The RNA sample data included in tailfindr is very old. It does not use VBZ compression, which is why tailfindr works fine on it. Your FAST5 file has raw data compressed in VBZ format, and to read it tailfindr needs a properly installed VBZ plugin. Just running the VBZ installer on a Mac, or extracting the tar file on Linux, is not enough.

You also need to set HDF5_PLUGIN_PATH so that the VBZ plugin is discoverable by any downstream software.

Extract the tar file of the VBZ plugin and then do this (on Linux):

export HDF5_PLUGIN_PATH=/bla/bla/bla_path/vbz/ont-vbz-hdf-plugin-1.0.1-Linux/usr/local/hdf5/lib/plugin
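Before rerunning tailfindr, it can help to confirm that the directory named by HDF5_PLUGIN_PATH actually contains the plugin library. This is a minimal sketch: the helper name check_vbz_plugin is hypothetical, and the library base name libvbz_hdf_plugin is an assumption, so adjust it if your extracted archive differs.

```shell
# Hypothetical helper: report whether a directory holds the VBZ plugin library.
#   $1: plugin directory (defaults to the current value of HDF5_PLUGIN_PATH)
check_vbz_plugin() {
  dir="${1:-$HDF5_PLUGIN_PATH}"
  if [ -z "$dir" ]; then
    echo "HDF5_PLUGIN_PATH is not set"
    return 1
  fi
  # Assumed base name; the extension depends on the platform (for example .so on Linux).
  if ls "$dir"/libvbz_hdf_plugin.* >/dev/null 2>&1; then
    echo "VBZ plugin found in $dir"
  else
    echo "no VBZ plugin found in $dir"
    return 1
  fi
}
```

Running `check_vbz_plugin` in the same terminal session as the export shows whether the path is set and points at the right directory. It does not prove that the HDF5 library can load the plugin.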

KevinXu264 commented 1 year ago

Got it, but I am still unable to get it to work.

So the VBZ plugin file that I extract on Linux is supposed to be called libvbz_hdf_plugin.so?

I set the HDF5_PLUGIN_PATH environment variable to my plugin path, but the majority of my reads are still being called as invalid.

Here's my csv tails output

rna_tails.csv

Also, I checked the plugin path in RStudio with a package called rhdf5filters by running rhdf5filters::hdf5_plugin_path(), and it points to a different location from the one I exported using the command you gave me. If I call find_tails() inside RStudio, will it look in that location instead of the one I exported on the command line?
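A variable exported in a terminal is visible only to programs started from that terminal, so an RStudio session launched from the desktop does not inherit it. R reads the user's .Renviron file at startup, so one option is to record the variable there. This is a minimal sketch with a hypothetical helper name, persist_plugin_path. It does not settle which directory tailfindr's HDF5 library ends up searching.

```shell
# Hypothetical helper: store HDF5_PLUGIN_PATH in an Renviron file read by R at startup.
#   $1: plugin directory
#   $2: Renviron file (defaults to ~/.Renviron)
persist_plugin_path() {
  plugin_dir="$1"
  renviron="${2:-$HOME/.Renviron}"
  # Leave an existing entry untouched rather than adding a second one.
  if [ -f "$renviron" ] && grep -q '^HDF5_PLUGIN_PATH=' "$renviron"; then
    echo "HDF5_PLUGIN_PATH already set in $renviron"
    return 0
  fi
  printf 'HDF5_PLUGIN_PATH=%s\n' "$plugin_dir" >> "$renviron"
}
```

After restarting R, `Sys.getenv("HDF5_PLUGIN_PATH")` in the R console should print the stored directory if the file was picked up.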

adnaniazi commented 1 year ago

If you are getting some tail predictions, then the VBZ plugin is working. So many of your reads are invalid because tailfindr is unable to find the adapters next to the poly(A) tail with high enough confidence. There can be several reasons for this: the adapters are missing, are only partially present, or contain a very high number of sequencing errors.
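To see how large the invalid fraction is, the reads in the output CSV can be tallied by read_type. This is a minimal sketch with a hypothetical helper name, count_read_types. It assumes that the CSV has a read_type column, as the cDNA output discussed in this thread does, and that no quoted field contains a comma.

```shell
# Hypothetical helper: count reads per read_type in a find_tails() CSV.
#   $1: CSV written by find_tails()
count_read_types() {
  awk -F',' '
    # Header row: locate the read_type column by name.
    FNR == 1 {
      for (i = 1; i <= NF; i++) {
        h = $i
        gsub(/"/, "", h)
        if (h == "read_type") col = i
      }
      if (!col) { print "no read_type column found"; exit 1 }
      next
    }
    # Data rows: tally each read_type value, ignoring surrounding quotes.
    {
      t = $col
      gsub(/"/, "", t)
      n[t]++
    }
    END { for (t in n) print t, n[t] }
  ' "$1" | sort
}
```

For example, `count_read_types rna_tails.csv` prints one line per read type with its count. A file where nearly every read is invalid suggests a different problem from one where only some reads are.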

stegiopast commented 11 months ago

Hey everyone,

I am facing a similar problem with cDNA. Is there any update from your side, @KevinXu264? tailfindr runs well on my dRNA samples, but with cDNA I see issues similar to those described here.

Kind regards, Stefan Pastore