adnaniazi / tailfindr

An R package for estimating poly(A)-tail lengths in Oxford Nanopore RNA and DNA reads.
https://www.cbu.uib.no/valen/
GNU General Public License v3.0

SQK-PCB111.24 have all invalid reads #43

Closed by MustafaElshani 1 year ago

MustafaElshani commented 1 year ago

Dear @adnaniazi

I recently moved to the SQK-PCB111.24 kit. After basecalling with

ont-guppy/bin/guppy_basecaller --config dna_r9.4.1_450bps_sup_prom.cfg \
--input_path /path/to/raw/fast5/ \
-r \
--save_path /path/to/basecalled/ \
--barcode_kits SQK-PCB111-24 \
--device cuda:0 \
--min_qscore 7 \
--calib_detect \
--chunk_size 1536 \
--chunks_per_runner 512 \
--num_callers 4 \
--gpu_runners_per_device 12 \
--fast5_out \
--trim_strategy none 

and going by the information here, which states that PCB111 has the following flanking sequences:

5' - ATCGCCTACCGTGA - barcode - TTGCCTGTCGCTCTATCTTC - 3'
5' - ATCGCCTACCGTGA - barcode - TCTGTTGGTGCTGATATTGC - 3'

I ran tailfindr as follows:

library(tailfindr)
df <- find_tails(fast5_dir = './',
                 save_dir = './',
                 csv_filename = 'tails.csv',
                 dna_datatype = 'custom-cdna',
                 front_primer = "TCTGTTGGTGCTGATATTGC",
                 end_primer = "TTGCCTGTCGCTCTATCTTC",
                 num_cores = 24)

I have tried various combinations, including the defaults, to no avail: about 98% of the reads are reported as invalid with FALSE tails.

Can you see where the issue might be?

adnaniazi commented 1 year ago

Hi,

Can you try switching the front and end primer sequences in the command you used and re-run tailfindr? Like this:

library(tailfindr)
df <- find_tails(fast5_dir = './',
                 save_dir = './',
                 csv_filename = 'tails.csv',
                 dna_datatype = 'custom-cdna',
                 end_primer = "TCTGTTGGTGCTGATATTGC",
                 front_primer = "TTGCCTGTCGCTCTATCTTC",
                 num_cores = 24)

Front and end primer refer to the FP and EP segments in this picture (image: cdna_construct).

Best, Adnan

MustafaElshani commented 1 year ago

Hi

Thank you for your prompt reply. I have tried both, to no avail: almost all reads remain INVALID and FALSE. I have also tried different versions of VBZ, just in case that was causing a problem.

I have tested my environment with another kit I used before, and there I get TRUE reads.

I sequenced these on PromethION flow cells on the new P2 Solo; I don't assume this has anything to do with the issue.

I did not basecall in MinKNOW. I had thought I could run tailfindr directly on those fast5 files, but they were missing the basecall groups. I therefore basecalled with Guppy v6.4.2+97a7f06 using fast5_out, which added the default Basecall_1D_000 group.
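Whether a given fast5 file actually carries the Basecall_1D_000 group can be checked from the shell before running tailfindr. The following is a minimal sketch, assuming the HDF5 command-line tool h5ls is installed. The helper name count_basecall_groups and the file name reads.fast5 are hypothetical, and the listing layout (object path in the first column) is an assumption about what h5ls -r prints.

```shell
# count_basecall_groups: hypothetical helper, not part of tailfindr or Guppy.
# Reads a recursive HDF5 listing on stdin and counts the objects whose path
# ends in Analyses/Basecall_1D_000 (one per basecalled read, by assumption).
count_basecall_groups() {
  awk '$1 ~ /\/Analyses\/Basecall_1D_000$/ { n++ } END { print n + 0 }'
}

# Only runs if h5ls is available and the placeholder file exists.
if command -v h5ls >/dev/null 2>&1 && [ -f reads.fast5 ]; then
  h5ls -r reads.fast5 | count_basecall_groups
fi
```

A count of 0 would suggest the file has not been basecalled with fast5 output enabled.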

The QC is good with N1.1. PCB111.24 is identical to PCS111, so tails should be there.

Any further advice would be appreciated.

Mustafa

adnaniazi commented 1 year ago

Make a fork of the tailfindr repo (master branch), and then edit the find-dna-tailtype.R file in your forked repo. You will find it in the R folder of the forked repo.

On lines 128 and 129 of this file, substitute your front and end primer sequences in place of the sequences that are already there. Save and commit the file to your forked repo. Then install tailfindr from your forked repo.

Once installed, run tailfindr like this:

library(tailfindr)
df <- find_tails(fast5_dir = './',
                 save_dir = './',
                 csv_filename = 'tails.csv',
                 dna_datatype = 'cdna',
                 num_cores = 24)

With these changes, tailfindr will now search for the front and end primer in longer search windows compared to previously. This may increase the chance of finding the primers.

If this does not help, then perhaps your front and end primers are too short and don't have much discriminative power between them in the presence of Nanopore basecalling errors.

MustafaElshani commented 1 year ago

I tried your suggestion, with the same result.

The flanking sequences of PCB111.24 are the same as for PCB109:

5' - ATCGCCTACCGTGAC - barcode - ACTTGCCTGTCGCTCTATCTTC - 3'
5' - ATCGCCTACCGTGAC - barcode - TTTCTGTTGGTGCTGATATTGC - 3'

The only difference is the AC and TT at the 5' end of the sequences following the barcode, which have been removed in the ONT PCB111 documentation. These bases are present in all of the PCB111 barcodes, so I think they should stay in for PCB111.24. tailfindr still failed to find any valid TRUE tails in this particular fast5.
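When trying primer sequences in both orientations, it can help to have the reverse complement of each flanking sequence on hand. A minimal shell sketch using only rev and tr follows; revcomp is a hypothetical helper, not part of tailfindr or Guppy, and it assumes plain A/C/G/T input.

```shell
# revcomp: hypothetical helper, not part of tailfindr or Guppy.
# Prints the reverse complement of the DNA sequence given as the first
# argument: reverse the string, then swap A<->T and C<->G.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# The two sequences following the barcode, as quoted in this comment.
revcomp ACTTGCCTGTCGCTCTATCTTC
revcomp TTTCTGTTGGTGCTGATATTGC
```

The printed sequences can then be compared by eye against the primers entered in find_tails or in find-dna-tailtype.R.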

When I ran tailfindr with the above flanking sequences on a sample I prepared with PCB109 a long time ago, tails were found. I know this kit doesn't attach at the end of the tail and gives a lot of false positives, but tailfindr found the tails nonetheless.

I tried compress_fast5 from ont-fast5-api just in case; again, no tails.

This is driving me crazy. I suspect something fishy with Guppy.

MustafaElshani commented 1 year ago

Hi @adnaniazi

As I suspected, Guppy was the issue. When I reverted to 6.0.6+8a98bbc and basecalled the same fast5 files with exactly the same parameters, I finally got tails. I have no idea what changed between 6.0.6 and 6.4.2 to cause such a drastic difference. I did notice that they are deprecating fast5_out, so maybe it has something to do with that.

So after I got this working, I tried to see what gave me the most TRUE tails, using the flanking sequences above.

Running tailfindr with custom-cdna and providing the sequences:

fpACTT_epTTTC TRUE tails = 771
fpTTTC_epACTT TRUE tails = 1855

Running tailfindr with the defaults and entering the sequences in find-dna-tailtype.R:

fpACTT_epTTTC TRUE tails = 1406
fpTTTC_epACTT TRUE tails = 3144
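Counts like these can be tallied straight from the CSV that tailfindr writes. The following is a minimal sketch, assuming the CSV has a header row with a column named tail_is_valid holding TRUE or FALSE, and that no field contains an embedded comma. count_valid_tails is a hypothetical helper, not part of tailfindr.

```shell
# count_valid_tails: hypothetical helper, not part of tailfindr.
# Assumes a comma-separated file whose header names a tail_is_valid column
# (an assumption about the CSV layout) and whose fields contain no commas.
count_valid_tails() {
  awk -F',' '
    NR == 1 {
      # Locate the column by name so its position does not matter.
      for (i = 1; i <= NF; i++) {
        name = $i
        gsub(/["\r]/, "", name)
        if (name == "tail_is_valid") col = i
      }
      if (!col) { print "tail_is_valid column not found" > "/dev/stderr"; exit 1 }
      next
    }
    {
      value = $col
      gsub(/["\r]/, "", value)
      if (value == "TRUE") n++
    }
    END { if (col) print n + 0 }
  ' "$1"
}

# Only runs if the output file from the find_tails call exists here.
if [ -f tails.csv ]; then
  count_valid_tails tails.csv
fi
```

Running it on the CSV from each primer configuration gives one number per run to compare.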

Hence I have a couple of questions:

  1. Would it be right to think that the default run was correct, since it detected more tails than custom-cdna?
  2. From the ONT documentation it looks as if ACTT... is the 'fp' primer. However, when I enter it as 'fp' I get fewer TRUE tails than when I enter TTTC... I just can't seem to orient myself. Should I assume that the run with the highest number of TRUE tails is closest to the truth?

Your help is appreciated

Mustafa

adnaniazi commented 1 year ago

If you used the protocol shown in the figure below (image: Untitled), then just use the default settings of tailfindr (my original tailfindr, not your forked one). This is because tailfindr should work out of the box for protocols such as SQK-PCS111 and its barcoding version SQK-PCB111.24 without you having to specify front and end primers.

So here is how I would like you to proceed:

  1. Install tailfindr again from the master branch of my repository
  2. Use the basecalled data from 6.0.6+8a98bbc
  3. Use tailfindr like this: df <- find_tails(fast5_dir = './', save_dir = './', csv_filename = 'tails.csv', num_cores = 24)

Hope this helps.

Best, Adnan

MustafaElshani commented 1 year ago

The plot thickens indeed! Not only is v6.0.6 probably the last compatible version, it also only works with the default --chunk_size, --chunks_per_runner and --num_callers settings.

I optimised those parameters on an RTX3090 GPU, which worked, but the same parameters didn't work well on an RTX8000, so I had to use the defaults.

Versions after v6.0.6 didn't work with either GPU, even with default settings.

...and yes, your default tailfindr worked fine with PCB111. It has been a roundabout way to resolve this issue, but it seems Guppy is doing strange things!

I'm finally happy and will now proceed with the analysis and see how many days I need to wait to tailfind 12000 fast5 files.

Thank you for your help.

Mustafa