LRB-IIMCB / ninetails

An R package for finding non-adenosine residues in poly(A) tails of ONT direct RNA sequencing reads

non-A residues, sequins #10

Open ana-mil opened 6 months ago

ana-mil commented 6 months ago

Hello, we have used Ninetails on a direct RNA run of K562 samples that contains sequins, in which we mapped the sequins only. We would expect all sequins to have A-only tails, yet we get non-A residues in 13% of all reads tagged “PASS”, where we would expect 0%. Could you please help us interpret this and understand why it is happening?
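For anyone reproducing the 13% figure, here is a minimal sketch of how such a fraction could be computed from a per-read classification table. The column names `qc_tag` and `class` and the toy rows are assumptions made for illustration; only the labels "PASS" and "decorated" come from this thread, so please check the headers of your own class_data file before relying on it.

```python
import csv
import io


def decorated_fraction(rows, qc_col="qc_tag", class_col="class"):
    """Return (n_pass, n_decorated, fraction) over reads whose QC tag is PASS.

    qc_col and class_col are assumed column names, not confirmed ones.
    """
    n_pass = 0
    n_decorated = 0
    for row in rows:
        if row[qc_col] != "PASS":
            continue  # skip reads that did not pass QC
        n_pass += 1
        if row[class_col] == "decorated":
            n_decorated += 1
    fraction = n_decorated / n_pass if n_pass else float("nan")
    return n_pass, n_decorated, fraction


# Made-up table standing in for a real class_data file (hypothetical values).
example_tsv = (
    "readname\tqc_tag\tclass\n"
    "read_a\tPASS\tdecorated\n"
    "read_b\tPASS\tblank\n"
    "read_c\tADAPTER\tunclassified\n"
    "read_d\tPASS\tblank\n"
    "read_e\tPASS\tblank\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
n_pass, n_decorated, fraction = decorated_fraction(rows)
print(f"{n_decorated}/{n_pass} PASS reads decorated ({fraction:.1%})")
```

To run it on a real file, replace the `io.StringIO(...)` argument with an open file handle. Reporting the raw counts next to the percentage makes it easier to compare runs of different sizes.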

The input, the command used to run Ninetails (with the Docker image) and the output are in the following public folder: https://public-docs.crg.es/enovoa/public/amilovanovic/NineTails

The raw data are publicly available (https://figshare.com/articles/dataset/k562_sequins_dRNA_albacore-2_1_3_tar_gz/5797287); they were stripped of the original basecalls and re-basecalled with Guppy 6.0.6.

Thanks in advance for your help and time!

Best regards,

Ana

nemitheasura commented 5 months ago

@ana-mil, sorry for the late reply. I was a bit overloaded. Thank you for providing some of the outputs.

There are a few things for you to consider:

  1. Non-As within poly(A) tails are not uncommon even in laboratory-made molecules (produced both by IVT and by tailing with PAP). We tested this in various experimental contexts.

  2. Ninetails relies on external segmentation provided by other software. Nanopolish and tailfindr can sometimes delimit the poly(A) tail erroneously, especially in low-complexity regions. If this happens, Ninetails inherits the error.

  3. Ninetails has some constraints that prevent it from reporting non-As when the signal is too ambiguous. These were introduced to keep it from reporting too many false positives. Even though we aimed to minimize errors as much as possible, it sometimes makes mistakes.

Also, according to our tests, ~5% of PAP-tailed reads were decorated with non-As even when the reaction mixture for the PAP-tailing step contained ATP exclusively. This is due to trace amounts of other nucleotides carried over from the previous experimental step (IVT). We confirmed this by manual inspection of the signals. These results are included in the paper, which has been accepted in Nat Comm and will soon be publicly available.

In a scenario like this, I would always recommend screening some signals. Ninetails has built-in visualization functions. You might focus only on the tail region, using plot_tail_range(). For example, I took some of your data and plotted it. For illustrative purposes, I will show only a glimpse.

Below is read 35330e24-f598-4681-b027-8deb45d5d02c from your 20171122_0612_K562Sequins2 subset. Ninetails marked this read as decorated, which means that the pipeline reported a non-A occurrence. The read corresponds to your sequin.

[Image: plot_read_test_novoa_lab_001]

The tail is marked in orange. There are visible signal distortions indicating the presence of non-As. One could argue whether the tail/adapter border was delimited correctly. However, it is clearly visible that within the A stretch there are most likely G and C.

Unfortunately, I was unable to run the pipeline on my own, since the sequin reference sequence is unavailable under the provided link (the domain is for sale), so I cannot map the sequins to the reference. Furthermore, you provided class_data twice (your nonadenosine_residues file is in fact the class_data file; the two should not be identical), so I am unable to check whether Ninetails misclassified the data.
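A quick way to catch this kind of mix-up before sharing outputs is to compare file checksums. The sketch below is plain Python and does not use Ninetails itself; the file names, the `.tsv` extension, and the toy file contents are hypothetical placeholders.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)


def write_text(path, text):
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Demo on temporary files; names and contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    class_path = os.path.join(tmp, "class_data.tsv")
    residues_path = os.path.join(tmp, "nonadenosine_residues.tsv")

    # Case 1: the same table saved under both names (the mix-up).
    write_text(class_path, "readname\tclass\nread_a\tdecorated\n")
    write_text(residues_path, "readname\tclass\nread_a\tdecorated\n")
    duplicate_detected = same_content(class_path, residues_path)

    # Case 2: the second file now holds different content.
    write_text(residues_path, "readname\tprediction\nread_a\tG\n")
    still_duplicate = same_content(class_path, residues_path)

print(duplicate_detected, still_duplicate)
```

On real outputs, calling `same_content()` on the two paths is enough; a `True` result means one file was most likely saved under both names.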

Also, please keep in mind that our classifier was tested and optimized on SQK-RNA002 chemistry, so it might produce less reliable results on SQK-RNA001.

In case of further questions, please @ me.

Thank you for using Ninetails.

Best, N.

ana-mil commented 5 months ago

Hello,

Thank you very much for your reply. I have updated the nonadenosine_residues file; it is correct now (apologies for the mix-up). The reference can be found at https://public-docs.crg.es/enovoa/public/amilovanovic/NineTails/sequins.fa.

Please note that the sequins are not produced via PAP. Rather, the poly(A) tail is genomically encoded: the sequins are in vitro transcribed with the poly(A) sequence already built in.

Best,

Ana

nemitheasura commented 5 months ago

Hi @ana-mil, thanks for the update. After the RNA meeting, I will have a look at that.

Regarding your data: we tested various approaches, PAP-tailing as well as IVT with various enzymes and rNTP concentrations. Every time, we found a fraction of reads containing non-As. Manual inspection of the signals revealed that this has to be an experimental artifact, not a neural network mistake. We describe all of this in detail in our paper, which is soon to be published in Nat Comm. If you would like to discuss further, I am open to it, but not here, since our data are unpublished. Please @ me.

Best, N.