GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License

What happens when nanopolish model_kmer is NNNNN? #88

Closed: cathoderaymission closed this issue 11 months ago

cathoderaymission commented 3 years ago

Looking at the code for xpore 2.0, I can see the following:

assert list(set(g_kmer_array))[0].count('N') == 0 ##to weed out the mapped kmers from tx_seq that contain 'N', which is not in diffmod's model_kmer

Does this mean that any nanopore read that includes an unmapped event (one whose model_kmer is not in the model) at that chromosome/position is discarded? Or is only the mean from the NNNNN event discarded, with the rest still used?

I ask because, in your demo data, one of these events sometimes occurs in the middle of a run of events mapped to the same position, and according to the paper multiple event means are averaged with weights given by their event lengths.

ENST00000351111.6       689     GCTGA   3       t       1416    98.74   3.721   0.00232 GCTGA   89.96   2.85    2.75    41380   41387
ENST00000351111.6       689     GCTGA   3       t       1417    87.95   2.286   0.00896 GCTGA   89.96   2.85    -0.63   41353   41380
ENST00000351111.6       689     GCTGA   3       t       1418    102.76  4.575   0.00232 NNNNN   0.00    0.00    inf     41346   41353
ENST00000351111.6       689     GCTGA   3       t       1419    89.48   1.892   0.01228 GCTGA   89.96   2.85    -0.15   41309   41346
ENST00000351111.6       689     GCTGA   3       t       1420    86.30   1.539   0.00365 GCTGA   89.96   2.85    -1.15   41298   41309
ENST00000351111.6       689     GCTGA   3       t       1421    89.38   2.461   0.01228 GCTGA   89.96   2.85    -0.18   41261   41298
ENST00000351111.6       689     GCTGA   3       t       1422    88.21   1.945   0.00432 GCTGA   89.96   2.85    -0.55   41248   41261
ENST00000351111.6       689     GCTGA   3       t       1423    85.51   1.369   0.00332 GCTGA   89.96   2.85    -1.39   41238   41248
ENST00000351111.6       689     GCTGA   3       t       1424    87.93   1.238   0.00365 GCTGA   89.96   2.85    -0.64   41227   41238
ENST00000351111.6       689     GCTGA   3       t       1425    90.67   3.229   0.00432 GCTGA   89.96   2.85    0.22    41214   41227

I also noticed that when not using genome mapping, there doesn't seem to be any check for these 'NNNNN' events in the preprocess_tx function.
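To make the second option in the question concrete (drop only the NNNNN event and average the rest), here is a minimal Python sketch of a length-weighted mean over eventalign rows. It is not xpore's implementation. The column names are an assumption based on the standard nanopolish eventalign layout with signal indices, and `length_weighted_mean` and `collapse_events` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed column order of the eventalign rows shown above
# (nanopolish eventalign with signal indices in the last two columns).
COLUMNS = [
    "contig", "position", "reference_kmer", "read_index", "strand",
    "event_index", "event_level_mean", "event_stdv", "event_length",
    "model_kmer", "model_mean", "model_stdv", "standardized_level",
    "start_idx", "end_idx",
]


def length_weighted_mean(events):
    """Average event means weighted by event length, skipping 'N' k-mers.

    events: iterable of (event_level_mean, event_length, model_kmer).
    Returns None when no usable event remains.
    """
    weighted_sum = 0.0
    total_length = 0.0
    for mean, length, model_kmer in events:
        if "N" in model_kmer:  # e.g. NNNNN: event not matched to the model
            continue
        weighted_sum += mean * length
        total_length += length
    return weighted_sum / total_length if total_length > 0 else None


def collapse_events(lines):
    """Group rows by (contig, position, read_index) and collapse each group."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip headers or malformed rows
        row = dict(zip(COLUMNS, fields))
        key = (row["contig"], int(row["position"]), row["read_index"])
        groups[key].append(
            (float(row["event_level_mean"]),
             float(row["event_length"]),
             row["model_kmer"])
        )
    return {key: length_weighted_mean(evts) for key, evts in groups.items()}


# Four of the rows from the example above, including the NNNNN event.
sample = [
    "ENST00000351111.6 689 GCTGA 3 t 1416 98.74 3.721 0.00232 GCTGA 89.96 2.85 2.75 41380 41387",
    "ENST00000351111.6 689 GCTGA 3 t 1417 87.95 2.286 0.00896 GCTGA 89.96 2.85 -0.63 41353 41380",
    "ENST00000351111.6 689 GCTGA 3 t 1418 102.76 4.575 0.00232 NNNNN 0.00 0.00 inf 41346 41353",
    "ENST00000351111.6 689 GCTGA 3 t 1419 89.48 1.892 0.01228 GCTGA 89.96 2.85 -0.15 41309 41346",
]
collapsed = collapse_events(sample)
print(collapsed)
```

Under this interpretation the read is kept and only the 102.76 event is left out of the average. Whether xpore does this or drops the whole position for that read is exactly what the question asks.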

ploy-np commented 3 years ago

Hi @cathoderaymission, yes, k-mers containing 'N' are discarded. And you are right: there is no check for 'NNNNN' k-mers in transcriptome mode. We will add the 'N' check to preprocess_tx soon. Thank you very much.

However, this does not affect how xpore diffmod runs, except that the result table may contain those 'N' k-mers, which can be filtered out afterwards.
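A minimal sketch of that post hoc filter, using only the Python standard library. The column name `kmer` and the example rows are assumptions for illustration; check the header of your own result table before relying on them. `drop_n_kmers` is a hypothetical helper, not part of xpore.

```python
import csv
import io


def drop_n_kmers(rows, kmer_field="kmer"):
    """Keep only rows whose k-mer contains no 'N' (column name is assumed)."""
    return [row for row in rows if "N" not in row[kmer_field].upper()]


# Made-up rows standing in for a result table; real tables have more columns.
example = io.StringIO(
    "id,position,kmer,pval\n"
    "tx1,689,GCTGA,0.01\n"
    "tx1,702,NNNNN,0.50\n"
    "tx2,15,GGNCT,0.20\n"
)
rows = list(csv.DictReader(example))
kept = drop_n_kmers(rows)
print([row["kmer"] for row in kept])  # ['GCTGA']
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")` and pass the matching delimiter to `csv.DictReader`.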

LuckyMLucy commented 1 year ago

Dear Developer, Could you please tell me the kit used for the cell line? Thanks very much!