GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License
136 stars 22 forks source link

xpore/scripts/xpore-dataprep: steps #5

Closed ploy-np closed 11 months ago

ploy-np commented 4 years ago

Goal: Convert the output from nanopolish-eventalign which is in a read-wise format in transcriptomic coordinates to a position-wise format in genomic coordinates.

Current steps:

  1. For each read, we need to combine those multiple (signal) events aligned to the same positions, the results from nanopolish-eventalign, into a single event per position.

    • The nanopolish-eventalign output looks like
      contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx 
      ENST00000305885.2   65  GGAGC   527492  t   3   114.39  8.741   0.01328 GGAGC   121.19  5.69    -1.04   100036  100076
      ENST00000305885.2   65  GGAGC   527492  t   4   122.54  6.191   0.00266 GGAGC   121.19  5.69    0.21    100028  100036
      ENST00000305885.2   65  GGAGC   527492  t   5   112.07  8.895   0.00564 GGAGC   121.19  5.69    -1.39   100011  100028
      ENST00000305885.2   65  GGAGC   527492  t   6   118.68  3.960   0.00232 GGAGC   121.19  5.69    -0.38   100004  100011
      ENST00000305885.2   65  GGAGC   527492  t   7   120.74  5.917   0.00266 GGAGC   121.19  5.69    -0.07   99996   100004
      ENST00000305885.2   65  GGAGC   527492  t   8   126.58  7.295   0.00631 GGAGC   121.19  5.69    0.82    99977   99996
      ENST00000305885.2   66  GAGCA   527492  t   9   105.75  7.531   0.01162 GAGCA   107.01  3.02    -0.36   99942   99977
      ENST00000305885.2   66  GAGCA   527492  t   10  114.03  2.819   0.00299 GAGCA   107.01  3.02    2.02    99933   99942
      ENST00000305885.2   66  GAGCA   527492  t   11  100.41  6.246   0.00199 GAGCA   107.01  3.02    -1.90   99927   99933
    • For example, we want to combine the first 6 lines of GGAGA segments.
    • This is done by parallel_combine which generates and loads tasks to a task queue and then combine will process it in parallel. Each task is for one read.
    • In the combine function, the eventalign information of each read is loaded into a pandas dataframe, where the segments of the same positions are combined using a groupby manner.
    • The output of this step is eventalign.log and eventalign.hdf5 which is in the format below.
      [<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

      Issues

    • We normally have 1-3 millions of reads.
  2. Generate read count from the bamtx file.

    • This is done by `count_reads' (not in parallel because it is fast enough, compared to the other two steps).
    • The output of this step is read_count.csv which contains the number of reads per transcript and maps transcript_id to chr, gene_id, gene_name
      % header
      chr, gene_id, gene_name, transcript_id, n_reads
    • This is necessary because our model (xpore-diffmod) will consider only the genomic positions that have more than certain number of aligned reads. So, we will not pass those genes with lower number of reads to Step 3 for saving the computational time.
  3. Create a .json file, where the info of all reads are stored per genomic position, for modelling.

Issues

Discussion

ploy-np commented 4 years ago

Change the starting and ending of the data.json file because we don't need it! Other file formats should be considered.

ploy-np commented 4 years ago

Updated: Major changes are the following.

Goal: Convert the output from nanopolish-eventalign which is in a read-wise format in transcriptomic coordinates to a position-wise format in genomic coordinates.

Current steps:

  1. For each read, we need to combine those multiple (signal) events aligned to the same positions, the results from nanopolish-eventalign, into a single event per position.

    • The nanopolish-eventalign output looks like
      contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx 
      ENST00000305885.2   65  GGAGC   527492  t   3   114.39  8.741   0.01328 GGAGC   121.19  5.69    -1.04   100036  100076
      ENST00000305885.2   65  GGAGC   527492  t   4   122.54  6.191   0.00266 GGAGC   121.19  5.69    0.21    100028  100036
      ENST00000305885.2   65  GGAGC   527492  t   5   112.07  8.895   0.00564 GGAGC   121.19  5.69    -1.39   100011  100028
      ENST00000305885.2   65  GGAGC   527492  t   6   118.68  3.960   0.00232 GGAGC   121.19  5.69    -0.38   100004  100011
      ENST00000305885.2   65  GGAGC   527492  t   7   120.74  5.917   0.00266 GGAGC   121.19  5.69    -0.07   99996   100004
      ENST00000305885.2   65  GGAGC   527492  t   8   126.58  7.295   0.00631 GGAGC   121.19  5.69    0.82    99977   99996
      ENST00000305885.2   66  GAGCA   527492  t   9   105.75  7.531   0.01162 GAGCA   107.01  3.02    -0.36   99942   99977
      ENST00000305885.2   66  GAGCA   527492  t   10  114.03  2.819   0.00299 GAGCA   107.01  3.02    2.02    99933   99942
      ENST00000305885.2   66  GAGCA   527492  t   11  100.41  6.246   0.00199 GAGCA   107.01  3.02    -1.90   99927   99933
    • For example, we want to combine the first 6 lines of GGAGA segments.
    • This is done by parallel_combine which generates and loads tasks to a task queue and then combine will process it in parallel. Each task is for one read.
    • In the combine function, the eventalign information of each read is loaded into a pandas dataframe, where the segments of the same positions are combined using a groupby manner.
    • The output of this step is eventalign.log and eventalign.hdf5 which is in the format below.
      [<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

      Issues

    • We normally have 1-3 millions of reads.
  2. Create a .json file, where the info of all reads are stored per genomic position, for modelling.

Issues

Discussion

ploy-np commented 4 years ago

Updated: Major changes are the following.

Current steps:

  1. For each read, we need to combine those multiple (signal) events aligned to the same positions, the results from nanopolish-eventalign, into a single event per position.

    • The nanopolish-eventalign output looks like
      contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx 
      ENST00000305885.2   65  GGAGC   527492  t   3   114.39  8.741   0.01328 GGAGC   121.19  5.69    -1.04   100036  100076
      ENST00000305885.2   65  GGAGC   527492  t   4   122.54  6.191   0.00266 GGAGC   121.19  5.69    0.21    100028  100036
      ENST00000305885.2   65  GGAGC   527492  t   5   112.07  8.895   0.00564 GGAGC   121.19  5.69    -1.39   100011  100028
      ENST00000305885.2   65  GGAGC   527492  t   6   118.68  3.960   0.00232 GGAGC   121.19  5.69    -0.38   100004  100011
      ENST00000305885.2   65  GGAGC   527492  t   7   120.74  5.917   0.00266 GGAGC   121.19  5.69    -0.07   99996   100004
      ENST00000305885.2   65  GGAGC   527492  t   8   126.58  7.295   0.00631 GGAGC   121.19  5.69    0.82    99977   99996
      ENST00000305885.2   66  GAGCA   527492  t   9   105.75  7.531   0.01162 GAGCA   107.01  3.02    -0.36   99942   99977
      ENST00000305885.2   66  GAGCA   527492  t   10  114.03  2.819   0.00299 GAGCA   107.01  3.02    2.02    99933   99942
      ENST00000305885.2   66  GAGCA   527492  t   11  100.41  6.246   0.00199 GAGCA   107.01  3.02    -1.90   99927   99933
    • For example, we want to combine the first 6 lines of GGAGA segments.
    • This is done by parallel_combine which generates and loads tasks to a task queue and then combine will process it in parallel. Each task is for one read.
    • In the combine function, the eventalign information of each read is loaded into a pandas dataframe, where the segments of the same positions are combined using a groupby manner.
    • The output of this step is eventalign.log and eventalign.hdf5 which is in the format below.
      [<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

      Issues

    • We normally have 1-3 millions of reads.
  2. Create a .json file, where the info of all reads are stored per genomic position, for modelling.

Discussion