xpore/scripts/xpore-dataprep: steps

Goal: Convert the output from nanopolish-eventalign which is in a read-wise format in transcriptomic coordinates to a position-wise format in genomic coordinates.

Current steps:

For each read, we need to combine those multiple (signal) events aligned to the same positions, the results from nanopolish-eventalign, into a single event per position.

The nanopolish-eventalign output looks like

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx 
ENST00000305885.2   65  GGAGC   527492  t   3   114.39  8.741   0.01328 GGAGC   121.19  5.69    -1.04   100036  100076
ENST00000305885.2   65  GGAGC   527492  t   4   122.54  6.191   0.00266 GGAGC   121.19  5.69    0.21    100028  100036
ENST00000305885.2   65  GGAGC   527492  t   5   112.07  8.895   0.00564 GGAGC   121.19  5.69    -1.39   100011  100028
ENST00000305885.2   65  GGAGC   527492  t   6   118.68  3.960   0.00232 GGAGC   121.19  5.69    -0.38   100004  100011
ENST00000305885.2   65  GGAGC   527492  t   7   120.74  5.917   0.00266 GGAGC   121.19  5.69    -0.07   99996   100004
ENST00000305885.2   65  GGAGC   527492  t   8   126.58  7.295   0.00631 GGAGC   121.19  5.69    0.82    99977   99996
ENST00000305885.2   66  GAGCA   527492  t   9   105.75  7.531   0.01162 GAGCA   107.01  3.02    -0.36   99942   99977
ENST00000305885.2   66  GAGCA   527492  t   10  114.03  2.819   0.00299 GAGCA   107.01  3.02    2.02    99933   99942
ENST00000305885.2   66  GAGCA   527492  t   11  100.41  6.246   0.00199 GAGCA   107.01  3.02    -1.90   99927   99933

For example, we want to combine the first 6 lines of GGAGA segments.
This is done by parallel_combine which generates and loads tasks to a task queue and then combine will process it in parallel. Each task is for one read.
In the combine function, the eventalign information of each read is loaded into a pandas dataframe, where the segments of the same positions are combined using a groupby manner.

The output of this step is eventalign.log and eventalign.hdf5 which is in the format below.

[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

Issues

We normally have 1-3 millions of reads.

Generate read count from the bamtx file.
- This is done by `count_reads' (not in parallel because it is fast enough, compared to the other two steps).
- The output of this step is read_count.csv which contains the number of reads per transcript and maps transcript_id to chr, gene_id, gene_name
```
% header
chr, gene_id, gene_name, transcript_id, n_reads
```
- This is necessary because our model (xpore-diffmod) will consider only the genomic positions that have more than certain number of aligned reads. So, we will not pass those genes with lower number of reads to Step 3 for saving the computational time.
Create a .json file, where the info of all reads are stored per genomic position, for modelling.

This is done by parallel_preprocess' which generates and loads tasks to a task queue and thenpreprocess' will process it in parallel.
The output of this step is data.log, data.json, and data.index.

The `data.json' file has the format below.

{
<gene_id>: {
    <genomic_position>: {
        <kmer>:{
            "norm_means": array of float (todo: with 2 decimal points),
            "read_ids": array of string.
        }
    }
}
}

The `data.index' file contains the index of the start and end of the information for each gene for random access.

Issues

If we lower the threshold of read counts, we may have ~10,000 genes.

Discussion

Step 1 and 3 are the bottleneck.
We can change the file formats if it makes the processing time faster and requires less storage.
Do these three steps make sense? Is there any better other way to achieve the goal?

Change the starting and ending of the data.json file because we don't need it! Other file formats should be considered.

Updated: Major changes are the following.

Only 2 steps left.
Convert trascriptomic to genomic on the fly.
Users can choose to model on genomic or transcriptomic coordinates.
The format of data.json has been changed.

Goal: Convert the output from nanopolish-eventalign which is in a read-wise format in transcriptomic coordinates to a position-wise format in genomic coordinates.

Current steps:

For each read, we need to combine those multiple (signal) events aligned to the same positions, the results from nanopolish-eventalign, into a single event per position.

The nanopolish-eventalign output looks like

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx 
ENST00000305885.2   65  GGAGC   527492  t   3   114.39  8.741   0.01328 GGAGC   121.19  5.69    -1.04   100036  100076
ENST00000305885.2   65  GGAGC   527492  t   4   122.54  6.191   0.00266 GGAGC   121.19  5.69    0.21    100028  100036
ENST00000305885.2   65  GGAGC   527492  t   5   112.07  8.895   0.00564 GGAGC   121.19  5.69    -1.39   100011  100028
ENST00000305885.2   65  GGAGC   527492  t   6   118.68  3.960   0.00232 GGAGC   121.19  5.69    -0.38   100004  100011
ENST00000305885.2   65  GGAGC   527492  t   7   120.74  5.917   0.00266 GGAGC   121.19  5.69    -0.07   99996   100004
ENST00000305885.2   65  GGAGC   527492  t   8   126.58  7.295   0.00631 GGAGC   121.19  5.69    0.82    99977   99996
ENST00000305885.2   66  GAGCA   527492  t   9   105.75  7.531   0.01162 GAGCA   107.01  3.02    -0.36   99942   99977
ENST00000305885.2   66  GAGCA   527492  t   10  114.03  2.819   0.00299 GAGCA   107.01  3.02    2.02    99933   99942
ENST00000305885.2   66  GAGCA   527492  t   11  100.41  6.246   0.00199 GAGCA   107.01  3.02    -1.90   99927   99933

For example, we want to combine the first 6 lines of GGAGA segments.
This is done by parallel_combine which generates and loads tasks to a task queue and then combine will process it in parallel. Each task is for one read.
In the combine function, the eventalign information of each read is loaded into a pandas dataframe, where the segments of the same positions are combined using a groupby manner.

The output of this step is eventalign.log and eventalign.hdf5 which is in the format below.

[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

Issues

We normally have 1-3 millions of reads.

Create a .json file, where the info of all reads are stored per genomic position, for modelling.

This is done by parallel_preprocess' which generates and loads tasks to a task queue and thenpreprocess' will process it in parallel.
The output of this step is data.log, data.json, data.index, and data.readcount.

The `data.json' file has the format below.

{
<gene_id>: {
    <genomic_position>: {
        <kmer>: array of .2f
    }
}
}

The `data.index' file contains the index of the start and end of the information for each gene for random access.

Issues

Some genes have >1000 reads.

Discussion

Step 2 uses a lot of memory if a gene has many reads -- to measure properly.
We can change the file formats if it makes the processing time faster and requires less storage.
Do these three steps make sense? Is there any better other way to achieve the goal?

Updated: Major changes are the following.

Limit the number of reads. Goal: Convert the output from nanopolish-eventalign which is in a read-wise format in transcriptomic coordinates to a position-wise format in genomic coordinates.

Current steps:

For each read, we need to combine those multiple (signal) events aligned to the same positions, the results from nanopolish-eventalign, into a single event per position.

The nanopolish-eventalign output looks like

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx 
ENST00000305885.2   65  GGAGC   527492  t   3   114.39  8.741   0.01328 GGAGC   121.19  5.69    -1.04   100036  100076
ENST00000305885.2   65  GGAGC   527492  t   4   122.54  6.191   0.00266 GGAGC   121.19  5.69    0.21    100028  100036
ENST00000305885.2   65  GGAGC   527492  t   5   112.07  8.895   0.00564 GGAGC   121.19  5.69    -1.39   100011  100028
ENST00000305885.2   65  GGAGC   527492  t   6   118.68  3.960   0.00232 GGAGC   121.19  5.69    -0.38   100004  100011
ENST00000305885.2   65  GGAGC   527492  t   7   120.74  5.917   0.00266 GGAGC   121.19  5.69    -0.07   99996   100004
ENST00000305885.2   65  GGAGC   527492  t   8   126.58  7.295   0.00631 GGAGC   121.19  5.69    0.82    99977   99996
ENST00000305885.2   66  GAGCA   527492  t   9   105.75  7.531   0.01162 GAGCA   107.01  3.02    -0.36   99942   99977
ENST00000305885.2   66  GAGCA   527492  t   10  114.03  2.819   0.00299 GAGCA   107.01  3.02    2.02    99933   99942
ENST00000305885.2   66  GAGCA   527492  t   11  100.41  6.246   0.00199 GAGCA   107.01  3.02    -1.90   99927   99933

For example, we want to combine the first 6 lines of GGAGA segments.
This is done by parallel_combine which generates and loads tasks to a task queue and then combine will process it in parallel. Each task is for one read.
In the combine function, the eventalign information of each read is loaded into a pandas dataframe, where the segments of the same positions are combined using a groupby manner.

The output of this step is eventalign.log and eventalign.hdf5 which is in the format below.

[<transcript_id>][<read_id>]['events'] = numpy structured array where the fields are ['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']

Issues

We normally have 1-3 millions of reads.

Create a .json file, where the info of all reads are stored per genomic position, for modelling.

This is done by parallel_preprocess' which generates and loads tasks to a task queue and thenpreprocess' will process it in parallel.
The output of this step is data.log, data.json, data.index, and data.readcount.

The `data.json' file has the format below.

{
<gene_id>: {
    <genomic_position>: {
        <kmer>: array of .2f
    }
}
}

The `data.index' file contains the index of the start and end of the information for each gene for random access.
We keep only readcount_min reads (default = 1000) per gene.

Discussion

We can change the file formats if it makes the processing time faster and requires less storage.
Do these three steps make sense? Is there any better other way to achieve the goal?

GoekeLab / xpore

xpore/scripts/xpore-dataprep: steps #5