Closed: ploy-np closed this issue 11 months ago.

Change the start and end of the data.json file, because we don't need them. Other file formats should also be considered.
Updated: The major changes are the following.

Goal: Convert the output of nanopolish-eventalign, which is read-wise and in transcriptomic coordinates, into a position-wise format in genomic coordinates.

Current steps:

1. For each read, combine the multiple (signal) events that nanopolish-eventalign aligns to the same position into a single event per position. Example nanopolish-eventalign output:
```
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
ENST00000305885.2 65 GGAGC 527492 t 3 114.39 8.741 0.01328 GGAGC 121.19 5.69 -1.04 100036 100076
ENST00000305885.2 65 GGAGC 527492 t 4 122.54 6.191 0.00266 GGAGC 121.19 5.69 0.21 100028 100036
ENST00000305885.2 65 GGAGC 527492 t 5 112.07 8.895 0.00564 GGAGC 121.19 5.69 -1.39 100011 100028
ENST00000305885.2 65 GGAGC 527492 t 6 118.68 3.960 0.00232 GGAGC 121.19 5.69 -0.38 100004 100011
ENST00000305885.2 65 GGAGC 527492 t 7 120.74 5.917 0.00266 GGAGC 121.19 5.69 -0.07 99996 100004
ENST00000305885.2 65 GGAGC 527492 t 8 126.58 7.295 0.00631 GGAGC 121.19 5.69 0.82 99977 99996
ENST00000305885.2 66 GAGCA 527492 t 9 105.75 7.531 0.01162 GAGCA 107.01 3.02 -0.36 99942 99977
ENST00000305885.2 66 GAGCA 527492 t 10 114.03 2.819 0.00299 GAGCA 107.01 3.02 2.02 99933 99942
ENST00000305885.2 66 GAGCA 527492 t 11 100.41 6.246 0.00199 GAGCA 107.01 3.02 -1.90 99927 99933
```
`parallel_combine` generates tasks and loads them onto a task queue; `combine` then processes them in parallel, one task per read. In the `combine` function, the eventalign rows of each read are loaded into a pandas DataFrame, and the segments at the same position are combined with a `groupby`. The outputs are `eventalign.log` and `eventalign.hdf5`; the latter is in the format below.

`[<transcript_id>][<read_id>]['events']` = numpy structured array with the fields `['read_id', 'transcript_id', 'transcriptomic_position', 'reference_kmer', 'norm_mean']`
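The per-read combine step described above can be sketched with pandas as follows. How `norm_mean` is actually derived is not stated here, so the length-weighted mean and the helper name `combine_read` are assumptions for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd

def combine_read(events: pd.DataFrame) -> np.ndarray:
    """Collapse the eventalign rows of one read into a single event per
    transcriptomic position (here: a length-weighted mean of event means)."""
    combined = (
        events.groupby(["contig", "position", "reference_kmer"], sort=True)
        .apply(lambda g: np.average(g["event_level_mean"],
                                    weights=g["event_length"]))
        .rename("norm_mean")
        .reset_index()
    )
    # Mirrors the structured-array layout stored per read in eventalign.hdf5.
    return combined.to_records(index=False)

# Abridged rows from the eventalign table above (positions 65 and 66).
df = pd.DataFrame({
    "contig": ["ENST00000305885.2"] * 4,
    "position": [65, 65, 66, 66],
    "reference_kmer": ["GGAGC", "GGAGC", "GAGCA", "GAGCA"],
    "event_level_mean": [114.39, 122.54, 105.75, 114.03],
    "event_length": [0.01328, 0.00266, 0.01162, 0.00299],
})
events = combine_read(df)  # one record per position
```

Because each task touches only one read's rows, the `groupby` stays small and the tasks are trivially parallelizable from the queue.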
Issues
2. Generate read counts from the bamtx file. The output is `read_count.csv`, which contains the number of reads per transcript and maps `transcript_id` to `chr`, `gene_id`, and `gene_name`.

3. Create a .json file, in which the information of all reads is stored per genomic position, for modelling. `parallel_preprocess` generates and loads tasks onto a task queue, and `preprocess` processes them in parallel. The outputs are `data.log`, `data.json`, `data.index`, and `data.readcount`. The .json format:

```
{
    <gene_id>: {
        <genomic_position>: {
            <kmer>: array of .2f
        }
    }
}
```

Only genes with at least `readcount_min` reads (default = 1000) are included.
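As a concrete illustration of this layout, the per-position accumulation might look like the sketch below. The gene id, the positions, and the helper `add_read` are hypothetical, and the lift from transcriptomic to genomic coordinates is assumed to come from an annotation lookup upstream.

```python
import json
from collections import defaultdict

# Nested {gene_id: {genomic_position: {kmer: [means]}}} accumulator,
# matching the .json layout shown above.
data = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

def add_read(gene_id, mapped_events):
    """mapped_events: (genomic_position, kmer, norm_mean) triples for one
    read, already mapped from transcriptomic to genomic coordinates."""
    for pos, kmer, mean in mapped_events:
        data[gene_id][pos][kmer].append(round(mean, 2))  # "array of .2f"

# Hypothetical gene id and positions, for illustration only.
add_read("geneA", [(1234, "GGAGC", 115.750), (1235, "GAGCA", 107.445)])
record = {g: {p: dict(kmers) for p, kmers in pos.items()}
          for g, pos in data.items()}
line = json.dumps(record)
```

Rounding to two decimals at insertion time keeps the serialized arrays compact, at the cost of a small, fixed quantization of the signal means.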
Issues
Discussion
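Regarding the request at the top of this issue (dropping data.json's enclosing start and end), one option is to write one self-contained JSON record per gene and keep byte offsets in data.index, so no file-wide braces are needed. The record layout, index columns, and function names below are assumptions for illustration, not the project's actual format.

```python
import json

def write_records(genes, json_path, index_path):
    """Write one JSON object per gene on its own line (no file-wide
    '{' ... '}') plus a start/end offset index for random access."""
    with open(json_path, "w") as out, open(index_path, "w") as idx:
        for gene_id, positions in genes.items():
            start = out.tell()
            out.write(json.dumps({gene_id: positions}) + "\n")
            idx.write(f"{gene_id},{start},{out.tell()}\n")

def read_gene(gene_id, json_path, index_path):
    """Use the index to seek straight to a single gene's record."""
    with open(index_path) as idx:
        for row in idx:
            name, start, end = row.rstrip("\n").split(",")
            if name == gene_id:
                with open(json_path) as f:
                    f.seek(int(start))
                    return json.loads(f.read(int(end) - int(start)))
    raise KeyError(gene_id)

# Hypothetical gene ids and contents, for illustration only.
write_records({"geneA": {"1234": {"GGAGC": [115.75]}},
               "geneB": {"5678": {"GAGCA": [107.44]}}},
              "data.json", "data.index")
rec = read_gene("geneB", "data.json", "data.index")
```

Because each line is an independently parseable JSON object, the format also degrades gracefully to a line-by-line scan when the index file is absent.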