GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License
131 stars 23 forks source link

Dataprep issue: empty data files #138

Closed acurry-hyde closed 8 months ago

acurry-hyde commented 2 years ago

Hi - I'm having a problem running dataprep where the eventalign.index appears to run successfully but the data files do not eventalign.index output:

transcript_id,read_index,pos_start,pos_end ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,15,172,97378 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,27,97378,302018 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,31,302018,492171 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|,22,492171,850364

data.index ouput:

idx,start,end

data.json:

N/A - is empty

data.log contains:

Total 0 genes. --- SUCCESSFULLY FINISHED ---

data.readcount contains:

idx,n_reads

I've run dataprep on your test dataset and it runs perfectly and based on other issues I've seen your comments on I'm gathering that there's possibly something wrong with either the fasta of gtf file I'm using, but I'm not 100% certain. Any help would be greatly appreciated. Thanks in advance

head of the eventalign.txt:

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    0   ATTAA   15  t   517 84.98   0.766   0.01096 ATTAA   85.30   2.46    -0.11   25541   25574
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    1   TTAAT   15  t   518 99.08   1.737   0.00266 TTAAT   94.22   3.04    1.33    25533   25541
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    2   TAATC   15  t   519 101.34  1.078   0.00598 TAATC   106.55  3.11    -1.39   25515   25533
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    3   AATCC   15  t   520 110.00  3.603   0.00465 AATCC   103.25  5.70    0.98    25501   25515
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    4   ATCCC   15  t   521 71.35   1.063   0.00498 ATCCC   70.32   2.30    0.37    25486   25501
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    4   ATCCC   15  t   522 72.25   1.337   0.00531 ATCCC   70.32   2.30    0.70    25470   25486
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    5   TCCCC   15  t   523 65.68   1.155   0.00365 TCCCC   64.94   2.07    0.30    25459   25470
ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|    6   CCCCT   15  t   524 62.27   1.499   0.01461 CCCCT   64.58   2.07    -0.93   25415   25459

head of gencode transcript fa:

>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG

head of gencode annotation gtf:

##description: evidence-based annotation of the human genome (GRCh38), version 39 (Ensembl 105)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2021-09-02
chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
yuukiiwa commented 2 years ago

Hi @acurry-hyde,

xpore dataprep does not support the contig being in the following format:

ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|

You will have to change the eventalign contig column to include only the transcript_id ENST00000457540.1 for xpore dataprep to run through.

Thanks!

Best wishes, Yuk Kei

acurry-hyde commented 2 years ago

Thanks for your response @yuukiiwa.

I modified the eventalign contig column as you suggested to include only the transcript_id and then re-ran dataprep but unfortunately it had the same output as before (i.e. all of the data files are empty except for headers). Is it that I should alter the gencode transcripts.fa file to also contain only the transcript_id (ENST) and then redo the alignment and eventalign steps?

Eventalign.txt with altered contig

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx
ENST00000457540.1   0   ATTAA   15  t   517 84.98   0.766   0.01096 ATTAA   85.30   2.46    -0.11   25541   25574
ENST00000457540.1   1   TTAAT   15  t   518 99.08   1.737   0.00266 TTAAT   94.22   3.04    1.33    25533   25541
ENST00000457540.1   2   TAATC   15  t   519 101.34  1.078   0.00598 TAATC   106.55  3.11    -1.39   25515   25533
ENST00000457540.1   3   AATCC   15  t   520 110.00  3.603   0.00465 AATCC   103.25  5.70    0.98    25501   25515
ENST00000457540.1   4   ATCCC   15  t   521 71.35   1.063   0.00498 ATCCC   70.32   2.30    0.37    25486   25501
ENST00000457540.1   4   ATCCC   15  t   522 72.25   1.337   0.00531 ATCCC   70.32   2.30    0.70    25470   25486
ENST00000457540.1   5   TCCCC   15  t   523 65.68   1.155   0.00365 TCCCC   64.94   2.07    0.30    25459   25470
ENST00000457540.1   6   CCCCT   15  t   524 62.27   1.499   0.01461 CCCCT   64.58   2.07    -0.93   25415   25459
ENST00000457540.1   6   CCCCT   15  t   525 62.98   1.279   0.00664 CCCCT   64.58   2.07    -0.65   25395   25415

The tail of the job log is below just in case it's at all relevant:

/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/home/z3360668/.venvs/xpore/lib/python3.8/site-packages/xpore-2.1-py3.8.egg/xpore/scripts/dataprep.py:72: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)

Many thanks in advance for your help! Ashton

EDIT: the eventalign.index file appears as follows:

transcript_id,read_index,pos_start,pos_end
ENST00000457540.1,15,172,44098
ENST00000457540.1,27,44098,136628
ENST00000457540.1,31,136628,222663
ENST00000457540.1,22,222663,384497
ENST00000416931.1,12,384497,414504
ENST00000457540.1,18,414504,483561
ENST00000424587.7,1,483561,700948
ENST00000635159.1,8,700948,797833
ENST00000457540.1,21,797833,869968

I've noticed that when clicking to highlight the header 'columns' they appear to highlight as individual values (eg. transcript_id or read_index etc (see first screenshot below), where the delimiter is recognised as separating each header value), however the indexed events appear to not recognise the comma delimiter (see second screenshot below)... could this be contributing to the error?

Screen Shot 2022-04-05 at 1 14 00 pm Screen Shot 2022-04-05 at 1 13 46 pm
dywang0323 commented 2 years ago

I used my own data to run the xpore dataprep, the command is like below: xpore dataprep --eventalign /work/schroederrna/methylation/hbecpolya/hbecpolya_event.tsv --out_dir /work/schroederrna/methylation/hbecpolya/test_4

it takes more than 20h, I didn't get any output, the xpore still running, but the error report shows

Traceback (most recent call last): File "/home/dywang/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'read_index'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/dywang/.local/bin/xpore", line 8, in sys.exit(main()) File "/home/dywang/.local/lib/python3.9/site-packages/xpore/scripts/xpore.py", line 67, in main options.func(options) File "/home/dywang/.local/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 744, in dataprep parallel_index(eventalign_filepath,chunk_size,out_dir,n_processes,resume) File "/home/dywang/.local/lib/python3.9/site-packages/xpore/scripts/dataprep.py", line 60, in parallel_index chunk_complete = chunk[chunk['read_index'] != chunk.iloc[-1]['read_index']] File "/home/dywang/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3505, in getitem indexer = self.columns.get_loc(key) File "/home/dywang/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc raise KeyError(key) from err KeyError: 'read_index'

can anyone help me with this, thanks a lot!!!

yuukiiwa commented 2 years ago

Hi @dywang0323,

Do you mind showing the first 10 lines of your hbecpolya_event.tsv file, please? head hbecpolya_event.tsv

Thanks!

Best wishes, Yuk Kei