GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License

KeyError: 'contig' when running xpore-dataprep #122

Closed vmurigneu closed 2 years ago

vmurigneu commented 2 years ago

Dear xpore team,

We are trying to run xpore v2.1 using ONT dRNA data from a non-reference species. The same xpore dataprep command ran successfully on our test dataset (3110 reads) but failed on our control dataset (313,762 reads).

Here is the command used for running xpore data prep:

threads=4
xpore-dataprep \
    --eventalign ${out_dir}/${prefix}.eventalign.txt \
    --summary ${out_dir}/${prefix}.summary.txt \
    --out_dir dataprep \
    --n_processes ${threads}

Here is the content of the data prep folder for the successful dataset:

-rw-r--r-- 1 uqvmurig users 176298807 Jan  7 11:54 eventalign.hdf5
-rw-r--r-- 1 uqvmurig users    112399 Jan  7 11:54 eventalign.log
-rw-r--r-- 1 uqvmurig users 978012184 Jan  7 17:50 data.json
-rw-r--r-- 1 uqvmurig users       465 Jan  7 17:50 data.index
-rw-r--r-- 1 uqvmurig users       298 Jan  7 17:50 data.readcount
-rw-r--r-- 1 uqvmurig users       605 Jan  7 17:50 data.log

head dRNA.summary.txt

read_index  read_name   fast5_path  model_name  strand  num_events  num_steps   num_skips   num_stays   total_duration  shift   scale   drift   var
1   ecf2faf6-2276-40db-b3bc-11f2692225f6    xpore/data/dRNA/fast5/FAO13853_pass_1cc9b748_56.fast5       template    2160    1042    27  1090    13.32   -1.932  0.948   0.000   1.297
0   b187a7d0-0e94-407b-acac-c27115276a5a    xpore/data/dRNA/fast5/FAO13853_pass_1cc9b748_56.fast5       template    2192    1040    30  1121    13.82   -4.763  0.964   0.000   1.321
2   f5839f16-4f7c-4580-9314-cd0c589524bb    xpore/data/dRNA/fast5/FAO13853_pass_1cc9b748_265.fast5      template    2178    1037    28  1112    13.31   -0.480  0.876   0.000   1.395

head dRNA.eventalign.txt

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx
EvRCC1521_s239_g11944_i1    22  TGGCC   1   t   86  100.86  1.864   0.00266 TGGCC   103.76  5.31    -0.48   57480   57488
EvRCC1521_s239_g11944_i1    22  TGGCC   1   t   87  97.45   3.034   0.00232 TGGCC   103.76  5.31    -1.04   57473   57480
EvRCC1521_s239_g11944_i1    23  GGCCT   1   t   88  105.02  5.167   0.00896 GGCCT   105.48  3.29    -0.12   57446   57473

Here is the content of the data prep folder for the failed control dataset:

-rw-r--r-- 1 uqvmurig users 463105 Jan  7 19:08 eventalign.hdf5
-rw-r--r-- 1 uqvmurig users    222 Jan  7 19:08 eventalign.log

The eventalign.log file contains only 6 read IDs.

head control.summary.txt

read_index  read_name   fast5_path  model_name  strand  num_events  num_steps   num_skips   num_stays   total_duration  shift   scale   drift   var
2   d8433da3-a369-4922-af73-67a86a32696b    xpore/data/Icontrol/fast5/FAR28319_pass_45896bd2_48.fast5       template    905 419 15  470 5.38    0.083   0.913   0.000   1.217
1   11545548-fc42-4561-a9bb-555c98f12047    xpore/data/control/fast5/FAR28319_pass_45896bd2_67.fast5        template    3657    1032    29  2595    24.13   10.540  0.960   0.000   1.356
3   2c1714c7-f3af-4875-931f-89474b8d31f4    xpore/data/control/fast5/FAR28319_pass_45896bd2_4.fast5     template    2882    832 29  2020    20.48   -1.869  0.908   0.000   1.217

head control.eventalign.txt

contig  position    reference_kmer  read_index  strand  event_index event_level_mean    event_stdv  event_length    model_kmer  model_mean  model_stdv  standardized_level  start_idx   end_idx
EvRCC1521_s239_g11944_i1    0   TAATA   2   t   3982    112.02  1.316   0.00232 TAATA   108.46  3.11    1.04    25487   25494
EvRCC1521_s239_g11944_i1    0   TAATA   2   t   3983    109.39  2.542   0.00299 TAATA   108.46  3.11    0.27    25478   25487
EvRCC1521_s239_g11944_i1    0   TAATA   2   t   3984    111.37  1.698   0.01062 TAATA   108.46  3.11    0.85    25446   25478

The error message is:

Warning: duplicate read name bdb73d26-2eac-4f2e-9990-472934440653 found in fasta file
[readdb] indexing /scratch/90days/uqvmurig/EPT/xpore/data/IVT_control/fast5
[readdb] num reads: 363721, num reads with path to fast5: 363721
[post-run summary] total reads: 313735, unparseable: 0, qc fail: 16094, could not calibrate: 4598, no alignment: 600, bad fast5: 0
Process Consumer-1:
Traceback (most recent call last):
  File "/home/uqvmurig/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'contig'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/uqvmurig/.local/lib/python3.6/site-packages/xpore/scripts/helper.py", line 110, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/uqvmurig/.local/lib/python3.6/site-packages/xpore/scripts/dataprep.py", line 62, in combine
    eventalign_result['transcript_id'] = eventalign_result['contig']
  File "/home/uqvmurig/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/uqvmurig/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'contig'
Process Consumer-4:
Traceback (most recent call last):
  File "/home/uqvmurig/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'contig'
=>> PBS: job killed: walltime 288016 exceeded limit 288000

Do you have any idea what is causing the issue and how to fix it?

Thanks, Valentine
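
The traceback above fails at eventalign_result['contig'], so one quick check is whether the eventalign file's header row actually contains that column. The following is a minimal sketch, assuming the eventalign output is tab-separated with a single header line, as in the excerpts above. The function name missing_columns is hypothetical and not part of xpore. It only inspects the input file and does not reproduce what xpore does internally.

```python
import csv


def missing_columns(path, required=("contig",)):
    """Return the names in `required` that are absent from the header row.

    Hypothetical helper (not part of xpore). Assumes a tab-separated file
    whose first line is the header.
    """
    with open(path, newline="") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]
```

For example, missing_columns("control.eventalign.txt") returns an empty list when the header contains contig. In that case the input file is unlikely to be the cause, and the installed xpore version or command is the next thing to check.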

yuukiiwa commented 2 years ago

Hi Valentine @vmurigneu,

Thank you for reporting the error! Could you double-check that you are actually running xpore version 2.1? Since version 1.0, xpore dataprep no longer outputs eventalign.hdf5, and the current command is xpore dataprep (with a space) rather than xpore-dataprep. Thank you!

Best wishes, Yuk Kei
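
Because an older entry point can remain on the PATH after an upgrade, it can help to see which of the two command names resolve to an executable and where each one lives. This is a minimal sketch using only the Python standard library. The function name locate_commands is hypothetical, and the two command names are the ones mentioned in this thread.

```python
import shutil


def locate_commands(names=("xpore", "xpore-dataprep")):
    """Map each command name to the executable path found on PATH, or None.

    Hypothetical helper (not part of xpore).
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If both names resolve, the paths show which installation each command comes from, for example a user-level site-packages directory versus a system or environment one.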

vmurigneu commented 2 years ago

Hi Yuk,

Thanks for your help. It worked when I used the command xpore dataprep instead of xpore-dataprep. Both commands were accessible because I previously had xpore 2.0 installed. I was using xpore-dataprep because I followed this page of the tutorial, which has not been updated yet: https://xpore.readthedocs.io/en/latest/cmd.html.

Best wishes, Valentine

yuukiiwa commented 2 years ago

Hi Valentine,

Apologies that we didn't update the command usage page (we will update it if it has not been updated yet).

We did update all the commands in the quick start. If you have xpore 2.0 or above installed, you should be able to check the version with xpore -v (screenshot attached).

Thanks!

Best wishes, Yuk Kei
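
The version check with xpore -v mentioned above can also be scripted, for example at the top of a job script. This is a rough sketch: the function name xpore_version is hypothetical, and the format of the printed version string is an assumption, so the raw output is returned unparsed.

```python
import shutil
import subprocess


def xpore_version(executable="xpore"):
    """Return the raw output of `<executable> -v`, or None if it is not on PATH.

    Hypothetical helper (not part of xpore). The output format is not
    assumed, so the text is returned as printed.
    """
    if shutil.which(executable) is None:
        return None
    completed = subprocess.run(
        [executable, "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture the text whichever stream it goes to
        universal_newlines=True,   # decode bytes to str (works on Python 3.6)
    )
    return completed.stdout.strip()
```

A return value of None means the xpore command itself is missing from the PATH, which points to an installation older than the one that introduced the xpore dataprep form, or to a different environment being active.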