GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License

error due to missing start_idx and end_idx headers in eventalign #41

Closed callumparr closed 3 years ago

callumparr commented 3 years ago

I ran xpore on my own data and ran into an issue.

  1. I map my reads with minimap2 (map-ont) to the mouse transcriptome, which is actually the cDNA and ncRNA references merged together.
  2. Then I convert the SAM to BAM and create the BAM index.
  3. Then I create the nanopolish (v0.13.2:latest) index as follows:

nanopolish index -d /path/to/raw/fast5 /path/to/fastq

This created the index and index.readdb files, as well as the .fai and related files.

  4. Then I create the eventalign file:

nanopolish eventalign --reads /path/to/fastq --bam /path/to/alignments --scale-events --summary output.txt --threads 12 > eventalign.txt

This creates eventalign.txt and the summary file as expected, but with the following headers:

contig  position        reference_kmer  read_index      strand  event_index     event_level_mean        event_stdv      event_length    model_kmer      model_mean      model_stdv      standardized_level
ENSMUST00000103626.2    33      TGCAG   1       t       29      102.35  3.503   0.00299 TGCAG   103.32  3.86    -0.23
ENSMUST00000103626.2    34      GCAGC   1       t       30      92.43   2.793   0.00232 GCAGC   89.55   5.09    0.51
ENSMUST00000103626.2    35      CAGCT   1       t       31      108.28  2.371   0.00498 CAGCT   112.05  3.02    -1.13
ENSMUST00000103626.2    35      CAGCT   1       t       32      112.81  3.627   0.00299 CAGCT   112.05  3.02    0.23
ENSMUST00000103626.2    35      CAGCT   1       t       33      108.11  2.921   0.00432 CAGCT   112.05  3.02    -1.18
ENSMUST00000103626.2    35      CAGCT   1       t       34      113.79  2.445   0.00299 CAGCT   112.05  3.02    0.52
ENSMUST00000103626.2    35      CAGCT   1       t       35      108.33  4.065   0.00465 CAGCT   112.05  3.02    -1.11
ENSMUST00000103626.2    36      AGCTG   1       t       36      117.91  2.631   0.00266 AGCTG   117.44  3.55    0.12

At this point, I realised it is missing the start_idx and end_idx headers that are in the demo data eventalign file.
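A quick way to confirm this before running anything else is to compare the header line against the columns that are expected. The following is only a minimal sketch: the helper name `missing_columns` is hypothetical, and the required set is an assumption based on the two headers that appear in the demo data but not in my file.

```python
import csv
import io

# Assumption: these are the extra columns present in the demo eventalign file.
REQUIRED_COLUMNS = {"start_idx", "end_idx"}


def missing_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the header (first) line."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return sorted(set(required) - set(header))


# Header exactly as produced by the eventalign run above.
header_line = "\t".join([
    "contig", "position", "reference_kmer", "read_index", "strand",
    "event_index", "event_level_mean", "event_stdv", "event_length",
    "model_kmer", "model_mean", "model_stdv", "standardized_level",
]) + "\n"

# For a real file, pass open("eventalign.txt", newline="") instead of StringIO.
print(missing_columns(io.StringIO(header_line)))
```

Only the first line is read, so this stays cheap even on a very large eventalign file.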

  5. In any case, I ran xpore-dataprep as follows:

xpore-dataprep --eventalign Mouse_aging/nanopolish/Day2_03_pass_eventalign.txt --summary Mouse_aging/nanopolish/Day2_03_DRS_summary.txt --out_dir Mouse_aging/nanopolish

In this case I omitted --genome, even though the documentation says to add it, as I believe it is only required when you have mapped to the genome. I was unsure about this, as both nanopolish and some of the documentation here say that direct RNA has to be mapped to the transcriptome at the moment. What is the function of the --genome flag in this case? Is xpore also suitable for gDNA reads?

Running xpore-dataprep produces the following error:

Process Consumer-1:
Traceback (most recent call last):
  File "/home/callum/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'end_idx'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/local/pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/callum/.local/lib/python3.7/site-packages/xpore/scripts/helper.py", line 110, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/callum/.local/lib/python3.7/site-packages/xpore/scripts/dataprep.py", line 47, in combine
    eventalign_result['length'] = pd.to_numeric(eventalign_result['end_idx'])-pd.to_numeric(eventalign_result['start_idx'])
  File "/home/callum/.local/lib/python3.7/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/callum/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'end_idx'
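
The failing line in dataprep.py computes a per-event length as end_idx minus start_idx. A plain-Python sketch of the same arithmetic shows why the lookup fails when those columns are absent. The function name `event_length` and the index values are made up for illustration, and this is not xpore's actual code.

```python
def event_length(row):
    """Mirror of the failing step: length = end_idx - start_idx for one event."""
    return int(row["end_idx"]) - int(row["start_idx"])


# A row with only default eventalign columns (subset shown) has no index fields.
row_without_index = {
    "contig": "ENSMUST00000103626.2",
    "position": "33",
    "event_index": "29",
}

# Hypothetical index values, only to show the arithmetic once the columns exist.
row_with_index = dict(row_without_index, start_idx="100", end_idx="115")

try:
    event_length(row_without_index)
except KeyError as err:
    # Same failure mode as the traceback: the first missing key is reported.
    print("missing column:", err)

print(event_length(row_with_index))
```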

I also have to kill the process as it does not stop by itself.

I think this issue is related to https://github.com/tleonardi/nanocompore/issues/153. In that case they advised adding the --samples flag when preparing the data with nanopolish, so as to output all the necessary headers in the eventalign dataset.

I will try rerunning nanopolish eventalign with the --samples flag.

callumparr commented 3 years ago

Sorry, looking through nanopolish eventalign --help, it seems that it should be --signal-index.
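
For reference, a minimal sketch of the rerun, written as a Python helper that only builds the command line. The helper name `eventalign_command` is hypothetical. It mirrors the arguments from my eventalign command above and adds --signal-index; anything else a particular run needs is not shown here.

```python
import shlex


def eventalign_command(reads, bam, summary, threads=12):
    """Build the nanopolish eventalign argument list with --signal-index added."""
    return [
        "nanopolish", "eventalign",
        "--reads", reads,
        "--bam", bam,
        "--scale-events",
        "--signal-index",  # the flag expected to add start_idx and end_idx
        "--summary", summary,
        "--threads", str(threads),
    ]


cmd = eventalign_command("/path/to/fastq", "/path/to/alignments", "output.txt")

# Print a shell-ready command; stdout is redirected to the eventalign file.
print(shlex.join(cmd) + " > eventalign.txt")
```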

callumparr commented 3 years ago

Ah OK I see there is an issue referencing the same thing:

https://github.com/GoekeLab/xpore/issues/37