GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License

m6anet dataprep error #112

Open eltonjrv opened 1 year ago

eltonjrv commented 1 year ago

Dear developers,

I have already run the full m6anet pipeline (version 2.0.1) successfully on one dRNA-Seq dataset, but I am now getting the following error during the dataprep step on a different dRNA-Seq dataset.

START of error message

Process Consumer-2:
Traceback (most recent call last):
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: nan

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/m6anet/utils/helper.py", line 85, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/m6anet/utils/dataprep_utils.py", line 205, in index
    pos_end += eventalign_result.loc[_index]['line_length'].sum()
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 1418, in __getitem__
    return self._getitem_tuple(key)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 805, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 961, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 1418, in __getitem__
    return self._getitem_tuple(key)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 805, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 929, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 1850, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexing.py", line 160, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/generic.py", line 3729, in xs
    return self[key]
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/frame.py", line 2995, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/nobackup/fbsev/bioinformatics-tools/miniconda3/envs/drna-m6anet/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: nan

END of error message

It successfully generated the eventalign.index file before printing the log above. In the previous steps of the pipeline, nanopolish ran well, as it did with the first (successful) dataset, without any apparent error. I'd appreciate it if someone could shed some light on this.

Many thanks in advance, Elton

chrishendra93 commented 1 year ago

hi Elton,

It seems there is a NaN entry somewhere in your data, hence the KeyError. Can you print the first few lines of your eventalign.txt file and the command you used to run nanopolish eventalign, and, if possible, share your eventalign.index?

Thanks!

eltonjrv commented 1 year ago

Hi Chris,

Thanks for your prompt reply, and sorry for my late response. Long weekend over here in the UK. Below are the first 10 lines from the nanopolish-generated eventalign.txt file:

contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
LINF_010006300-T1 2 GAAGA 1 t 246 98.85 2.903 0.00232 GAAGA 105.36 4.06 -1.46 7289 7296
LINF_010006300-T1 2 GAAGA 1 t 247 103.32 4.256 0.00598 GAAGA 105.36 4.06 -0.46 7271 7289
LINF_010006300-T1 2 GAAGA 1 t 248 110.35 3.562 0.00232 GAAGA 105.36 4.06 1.12 7264 7271
LINF_010006300-T1 2 GAAGA 1 t 249 103.73 4.772 0.01062 GAAGA 105.36 4.06 -0.36 7232 7264
LINF_010006300-T1 3 AAGAT 1 t 250 129.72 6.098 0.00963 AAGAT 124.17 5.87 0.86 7203 7232
LINF_010006300-T1 4 AGATC 1 t 251 137.23 6.729 0.00332 AGATC 134.08 5.10 0.56 7193 7203
LINF_010006300-T1 4 AGATC 1 t 252 125.54 12.031 0.00365 AGATC 134.08 5.10 -1.52 7182 7193
LINF_010006300-T1 5 GATCA 1 t 253 88.64 4.127 0.00465 GATCA 93.45 5.70 -0.77 7168 7182
LINF_010006300-T1 6 ATCAC 1 t 254 77.66 1.720 0.00498 ATCAC 78.57 2.30 -0.36 7153 7168

Attached is the gzipped eventalign.index: eventalign.index.gz

I did a grep -i 'NA' on both files and nothing was retrieved.

Hope you can spot the problem. Thanks again
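A grep for 'NA' only finds literal text. With its default settings, pandas also reads an empty field as NaN, and a line truncated before the read_index column is filled with NaN as well, so neither would show up in that search. The sketch below is one possible way to look for such rows directly. It assumes the file is tab-separated with the header shown above; find_missing_keys is a hypothetical helper written for this thread, not part of m6anet.

```python
import pandas as pd


def find_missing_keys(path, chunksize=1_000_000):
    """Return the rows of an eventalign file whose contig or read_index parses as NaN.

    Hypothetical helper, not an m6anet function. Reads only the two key
    columns, in chunks, so that large files do not have to fit in memory.
    """
    hits = []
    reader = pd.read_csv(path, sep='\t', usecols=['contig', 'read_index'],
                         chunksize=chunksize)
    for chunk in reader:
        # True for every row where at least one of the two key columns is missing
        mask = chunk.isna().any(axis=1)
        if mask.any():
            hits.append(chunk[mask])
    if hits:
        return pd.concat(hits)
    return pd.DataFrame(columns=['contig', 'read_index'])


# Example call (the path is an assumption):
# print(find_missing_keys('eventalign.txt'))
```

The row labels in the result count data rows from 0, so they can be used to locate the offending lines in the file.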

chrishendra93 commented 1 year ago

Hi @eltonjrv, really sorry for my late reply; I have been quite occupied with work the past few weeks. I am not sure yet what is causing this error, since the indexing step should be quite straightforward. I think the only way to troubleshoot it is to print the transcript_id and read_index that trigger the error. The error message does seem to indicate that there is a null value in eventalign.txt. Are you able to install m6anet from GitHub instead, and wrap line 205 in a try/except that reports the offending index, just to check?

from typing import Dict

import pandas as pd


def index(eventalign_result: pd.DataFrame, pos_start: int, out_paths: Dict, locks: Dict):
    r'''
    Function to index the position of a specific read's features within eventalign.txt

        Args:
            eventalign_result (pd.DataFrame): A pd.DataFrame object containing a portion of eventalign.txt to be indexed
            pos_start (int): An index position within eventalign.txt that corresponds to the start of the eventalign_result portion within eventalign.txt file
            out_paths (Dict): A dictionary containing filepath for all the output files produced by the index function
            locks (Dict): A lock object from multiprocessing library that ensures only one process writes to the output file at any given time

        Returns:
            None
    '''
    eventalign_result = eventalign_result.set_index(['contig', 'read_index'])
    pos_end = pos_start
    with locks['index'], open(out_paths['index'], 'a', encoding='utf-8') as f_index:
        # Visit each unique (contig, read_index) key in order of first appearance
        for _index in list(dict.fromkeys(eventalign_result.index)):
            transcript_id, read_index = _index
            try:
                pos_end += eventalign_result.loc[_index]['line_length'].sum()
            except Exception as exc:
                # Debugging aid: report the key that fails, keeping the original error chained
                raise ValueError(_index) from exc
            f_index.write('%s,%d,%d,%d\n' % (transcript_id, read_index, pos_start, pos_end))
            pos_start = pos_end
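
To make the bookkeeping in the function above easier to follow, here is a minimal sketch on toy data, without files or locks. It assumes line_length holds the byte length of each eventalign line and that the chunk starts at byte 50. The helper name index_offsets, the toy values, and the up-front NaN check are illustrative assumptions, not m6anet code.

```python
import pandas as pd


def index_offsets(eventalign_result, pos_start):
    """Return (contig, read_index, start, end) byte ranges, one per read.

    Hypothetical stand-in for the index function: same running-sum logic,
    but it returns the records instead of writing them to a file.
    """
    keys = ['contig', 'read_index']
    # groupby skips rows with a missing key by default, so check for them first
    bad = eventalign_result[eventalign_result[keys].isna().any(axis=1)]
    if not bad.empty:
        raise ValueError('missing contig/read_index in rows: %s' % bad.index.tolist())

    records = []
    pos_end = pos_start
    # sort=False keeps the reads in order of first appearance
    for (contig, read_index), group in eventalign_result.groupby(keys, sort=False):
        pos_end += int(group['line_length'].sum())
        records.append((contig, int(read_index), pos_start, pos_end))
        pos_start = pos_end
    return records


# Toy chunk: two reads on tx1 and one on tx2 (values are made up)
toy = pd.DataFrame({
    'contig': ['tx1', 'tx1', 'tx1', 'tx2'],
    'read_index': [1, 1, 2, 2],
    'line_length': [100, 90, 80, 70],
})
offsets = index_offsets(toy, pos_start=50)
for record in offsets:
    print('%s,%d,%d,%d' % record)
```

Each read's end offset becomes the next read's start offset, which is why a single row with a missing key is enough to stop the whole indexing pass.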