GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License
132 stars 22 forks source link

xpore-diffmod hangs on error #53

Closed callumparr closed 3 years ago

callumparr commented 3 years ago

Running through qsub. Error happens and then job hangs without exiting. Similar to other issues which cause errors. No further genes processed for one day so killed it. Restarted with -resume but unsure if this also affected the model output. Should the number of models output be the same as the number of genes processed? Some genes don't pass the t-test so in this case they don't output a model?

Process Consumer-1:
Traceback (most recent call last):
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'CCGNC'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/helper.py", line 113, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/diffmod.py", line 43, in execute
    kmer_signal = {'mean':model_kmer.loc[kmer,'model_mean'],'std':model_kmer.loc[kmer,'model_stdv']}
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/indexing.py", line 873, in __getitem__
    return self._getitem_tuple(key)
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/indexing.py", line 1044, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/indexing.py", line 786, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/indexing.py", line 1110, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/indexing.py", line 1059, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/generic.py", line 3493, in xs
    loc = self.index.get_loc(key)
  File "/home/callum/miniconda3/envs/xpore/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'CCGNC'
ploy-np commented 3 years ago

Sorry for you inconvenience. The error handling is still in progress.

xpore-diffmod will output all the positions that are differentially modified even slightly.

You are right that some positions of a gene don't pass the t-test, so they will not be in the output. The t-test option is just used to speed up xpore-diffmod, and of course not all differently modified positions are tested, resulting in lower sensitivity. In my paper, I don't use this option to produce the results, but of course the processing time will be longer because there are more positions to be modelled by xPore.

Regarding to your error, can I ask if you run on DNA? Because xPore uses prior information, called model_kmer, saying about the mean and std for each kmer, this information should be consistent with your kmer e.g. 5-mers or 6-mers. The default model_kmer file is stored internally in the package, but this is 5-mer information. In the next version, we will allow users to specify the path to their own model_kmer file in the config file.

callumparr commented 3 years ago

Hey, thanks for the speedy reply.

So I should expect that the no. of hdf5 model files to equal the number of processed genes in the diffmod.log?

It is direct RNA data. It got to about half way through the data set then got this kmer with N.

I tried to resume but same issue. Would there be a way to track down which position and read is causing the issue so I can remove it from data set.

ploy-np commented 3 years ago

Yes, if you use --save_models, you should get the number of .hdf5 files equal to the number of processed genes.

xpore-dataprep doesn't check if there is "N" in the kmer. So, you can remove any invalid kmers from nanopolish output and rerun xpore-dataprep again. If you remove it from data.json, it would require data.index to be modified according, which is difficult. So, I suggest to modify the nanopolish output by removing the entire row containing invalid kmer.

callumparr commented 3 years ago

Sounds good. Although with eventalign file the size of 300Gb that might take a while.

callumparr commented 3 years ago

Here I searched on the reference k-mer in the eventalign.txt file, I think this is correct, and output the line it appears on if it does.

awk -F "\t" '{print $3}' /analysisdata/rawseq/bcl/callum/Mouse_aging/eventalign/Day3_06_DRS/Day3_06_pass_eventalign.txt | fgrep -n "CCGNC"

I cannot find any in the 4 input libraries prepared with data-prep for analysis with diffmod.

Also, nanopolish outputs "NNNNN" in the model k-mer if it thinks the signal is unexpected/weird. I guess this doesn't matter forxpore because there are many entries for this type.

ploy-np commented 3 years ago

xpore-dataprep already removes the "NNNNN" Kmers. But it's strange that you can't find "CCGNC".

callumparr commented 3 years ago

So I know everything worked well when I just ran on one replicate for each condition and not using --genome flag so I will try dataprep step again this time without --genome but more replicates.

I noticed something weird when I ran salmon on some other library where the log output said about removing non ACTG sequences. I could try it on these libraries and see if get similar warning.