GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License
103 stars 19 forks source link

Error while trying to run the m6anet-dataprep #1

Closed bhargava-morampalli closed 3 years ago

bhargava-morampalli commented 3 years ago

Hello,

I ran the nanopolish to get the files necessary to run m6anet on my data. I ran the following script.

m6anet-dataprep --eventalign sample1.eventalign.txt \ --summary sample1_summary.txt \ --out_dir /scratch/user/data/m6anet/dataprep --n_processes 4

I get the following error.

Traceback (most recent call last): File "/home/bhargavam/anaconda3/bin/m6anet-dataprep", line 11, in <module> load_entry_point('m6anet==0.0.1', 'console_scripts', 'm6anet-dataprep')() File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/dataprep.py", line 245, in main File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/helper.py", line 67, in is_successful File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/helper.py", line 60, in read_last_line OSError: [Errno 22] Invalid argument

Any insights on what the problem might be and how to resolve this are much appreciated. Thank you very much.

chrishendra93 commented 3 years ago

Hi yes, I think I will work to fix this since this behavior is quite annoying on my end too, but the error is because you already have an empty eventalign.log file on the output directory before running the dataprep function. Removing the .log file should fix this

bhargava-morampalli commented 3 years ago

Thank you for a quick response. Deleting that file removed the error. But, a new error pops up - I have attached below.

`Process Consumer-1: Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc return self._engine.get_loc(key) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run() File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/helper.py", line 50, in run result = self.task_function(*next_task_args,self.locks) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/dataprep.py", line 36, in combine eventalign_result['length'] = pd.to_numeric(eventalign_result['end_idx']) - pd.to_numeric(eventalign_result['start_idx']) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in getitem indexer = self.columns.get_loc(key) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc return self._engine.get_loc(self._maybe_cast_indexer(key)) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx' Process Consumer-2: Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc return self._engine.get_loc(key) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run() File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/helper.py", line 50, in run result = self.task_function(*next_task_args,self.locks) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/dataprep.py", line 36, in combine eventalign_result['length'] = pd.to_numeric(eventalign_result['end_idx']) - pd.to_numeric(eventalign_result['start_idx']) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in getitem indexer = self.columns.get_loc(key) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc return self._engine.get_loc(self._maybe_cast_indexer(key)) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx' Process Consumer-3: Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc return self._engine.get_loc(key) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run() File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/helper.py", line 50, in run result = self.task_function(*next_task_args,self.locks) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/dataprep.py", line 36, in combine eventalign_result['length'] = pd.to_numeric(eventalign_result['end_idx']) - pd.to_numeric(eventalign_result['start_idx']) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in getitem indexer = self.columns.get_loc(key) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc return self._engine.get_loc(self._maybe_cast_indexer(key)) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx' Process Consumer-4: Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc return self._engine.get_loc(key) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/bhargavam/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap self.run() File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/helper.py", line 50, in run result = self.task_function(*next_task_args,self.locks) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/m6anet-0.0.1-py3.7.egg/m6anet/scripts/dataprep.py", line 36, in combine eventalign_result['length'] = pd.to_numeric(eventalign_result['end_idx']) - pd.to_numeric(eventalign_result['start_idx']) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in getitem indexer = self.columns.get_loc(key) File "/home/bhargavam/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc return self._engine.get_loc(self._maybe_cast_indexer(key)) File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'end_idx'`

I am not sure what's causing this. Please let me know if you have any suggestions. I appreciate you responding as I know you might be busy. Thanks.

chrishendra93 commented 3 years ago

It seems that your eventalign file does not have the end_idx and start_idx column. Can you show me the first few rows of your sample1.eventalign.txt along with the column names? Thanks

bhargava-morampalli commented 3 years ago
Screen Shot 2020-09-29 at 1 31 16 PM

Here you go. Thanks.

bhargava-morampalli commented 3 years ago

nanopolish index -d fast5/ reads/reads.fasta minimap2 -ax map-ont -t 8 reference/genomicassembly.fa reads/reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp samtools index reads.sorted.bam nanopolish eventalign \ --reads reads/reads.fasta \ --bam reads.sorted.bam \ --genome reference/genomicassembly.fa \ --summary reads_summary.txt \ --scale-events > reads.eventalign.txt These are the list of commands I used for generating the eventalign file and the summary file For the reference, I used the genome assembly file I normally use for aligning reads - I think this is the one they were referring to in the nanopolish documentation. I hope I am right.

chrishendra93 commented 3 years ago

Okay I think the problem is that you should supply --signal-index to your nanopolish eventalign command. This will produce the start_idx and end_idx column that is needed for m6Anet. So you should run

nanopolish eventalign --reads reads/reads.fasta --bam reads.sorted.bam --genome reference/genomicassembly.fa --summary reads_summary.txt --signal_index --scale-events > reads.eventalign.txt

bhargava-morampalli commented 3 years ago

Thanks Chris. That eliminated the error but another error popped up now. I am attaching the error here and the first few lines of the eventalign file. I understand if you don't have time to respond. This is just so that you are aware of the problem in getting the m6anet running by other people. If I can resolve this, I will write up a detailed guide to get everything running and indicating the errors.

Screen Shot 2020-09-30 at 5 08 19 PM

and these are the first few lines from eventalign output.

Screen Shot 2020-09-30 at 5 07 47 PM

Cheers.

chrishendra93 commented 3 years ago

Hi Bhargava, sorry for the delay in response

I really appreciate you going through with this and thank you for raising this issue. I will try what I can to supply a more detailed documentation on how to run the program including the Nanopolish eventalign step.

On your problem, the error seems to suggest that your eventalign file does not have the 'contig' column which is clearly there in your printout. Can I check with you a couple of things:

Are there any eventalign.hdf5 file in the output folder that is produced? And if so can you check if it is empty or if there are some entries in the hdf output?

Also, can I trouble you with printing some error message from the code? I have added a temporary branch for this repo called m6anet_error_msg. Can you checkout to that branch and run the code again from that specific branch, it will print out the data that is being preprocessed when the error happens so that I can get a clearer idea on why this happened in the first place

Thanks a lot!

Liverworks commented 3 years ago

Hello,

I was trying to run m6anet on the test data and got errors like this:

Process Consumer-2: Traceback (most recent call last): File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/usr/local/lib/python3.8/dist-packages/m6anet-0.0.1-py3.8.egg/m6anet/scripts/helper.py", line 50, in run result = self.task_function(*next_task_args,self.locks) File "/usr/local/lib/python3.8/dist-packages/m6anet-0.0.1-py3.8.egg/m6anet/scripts/dataprep.py", line 69, in combine df2write = df_events_per_read.loc[[tx_id,read_id],:].reset_index() File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 873, in getitem return self._getitem_tuple(key) File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1044, in _getitem_tuple return self._getitem_lowerdim(tup) File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 766, in _getitem_lowerdim return self._getitem_nested_tuple(tup) File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 847, in _getitem_nested_tuple obj = getattr(obj, self.name)._getitem_axis(key, axis=axis) File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis return self._getitem_iterable(key, axis=axis) File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False) File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1240, in _get_listlike_indexer indexer, keyarr = ax._convert_listlike_indexer(key) File "/home/anna/.local/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 2397, in _convert_listlike_indexer raise KeyError(f"{keyarr[mask]} not in index") KeyError: "['decfc830-9eb6-4a10-97fb-d117db06178d'] not in index"

The same errors were produced on other data. What can be the problem? Thank you very much.

jorgcalis commented 3 years ago

Looking forward to using m6Anet. However, I'm experiencing the same problem when trying to run m6anet-dataprep on the test data, or on my own eventalign output files. Did you find a solution for the "['firstReadID'] not in index " problem?

@chrishendra93 Maybe the problem arises here: https://github.com/GoekeLab/m6anet/blob/3171e2186ac7007e6a2dd7cfe25afcd160646864/m6anet/scripts/dataprep.py#L32 This rules out reads aligned to the minus strand? Maybe thats why some reads are missing?

Thanks!

chrishendra93 commented 3 years ago

hi @Liverworks and @jorgcalis , we are currently working on solving some issues with dataprep. Hopefully latest by the end of next week we can release an updated version of the dataprep and m6anet with faster preprocessing step and better model. The issue with the read index seems to be associated with specific version of pandas being installed in the machine. The next iteration of dataprep should fix this issue all together

Also to clarify, @jorgcalis, that particular line filters out positions where the mapped 5-mer in that segment does not match the predicted 5-mer identity based on the current level, or at least that is based on what I understand about Nanopolish eventalign. We believe that filtering this out will help to improve the signals quality

jorgcalis commented 3 years ago

Hi @chrishendra93, thanks for looking into it.

I noticed that I had to add round brackets around tx_id,read_id in the combine function in dataprep.py, like you also had done before. Maybe not all your changes were kept into the m6anet version I downloaded.

I see that you want to compare the reference and model kmers, but I think that you are filtering out minus aligned reads that way. I solved this by mapping to a transcriptome reference and selecting only (+) aligned reads.

Finally, in the inference step I had to adjust this line to not error when all positions in the data loader batch are covered by a single read only: https://github.com/GoekeLab/m6anet/blob/3171e2186ac7007e6a2dd7cfe25afcd160646864/m6anet/utils/model.py#L28 When that happened the indices were stored as lists instead of arrays.

I got m6anet to run with these modifications, and the results look promising. Looking forward to the better model.

Best, Jorg

bhargava-morampalli commented 3 years ago

Hi @chrishendra93 , I was wondering about your progress regarding the update to m6anet. Just following up and excited to try it ASAP 👍 :) Thanks. -Bhargava

chrishendra93 commented 3 years ago

hi @jorgcalis @bhargava-morampalli and @Liverworks , I have created a pre-release, and the master branch has a different model and preprocessing step. I have also updated the quickstart with the necessary instruction on downloading and running the model on the demo data. Let me know if it works

bhargava-morampalli commented 3 years ago

@chrishendra93 Thanks for the heads up. I will try it out. I am re-installing it again but at the end of installation - the message reads - Finished processing dependencies for m6anet==0.0.1 Where as the version on the release page says - 0.1.0. Just making sure I am using the latest version. Sorry for the bother.

chrishendra93 commented 3 years ago

Hi @bhargava-morampalli , if you are installing from master branch then you're running the most updated version. That was just a typo in the setup.py file

chrishendra93 commented 3 years ago

hi @bhargava-morampalli, if there is no more issue from your end, I will close this issue. If you still encounter any problem do let me know by raising another issue

thanks

Christopher