Error with xpore-dataprep

lingolingolin commented 4 years ago

I ran into an error when running xpore-dataprep.

Command: xpore-dataprep --eventalign wt1.fastq.evn.aln.tsv --summary wt1.fastq.evn.aln.summary.txt --out_dir wt1_xpore_data_prep

Error: Process Consumer-1:
Traceback (most recent call last):
  File "/nfs/no_backup_isis/enovoa/analysis/hliu/software/XPORE/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'contig'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/enovoa/hliu/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/nfs/no_backup_isis/enovoa/analysis/hliu/software/XPORE/lib/python3.7/site-packages/xpore/scripts/helper.py", line 77, in run
    result = self.task_function(*next_task_args,self.locks)
  File "/nfs/no_backup_isis/enovoa/analysis/hliu/software/XPORE/lib/python3.7/site-packages/xpore/scripts/dataprep.py", line 60, in combine
    eventalign_result['transcript_id'] = [contig.split('.')[0] for contig in eventalign_result['contig']]
  File "/nfs/no_backup_isis/enovoa/analysis/hliu/software/XPORE/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/nfs/no_backup_isis/enovoa/analysis/hliu/software/XPORE/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'contig'

Input:

`==> wt1.fastq.evn.aln.tsv <==`
`contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx`
`chr10 8403 CTATA 1 t 1054 76.79 0.889 0.00232 NNNNN 0.00 0.00 inf 32 39`

`==> wt1.fastq.evn.aln.summary.txt <==`
`read_index read_name fast5_path model_name strand num_events num_steps num_skips num_stays total_duration shift scale drift var`
`1 8c8524ed-fc79-4deb-aafc-d83922ca0b2d wt1/batch_6.fast5 template 1054 171 15 867 6.67 9.819 0.887 0.000 1.797`

I mapped DRS reads to a non-human genome. Could that be what caused the error? Thanks for developing this tool, and thanks in advance for your help.
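A quick first check for this kind of KeyError is whether the eventalign header really contains the columns named in the error. The following is a minimal sketch, assuming a tab-separated file. The helper name `missing_columns` is hypothetical, and the required-column set is taken only from the KeyErrors reported in this thread, not from the xpore source:

```python
import csv
import io

# Columns named in the KeyErrors in this thread (an assumption,
# not xpore's full list of required columns).
REQUIRED = {"contig", "reference_kmer"}

def missing_columns(handle, required=REQUIRED, delimiter="\t"):
    """Return the required column names absent from the header row."""
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return sorted(set(required) - set(header))

# Demo on an in-memory header modelled on the eventalign output above.
demo = io.StringIO(
    "contig\tposition\treference_kmer\tread_index\tstrand\tevent_index\n"
    "chr10\t8403\tCTATA\t1\tt\t1054\n"
)
# An empty list means every required column is present.
print(missing_columns(demo))
```

With a real file, pass `open("wt1.fastq.evn.aln.tsv", newline="")` instead of the in-memory demo. If the list comes back empty, the header itself is probably not the culprit.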

ploy-np commented 4 years ago

I think so. I have already fixed this and updated the software to version 0.5.1, thanks to post #17. I added a --species option to xpore-dataprep so that you can specify your species, using the name that Ensembl uses. Could you please try it and let me know if there are any other errors? Thank you!

lingolingolin commented 4 years ago

Hi @ploy-np , Thanks a lot for your reply.

Does the species have to have an Ensembl annotation? I am working on a species with a custom annotation.

Also, do you think mapping to a reference genome (minimap2 splicing mode) instead of a reference transcriptome will affect xpore?

ploy-np commented 4 years ago

For this version, yes, the species has to be known in the Ensembl annotation. I will consider supporting this in the next release. The thing is, Nanopolish eventalign requires a reference transcriptome to assign signal segments to transcriptomic sequences. So, at the moment, I recommend running xpore-dataprep without --genome so that the model runs on transcriptomic coordinates instead.

lingolingolin commented 4 years ago

Hi @ploy-np , Thanks very much for clarifying my doubts. I am using a genome with very few splicing events, so I treated it as a transcriptome when I ran xpore. :-) I will wait for your update and try it out again. Thanks again for your answers and for developing xpore.

ploy-np commented 4 years ago

Hi @lingolingolin , Another thing you can do is extract the transcriptome sequences from the genome sequence when a GTF is available. Then you can run nanopolish eventalign after aligning to the transcriptome. You can do this in R, but I have also seen tools that you can use, for example: http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread. You can then use xpore-dataprep without --genome.

Hope this helps!
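As a rough sketch of the gffread route: its documentation describes `-w` for writing spliced transcript sequences and `-g` for the genome FASTA. The file names below are placeholders, and the helper `gffread_command` is hypothetical. It only assembles the command so you can inspect it before running anything:

```python
import shlex

def gffread_command(genome_fasta, annotation, out_fasta):
    """Build (but do not run) a gffread call that writes transcript sequences."""
    return ["gffread", "-w", out_fasta, "-g", genome_fasta, annotation]

# Placeholder file names; replace with your own genome and GTF.
cmd = gffread_command("genome.fa", "annotation.gtf", "transcriptome.fa")
print(" ".join(shlex.quote(part) for part in cmd))
# To execute it, assuming gffread is installed and on PATH:
#   import subprocess; subprocess.run(cmd, check=True)
```

The resulting FASTA would then serve as the reference for both the alignment and nanopolish eventalign, so that the contig names in the eventalign output are transcript names.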

lingolingolin commented 4 years ago

Hi @ploy-np , Thanks a lot. That is what I did to generate the transcriptome. I still got the same pandas-related error message: KeyError: 'contig'. However, when I extract the eventalign results for a single gene and process them with xpore-dataprep, it runs flawlessly. So I guess something else must be causing this error; I am working on it. By the way, sometimes it gives another pandas-related error, KeyError: 'reference_kmer', depending on whether I am processing multiple transcripts or a single one. This seems to imply that my input data is not well formatted, but I double-checked it and have not found anything wrong so far.

ploy-np commented 4 years ago

Can you give me the full command that you used? Also, if possible, could you provide a snapshot of the nanopolish eventalign output, including the header?

lingolingolin commented 4 years ago

Hi @ploy-np, The command I used: xpore-dataprep --eventalign transcript1.ko1.aln.tsv --summary ko1_dataprep/ko1.evn.aln.summary.txt --outdir singletranscript_dataprep_ko1

and eventalign output

Let me know if you need more details.

Thanks a lot.

ploy-np commented 4 years ago

Is YDL_248W_mRNA a transcript? So, when you have multiple transcripts, each one has a unique ID, right? Can you give some example names of those transcripts, please? If you'd like to have a short call to discuss this error quickly, I'm happy to do that. I can send you a Zoom meeting link; just let me know your email. Thanks!

ploy-np commented 4 years ago

Do the transcript names contain a '.'?

lingolingolin commented 4 years ago

Hi @ploy-np , thanks a lot. My email is lingolingo0lin@gmail.com. I will be very happy to have a chat with you to fix the problem.

mparker2 commented 4 years ago

Hi @ploy-np, I've noticed a similar problem with Arabidopsis transcripts: the naming convention is e.g. AT1G01010.1, where AT1G01010 is the gene name and 1 is the transcript id. xPore assumes the transcript id is an Ensembl version number and trims it off... I think the assumptions around Ensembl naming schemes are going to break with a lot of other organisms...
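To make the collision concrete, here is a small sketch built around the expression shown in the traceback above (`contig.split('.')[0]`). The function names and the Ensembl-like regular expression are hypothetical illustrations of one possible guard, not the fix that xpore actually adopted:

```python
import re

def trim_naive(contig):
    """Same logic as the traceback line: keep everything before the first '.'."""
    return contig.split(".")[0]

# Hypothetical guard: strip a numeric suffix only from IDs that look like
# Ensembl transcripts (e.g. ENST..., ENSMUST...). The pattern is an assumption.
ENSEMBL_LIKE = re.compile(r"^(ENS[A-Z]*T\d+)\.\d+$")

def trim_if_ensembl(contig):
    """Drop the version suffix for Ensembl-like IDs, leave other names alone."""
    match = ENSEMBL_LIKE.match(contig)
    return match.group(1) if match else contig

for name in ["ENST00000456328.2", "AT1G01010.1", "AT1G01010.2"]:
    print(name, trim_naive(name), trim_if_ensembl(name))
```

With the naive trim, AT1G01010.1 and AT1G01010.2 both collapse to AT1G01010, so two isoforms end up sharing one id. The guarded version leaves non-Ensembl names untouched.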

lingolingolin commented 4 years ago

Hi @mparker2 , In my case, the problem may have been caused by the read indices appearing in a different order in the summary file than in the eventalign file from Nanopolish.
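A sketch of how one might test for this, assuming both files are tab-separated and have a `read_index` column, as in the headers earlier in the thread. The helper names are hypothetical, and what xpore-dataprep actually expects about ordering is not confirmed here:

```python
import csv
import io

def read_indices(handle, column="read_index"):
    """Yield the read_index of every row in a tab-separated file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield int(row[column])

def collapse_runs(indices):
    """Collapse consecutive repeats, so [1, 1, 0, 0, 1] becomes [1, 0, 1]."""
    runs = []
    for idx in indices:
        if not runs or runs[-1] != idx:
            runs.append(idx)
    return runs

def order_report(eventalign, summary):
    """Summarise how read indices are ordered across the two files."""
    ea = collapse_runs(read_indices(eventalign))
    sm = set(read_indices(summary))
    return {
        "eventalign_sorted": ea == sorted(ea),
        "reads_split_into_several_blocks": len(ea) != len(set(ea)),
        "missing_from_summary": sorted(set(ea) - sm),
    }

# Tiny in-memory demo: read 1 appears, then read 0, then read 1 again.
eventalign = io.StringIO("contig\tread_index\nt1\t1\nt1\t1\nt1\t0\nt2\t1\n")
summary = io.StringIO("read_index\tread_name\n0\ta\n1\tb\n")
print(order_report(eventalign, summary))
```

If the eventalign indices are unsorted, or one read's rows are split across several blocks, any step that processes the file in read-index chunks could plausibly end up with incomplete pieces.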

ploy-np commented 4 years ago

Hi @lingolingolin and @mparker2, I've made a new release, xpore-0.5.2, to address these bugs. Hopefully, you can now run xpore on Arabidopsis transcripts using the transcriptome mode in xpore-dataprep. Let me know if you still have problems.

Thank you for your contributions!

lingolingolin commented 4 years ago

Thanks @ploy-np for your effort 👍

ploy-np commented 4 years ago

Hi @lingolingolin ,

I've added a feature to xpore-dataprep that lets users supply their own GTF file to map transcriptome coordinates to the genome. It's in the 'gtf' branch. I'm looking for a dataset to test it on before I merge it into the next release. I would really appreciate it if you could test this new feature on your own dataset with the genome mode in xpore-dataprep.

Thank you very much in advance!

GoekeLab / xpore

Error with xpore-dataprep #18