Empty outputs from dataprep

matthew-valentine commented 4 years ago

I have been trying to run dataprep on some human data. I am using xpore version 0.5.4 installed using pip. The problem I am having is that all the files with the data. prefix are empty apart from headers. The eventalign. files are fine, with the eventalign.log showing all the read IDs from the summary file.

I am running the process on a qsub server, and the process is apparently still running, though it has been going for over two weeks now so I'm assuming it is actually just stalled and unable to continue? The error and output files from my qsub submission are both empty so it is difficult to know what is happening. I have run the process on the demo data from your quickstart guide and it finished successfully, so maybe there is something wrong with my data/input files. Any idea of how I can start troubleshooting this?

Thanks, Matthew

ploy-np commented 3 years ago

Hi @matthew-valentine,

That version has bugs. Could you please try xpore 0.5.6?

Apologies for the unstable release.

Best, Ploy

matthew-valentine commented 3 years ago

I installed xpore 0.5.6 and when I run it I start to get reads in the data. files. I now encounter a different error though. It will run, process a handful of genes (maybe 50-60) and then stop processing any more. The process doesn't exit; it just stops adding anything to the files. When I look at the error file I can see it encounters an error, but then doesn't stop. If I then start it back up again using the --resume option I can get another 50-60 genes processed before the same thing happens again.

I've attached an image of the error file. Perhaps you can help me figure out why this is happening for some genes and not others, so I can then get it to stop and process the whole file? I can see it is a problem with getting the genomic coordinates from the t2g_mapping dictionary for those specific IDs. Thinking about it, the default Ensembl version for xpore is v91 right? I was using v99 for my mapping so maybe that is the problem. I'll give it a go specifying that and see if that clears it all up.

[Image: xpore-dataprep-error]

ploy-np commented 3 years ago

Hi @matthew-valentine,

Sorry that I haven't handled these errors in this version; this will certainly be considered for later versions.

As you suspected, this is most likely caused by a mismatched version of the t2g_mapping dictionary. Could you try passing the Ensembl version through --ensembl?

matthew-valentine commented 3 years ago

I gave it a go specifying Ensembl version 99 to match what I used for the mapping. Now dataprep completes "successfully" but finds 0 genes, so I'm not exactly sure what is going wrong.

matthew-valentine commented 3 years ago

I realised I hadn't installed Ensembl 99 using pyensembl, and when I did that it started to run properly. However I am then hit with the same error as before, where it complains about the genomic_coordinate step and stops doing anything (while continuing to run).

ploy-np commented 3 years ago

Hi @matthew-valentine, could you please post the full error message again?

matthew-valentine commented 3 years ago

Of course, here is the full error message when running dataprep with the resume option.

INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/matthew/.cache/pyensembl/GRCh38/ensembl99/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/matthew/.cache/pyensembl/GRCh38/ensembl99/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
Process Consumer-1:
Traceback (most recent call last):
  File "/opt/local/pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/matthew/.local/lib/python3.7/site-packages/xpore-0.5.6-py3.7.egg/xpore/scripts/helper.py", line 110, in run
    result = self.task_function(next_task_args,self.locks)
  File "/home/matthew/.local/lib/python3.7/site-packages/xpore-0.5.6-py3.7.egg/xpore/scripts/dataprep.py", line 317, in preprocess_gene
    genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('ENST00000242784', 879)
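For anyone hitting the same KeyError, here is a minimal sketch of what that kind of lookup does when a (transcript ID, position) pair is absent from the mapping. It is not xpore's code: the toy `t2g_mapping` contents and the helper names `lookup_all` and `missing_keys` are made up for illustration. Listing the missing keys before the strict lookup can show whether only a few transcripts are affected or whole genes.

```python
from operator import itemgetter

# Hypothetical miniature mapping: (transcript_id, transcript_position) -> genomic coordinate.
# The real xpore dictionary is built from the annotation; these entries are invented.
t2g_mapping = {
    ("ENST00000000001", 10): ("1", "ENSG00000000001", 1010),
    ("ENST00000000001", 11): ("1", "ENSG00000000001", 1011),
}


def lookup_all(keys, mapping):
    """Strict lookup: raises KeyError on the first key that is not in mapping."""
    keys = list(keys)
    if len(keys) == 1:
        # itemgetter with a single key returns a bare value, not a tuple.
        return [mapping[keys[0]]]
    return list(itemgetter(*keys)(mapping))


def missing_keys(keys, mapping):
    """Diagnostic lookup: return every key that the mapping does not contain."""
    return [key for key in keys if key not in mapping]


tx_ids = ["ENST00000000001", "ENST00000000001", "ENST00000242784"]
tx_positions = [10, 11, 879]
keys = list(zip(tx_ids, tx_positions))

try:
    lookup_all(keys, t2g_mapping)
except KeyError as err:
    print("KeyError:", err)

print("missing:", missing_keys(keys, t2g_mapping))
```

If the missing keys all belong to transcripts that exist in the annotation under a different version or release, that points at the reference used for mapping rather than at the reads themselves.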

matthew-valentine commented 3 years ago

I think I might have got to the root of the problem. There seems to be a problem with the transcript version numbers from my mapping. I thought that the version numbers were stripped away and only the transcript ID was used, but the Ensembl FASTA files have the transcript IDs as contigs, complete with version number, so I'm guessing the incorrect version numbers mean the IDs can't be matched correctly. I'm running the mapping again with a FASTA file that has the correct version numbers, so hopefully that will do the trick.
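A quick way to check for this kind of mismatch is to compare the versioned transcript IDs used as contig names in the mapping reference against those in the annotation. The sketch below is only an illustration under assumptions: the helper names `strip_version` and `version_mismatches` are hypothetical, the version suffixes in the example are invented, and it says nothing about how xpore itself handles versions.

```python
def strip_version(transcript_id):
    """Drop an Ensembl-style version suffix: 'ENST00000242784.3' -> 'ENST00000242784'."""
    return transcript_id.split(".", 1)[0]


def version_mismatches(alignment_ids, annotation_ids):
    """Return alignment IDs whose unversioned ID is annotated but with a different version."""
    annotated = {}
    for annotation_id in annotation_ids:
        annotated.setdefault(strip_version(annotation_id), set()).add(annotation_id)

    mismatched = []
    for transcript_id in alignment_ids:
        base = strip_version(transcript_id)
        if base in annotated and transcript_id not in annotated[base]:
            mismatched.append(transcript_id)
    return mismatched


# Invented version numbers, for illustration only.
alignment_ids = ["ENST00000242784.2", "ENST00000000001.1"]
annotation_ids = ["ENST00000242784.3", "ENST00000000001.1"]

print(version_mismatches(alignment_ids, annotation_ids))
```

In practice the two ID lists would come from the FASTA headers of the reference used for alignment and from the annotation release passed to dataprep.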

ploy-np commented 3 years ago

Thanks a lot @matthew-valentine for digging deep into the problem. We have not yet decided whether to use the transcript version numbers. We are thinking about the best way to do the gene-to-transcript mapping while keeping it flexible for users. If you have any suggestions, please let me know.

GoekeLab / xpore

Empty outputs from dataprep #32