GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License

run dataprep errors #38

Closed q1134269149 closed 9 months ago

q1134269149 commented 3 years ago

I re-ran nanopolish and started xpore-dataprep, but I got some errors in the log:

2020-10-31 15:19:05,559 - pyensembl.shell - INFO - Running 'install' for EnsemblRelease(release=99, species='homo_sapiens')
2020-10-31 15:19:06,199 - pyensembl.sequence_data - INFO - Loaded sequence dictionary from /home/shihan/.cache/pyensembl/GRCh38/ensembl99/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
2020-10-31 15:19:06,330 - pyensembl.sequence_data - INFO - Loaded sequence dictionary from /home/shihan/.cache/pyensembl/GRCh38/ensembl99/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
2020-10-31 15:19:06,458 - pyensembl.sequence_data - INFO - Loaded sequence dictionary from /home/shihan/.cache/pyensembl/GRCh38/ensembl99/Homo_sapiens.GRCh38.pep.all.fa.gz.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/shihan/.cache/pyensembl/GRCh38/ensembl99/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/shihan/.cache/pyensembl/GRCh38/ensembl99/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
Process Consumer-29:
Traceback (most recent call last):
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/helper.py", line 110, in run
    result = self.task_function(next_task_args,self.locks)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/dataprep.py", line 317, in preprocess_gene
    genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('ENST00000442171', 548)
Process Consumer-22:
Traceback (most recent call last):
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/helper.py", line 110, in run
    result = self.task_function(next_task_args,self.locks)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/dataprep.py", line 317, in preprocess_gene
    genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('ENST00000409020', 1680)
Process Consumer-16:
Traceback (most recent call last):
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/helper.py", line 110, in run
    result = self.task_function(next_task_args,self.locks)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/dataprep.py", line 317, in preprocess_gene
    genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('ENST00000333421', 2614)
......
Process Consumer-23:
Traceback (most recent call last):
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/helper.py", line 110, in run
    result = self.task_function(next_task_args,self.locks)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/xpore-0.5.6-py3.6.egg/xpore/scripts/dataprep.py", line 317, in preprocess_gene
    genomic_coordinate = list(itemgetter(zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
KeyError: ('ENST00000547026', 1824)

In addition, the output directory contains six files:

data.index, data.json, data.log, data.readcount, eventalign.hdf5, eventalign.log

The tail of the eventalign.log file is:

30f80d2d-d599-4d85-aa14-a19f3d50b929
fe2e9cb9-d733-4754-bc25-6994b87b42a3
e0dfb0bf-6329-45aa-a470-4739201bd487
a39a48e9-a266-4b51-8dfa-c9ef0427dcd9
bf7cee54-b558-4fd8-aad5-6c9c64c98f00
69235923-8073-4a16-98be-7821be6754e7
d5f7f1b4-b372-4bbd-b3c6-34adb4297ca4
9578dbd7-6699-4960-a187-4dcdff60af23
e8a029cb-e317-4897-ac88-896d77c9dcc7
--- SUCCESSFULLY FINISHED ---

May I ask whether, with this output, I can still continue to the next step, xpore-diffmod? Thanks, hqin

q1134269149 commented 3 years ago

Moreover, when xpore hits an error, it does not stop and exit automatically. I have to kill all the tasks manually; otherwise the program stays stuck. I wonder if this can be improved in subsequent versions. Thanks, hqin
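A hang like this is typical of a worker pool in which a worker dies before acknowledging its task, so the parent waits forever on the queue. The following is a minimal thread-based sketch of that pattern and one way to avoid it. It is not xpore's helper.py code (which uses processes), and `worker`, `run_tasks`, and `failing_task` are hypothetical names chosen for illustration.

```python
import queue
import threading


def worker(tasks, errors, task_function):
    """Consume tasks until a None sentinel arrives; record failures instead of dying."""
    while True:
        item = tasks.get()
        try:
            if item is None:          # sentinel: no more work for this worker
                return
            task_function(item)
        except Exception as exc:      # keep the worker alive and remember the failure
            errors.put((item, repr(exc)))
        finally:
            tasks.task_done()         # always acknowledge, so tasks.join() can return


def run_tasks(items, task_function, n_workers=2):
    """Run task_function over items and return a list of (item, error) pairs."""
    tasks, errors = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, errors, task_function))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    tasks.join()                      # returns: every get() is matched by a task_done()
    for t in threads:
        t.join()
    failed = []
    while not errors.empty():
        failed.append(errors.get())
    return failed


def failing_task(x):
    """Hypothetical task that fails on odd inputs, standing in for a per-gene job."""
    if x % 2:
        raise KeyError(x)
```

Calling `run_tasks([1, 2, 3, 4], failing_task)` returns after all four items have been attempted and reports the two failures, rather than blocking. The key design choice is the `finally` clause: the acknowledgement happens whether the task succeeds or raises.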

ploy-np commented 3 years ago

Hi @q1134269149,

Sorry for the inconvenience. We plan to handle these errors in a later release as soon as possible.

Regarding the error, could you please check whether you used the same Ensembl release for the transcriptome alignment step? My guess is that t2g_mapping contains the transcript ID but not that position.
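The KeyError values in the log are (transcript ID, position) tuples, which suggests the mapping is keyed by such pairs. As a rough illustration of that failure mode, here is a small sketch with a toy dictionary. The transcript IDs, the values, and the helper names `missing_keys` and `lookup_all` are all hypothetical, and the real t2g_mapping may be structured differently.

```python
from operator import itemgetter

# Hypothetical toy mapping: (transcript_id, transcript_position) -> genomic position.
t2g_mapping = {
    ("ENST_A", 0): 1000,
    ("ENST_A", 1): 1001,
    ("ENST_B", 0): 5000,
}


def missing_keys(tx_ids, tx_positions, mapping):
    """Return the (transcript_id, position) pairs that are absent from mapping."""
    return [key for key in zip(tx_ids, tx_positions) if key not in mapping]


def lookup_all(tx_ids, tx_positions, mapping):
    """Look up every pair at once; raises KeyError on the first missing pair."""
    keys = list(zip(tx_ids, tx_positions))
    if not keys:                       # itemgetter() needs at least one key
        return []
    if len(keys) == 1:                 # with a single key, itemgetter returns a bare value
        return [mapping[keys[0]]]
    return list(itemgetter(*keys)(mapping))
```

With this toy data, `missing_keys(["ENST_A", "ENST_A"], [1, 2], t2g_mapping)` reports `("ENST_A", 2)`: the transcript is known, but the position lies beyond what the annotation covers, which is the situation described above when the alignment reference and the annotation release disagree.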

xpore-diffmod uses the data.json and data.index files. You can run it on the genes that were processed successfully, but it will hit an error again at the end because data.json does not have a proper EOF.

q1134269149 commented 3 years ago

Thank you for your reply. I plan to run xpore on my Arabidopsis thaliana data. However, when I use pyensembl to install the reference with the command "pyensembl install --release 48 --species arabidopsis_thaliana", I get the following error:

Traceback (most recent call last):
  File "/home/shihan/anaconda3/envs/nanopolish/bin/pyensembl", line 33, in <module>
    sys.exit(load_entry_point('pyensembl==1.9.0', 'console_scripts', 'pyensembl')())
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/pyensembl/shell.py", line 245, in run
    genomes = collect_selected_genomes(args)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/pyensembl/shell.py", line 229, in collect_selected_genomes
    return all_combinations_of_ensembl_genomes(args)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/pyensembl/shell.py", line 170, in all_combinations_of_ensembl_genomes
    ensembl_release = EnsemblRelease(version, species=species)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/pyensembl/ensembl_release.py", line 74, in __init__
    release=release, species=species, server=server)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/pyensembl/ensembl_release.py", line 41, in normalize_init_values
    release = check_release_number(release)
  File "/home/shihan/anaconda3/envs/nanopolish/lib/python3.6/site-packages/pyensembl/ensembl_release_versions.py", line 31, in check_release_number
    release, MIN_ENSEMBL_RELEASE, MAX_ENSEMBL_RELEASE))
ValueError: Invalid Ensembl releases 48, must be between 54 and 100

Is there any way to solve this? Thanks
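The error message implies a simple range check on the release number. The sketch below re-creates that check for illustration only; it is not pyensembl's source, the bounds are copied from the message above and apply to the installed version (1.9.0), and whether the real check treats the bounds as inclusive is an assumption.

```python
# Bounds copied from the error message; they are specific to pyensembl 1.9.0.
MIN_ENSEMBL_RELEASE = 54
MAX_ENSEMBL_RELEASE = 100


def check_release_number(release):
    """Illustrative re-creation of the range check, not pyensembl's actual code."""
    release = int(release)
    # Assumption: both bounds are inclusive.
    if not MIN_ENSEMBL_RELEASE <= release <= MAX_ENSEMBL_RELEASE:
        raise ValueError(
            "release %d is outside the supported range %d-%d"
            % (release, MIN_ENSEMBL_RELEASE, MAX_ENSEMBL_RELEASE)
        )
    return release
```

Under these assumptions `check_release_number(99)` passes while `check_release_number(48)` raises, before any species-specific lookup takes place. The number 48 presumably comes from a different release numbering scheme than the one this check expects.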

ploy-np commented 3 years ago

Hi @q1134269149, we are testing a new feature that accepts a GTF file supplied by the user, so you will be able to use any annotation version for the alignment. This feature will come in the next version. I'll keep you posted.
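To make the idea concrete, here is a rough sketch of how a transcript-to-genome position mapping could be derived from the exon records of a GTF file. It is not the xpore implementation; `build_t2g` and the toy GTF lines are hypothetical, and the choice of 0-based transcript positions mapped to 1-based genomic positions is an assumption.

```python
import re
from collections import defaultdict

# Hypothetical two-exon transcripts, one per strand, in GTF format (1-based, inclusive).
EXAMPLE_GTF = "\n".join([
    '1\ttoy\texon\t101\t103\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\ttoy\texon\t201\t202\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '2\ttoy\texon\t301\t302\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
    '2\ttoy\texon\t401\t402\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
])

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def build_t2g(gtf_text):
    """Map (transcript_id, 0-based transcript position) to (chrom, 1-based genomic position)."""
    exons = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        exons[match.group(1)].append((fields[0], int(fields[3]), int(fields[4]), fields[6]))

    mapping = {}
    for tx_id, tx_exons in exons.items():
        strand = tx_exons[0][3]
        # Walk exons in transcription order: ascending on '+', descending on '-'.
        tx_exons.sort(key=lambda e: e[1], reverse=(strand == "-"))
        tx_pos = 0
        for chrom, start, end, _ in tx_exons:
            positions = range(start, end + 1) if strand == "+" else range(end, start - 1, -1)
            for genomic_pos in positions:
                mapping[(tx_id, tx_pos)] = (chrom, genomic_pos)
                tx_pos += 1
    return mapping
```

In this toy example, `build_t2g(EXAMPLE_GTF)` covers positions 0 to 4 of T1 and 0 to 3 of T2. Any read position beyond the annotated transcript length is simply absent from the mapping, which is why the annotation and the alignment reference need to describe the same transcripts.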

q1134269149 commented 3 years ago

Thanks, hqin