KeyError running human data

Hello,

I am running cmonkey2 on human data. I downloaded the RSAT files for Homo_sapiens_GRCh38 and pass them with the --rsat_dir, --rsat_features and --rsat_organism options. My ratio file contains protein expression data keyed by UniProt IDs, and my STRING file also uses UniProt IDs. I downloaded protein_coding.tab and protein_coding_names.tab, and added the Ensembl transcript to UniProt ID translations to protein_coding_names.tab.

My RSAT directory contains the following files:

- organism.tab
- feature_names.tab (based on protein_coding_names.tab)
- feature.tab (based on protein_coding.tab)
- all contig files

This is the command I run, followed by its output and the error:

python /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/bin/cmonkey2 \
  /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/NCI60_Ratio_log2_uniprot.txt \
  --organism hsa \
  --string /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/String_human_uniprot_complete.txt \
  --rsat_organism Homo_sapiens_GRCh38 \
  --rsat_dir /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/RSAT \
  --rsat_features feature \
  --nooperons \
  --out ./Output_NCI60_17

2018-03-24 22:18:25 INFO checking MEME...
2018-03-24 22:18:26 INFO Input matrix has # rows: 3171, # columns: 59
2018-03-24 22:18:26 INFO # clusters/row: 2
2018-03-24 22:18:26 INFO # clusters/column: 211
2018-03-24 22:18:26 INFO # CLUSTERS: 317
2018-03-24 22:18:26 INFO use operons: 0
2018-03-24 22:18:26 INFO using MEME version 4.10.2
2018-03-24 22:18:28 INFO using RSAT files for 'Homo_sapiens_GRCh38'
2018-03-24 22:18:28 INFO attempting automatic download of operons from Microbes Online
2018-03-24 22:18:28 INFO Loading STRING file at '/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/String_human_uniprot_complete.txt'
2018-03-24 22:18:28 INFO KEGG = 'Homo sapiens (human)' -> RSAT = 'Homo_sapiens_GRCh38'
2018-03-24 22:18:28 INFO Creating networks...
2018-03-24 22:18:28 INFO stringdb.read_edges2()
2018-03-24 22:18:51 INFO Finished loading /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/String_human_uniprot_complete.txt
2018-03-24 22:19:10 INFO Processing network 5%
2018-03-24 22:19:11 INFO Processing network 10%
2018-03-24 22:19:13 INFO Processing network 15%
2018-03-24 22:19:14 INFO Processing network 20%
2018-03-24 22:19:15 INFO Processing network 25%
2018-03-24 22:19:16 INFO Processing network 30%
2018-03-24 22:19:17 INFO Processing network 35%
2018-03-24 22:19:18 INFO Processing network 40%
2018-03-24 22:19:19 INFO Processing network 45%
2018-03-24 22:19:20 INFO Processing network 50%
2018-03-24 22:19:21 INFO Processing network 55%
2018-03-24 22:19:22 INFO Processing network 60%
2018-03-24 22:19:23 INFO Processing network 65%
2018-03-24 22:19:24 INFO Processing network 70%
2018-03-24 22:19:25 INFO Processing network 75%
2018-03-24 22:19:26 INFO Processing network 80%
2018-03-24 22:19:27 INFO Processing network 85%
2018-03-24 22:19:28 INFO Processing network 90%
2018-03-24 22:19:29 INFO Processing network 95%
2018-03-24 22:19:30 INFO Processing network 100%
2018-03-24 22:19:30 WARNING 14444 (out of 18995736) nodes not found in canonical gene names
2018-03-24 22:19:30 INFO stringdb.read_edges2(), 782284 edges read, 8715584 edges ignored
2018-03-24 22:19:33 INFO Finished creating networks.
2018-03-24 22:19:43 ERROR No sequences read for hsa!
Traceback (most recent call last):
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/bin/cmonkey2", line 36, in cmonkey_run.run()
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/cmonkey_run.py", line 439, in run self.prepare_run()
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/cmonkey_run.py", line 413, in prepare_run row_scoring, col_scoring = self.setup_pipeline()
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/cmonkey_run.py", line 366, in setup_pipeline for fun in self['pipeline']['row-scoring']['args']['functions']]
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/motif.py", line 474, in init ratios, 'upstream', config_params)
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/motif.py", line 166, in init self.__setup_meme_suite(config_params)
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/motif.py", line 134, in __setup_meme_suite bgorder=int(self.config_params['MEME']['background_order']))
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/meme.py", line 802, in global_background_file seqtype=seqtype)
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/organism.py", line 217, in sequences_for_genes_scan return self.sequence_source.seqs_for(genes, self.scan_distances[seqtype])
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/organism.py", line 338, in seqs_for return {gene: unique_seqs[head] for gene, head in shifted_pairs}
File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/organism.py", line 338, in return {gene: unique_seqs[head] for gene, head in shifted_pairs}
KeyError: 'ENST00000295971'

ENST00000295971, the transcript named in the error, corresponds to the first protein in the ratio file, and it appears in both feature.tab and feature_names.tab. Since the error reports the correct transcript ID, the ID translation itself seems to work; I also get the same error when I supply the ratio and STRING tables with transcript IDs directly, so that no translation is needed. The contig sequence files have the right names and are in the required lowercase format. They often start with long stretches of n’s, but that does not seem to be the problem: the error stays the same if I replace the n’s with a’s.
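
Because the log reports "No sequences read for hsa!" just before the traceback, one check that might narrow this down is whether every contig name referenced in feature.tab matches a contig for which a sequence is actually available. The following is only a minimal sketch, not part of cmonkey2: the function names are hypothetical, and it assumes a tab-separated feature table with comment lines starting with `--`, the feature ID in the first column and the contig name in the fourth. Adjust the column positions if the file is laid out differently.

```python
import csv
import io


def read_features(handle, id_col=0, contig_col=3):
    """Map feature ID -> contig name from a tab-separated feature table.

    Assumption: comment lines start with '--' and the column positions
    given by id_col / contig_col match the file being checked.
    """
    features = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("--"):
            continue  # skip blank lines and header comments
        if len(row) > max(id_col, contig_col):
            features[row[id_col]] = row[contig_col]
    return features


def missing_contigs(features, available_contigs):
    """Return {feature_id: contig} for features whose contig is not available."""
    return {fid: contig for fid, contig in features.items()
            if contig not in available_contigs}


# Made-up example data; in practice pass open("feature.tab") and the set of
# contig names for which sequence files exist in the RSAT directory.
sample = io.StringIO(
    "-- field 1\tid\n"
    "ENST_A\tmRNA\tGENE_A\tchr1\t100\t900\tD\n"
    "ENST_B\tmRNA\tGENE_B\tCHR2\t50\t700\tR\n"
)
features = read_features(sample)
orphans = missing_contigs(features, available_contigs={"chr1", "chr2"})
print(orphans)
```

In this made-up data the second feature is reported because its contig name differs from the available one only by case; any feature listed this way would have no sequence to look up.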

Do you know where the error could come from?

Thank you so much for your help!

Best regards, Luzia

baliga-lab / cmonkey2

KeyError running human data #75