baliga-lab / cmonkey2

Python port of cMonkey, a machine-learning based method for clustering
GNU Lesser General Public License v3.0

KeyError running human data #75

Closed LuziaThea closed 6 years ago

LuziaThea commented 6 years ago

Hello,

I am running cmonkey2 on human data. I downloaded the RSAT files for Homo_sapiens_GRCh38 and use them with the --rsat_dir, --rsat_features and --rsat_organism options. My ratio file contains protein expression data with UniProt IDs, and my STRING file also uses UniProt IDs. I downloaded protein_coding.tab and protein_coding_names.tab, and I added the Ensembl transcript to UniProt ID translations to protein_coding_names.tab.

My RSAT directory contains the following files: organism.tab, feature_names.tab (based on protein_coding_names.tab), feature.tab (based on protein_coding.tab), and all contig files.

I get the following error:

python /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/bin/cmonkey2 \
  /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/NCI60_Ratio_log2_uniprot.txt \
  --organism hsa \
  --string /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/String_human_uniprot_complete.txt \
  --rsat_organism Homo_sapiens_GRCh38 \
  --rsat_dir /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/RSAT \
  --rsat_features feature \
  --nooperons \
  --out ./Output_NCI60_17

2018-03-24 22:18:25 INFO checking MEME...
2018-03-24 22:18:26 INFO Input matrix has # rows: 3171, # columns: 59
2018-03-24 22:18:26 INFO # clusters/row: 2
2018-03-24 22:18:26 INFO # clusters/column: 211
2018-03-24 22:18:26 INFO # CLUSTERS: 317
2018-03-24 22:18:26 INFO use operons: 0
2018-03-24 22:18:26 INFO using MEME version 4.10.2
2018-03-24 22:18:28 INFO using RSAT files for 'Homo_sapiens_GRCh38'
2018-03-24 22:18:28 INFO attempting automatic download of operons from Microbes Online
2018-03-24 22:18:28 INFO Loading STRING file at '/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/String_human_uniprot_complete.txt'
2018-03-24 22:18:28 INFO KEGG = 'Homo sapiens (human)' -> RSAT = 'Homo_sapiens_GRCh38'
2018-03-24 22:18:28 INFO Creating networks...
2018-03-24 22:18:28 INFO stringdb.read_edges2()
2018-03-24 22:18:51 INFO Finished loading /nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/staldluz/CMonkey2_data/Human_new/String_human_uniprot_complete.txt
2018-03-24 22:19:10 INFO Processing network 5%
2018-03-24 22:19:11 INFO Processing network 10%
2018-03-24 22:19:13 INFO Processing network 15%
2018-03-24 22:19:14 INFO Processing network 20%
2018-03-24 22:19:15 INFO Processing network 25%
2018-03-24 22:19:16 INFO Processing network 30%
2018-03-24 22:19:17 INFO Processing network 35%
2018-03-24 22:19:18 INFO Processing network 40%
2018-03-24 22:19:19 INFO Processing network 45%
2018-03-24 22:19:20 INFO Processing network 50%
2018-03-24 22:19:21 INFO Processing network 55%
2018-03-24 22:19:22 INFO Processing network 60%
2018-03-24 22:19:23 INFO Processing network 65%
2018-03-24 22:19:24 INFO Processing network 70%
2018-03-24 22:19:25 INFO Processing network 75%
2018-03-24 22:19:26 INFO Processing network 80%
2018-03-24 22:19:27 INFO Processing network 85%
2018-03-24 22:19:28 INFO Processing network 90%
2018-03-24 22:19:29 INFO Processing network 95%
2018-03-24 22:19:30 INFO Processing network 100%
2018-03-24 22:19:30 WARNING 14444 (out of 18995736) nodes not found in canonical gene names
2018-03-24 22:19:30 INFO stringdb.read_edges2(), 782284 edges read, 8715584 edges ignored
2018-03-24 22:19:33 INFO Finished creating networks.
2018-03-24 22:19:43 ERROR No sequences read for hsa!
Traceback (most recent call last):
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/bin/cmonkey2", line 36, in <module>
    cmonkey_run.run()
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/cmonkey_run.py", line 439, in run
    self.prepare_run()
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/cmonkey_run.py", line 413, in prepare_run
    row_scoring, col_scoring = self.setup_pipeline()
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/cmonkey_run.py", line 366, in setup_pipeline
    for fun in self['pipeline']['row-scoring']['args']['functions']]
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/motif.py", line 474, in __init__
    ratios, 'upstream', config_params)
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/motif.py", line 166, in __init__
    self.__setup_meme_suite(config_params)
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/motif.py", line 134, in __setup_meme_suite
    bgorder=int(self.config_params['MEME']['background_order']))
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/meme.py", line 802, in global_background_file
    seqtype=seqtype)
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/organism.py", line 217, in sequences_for_genes_scan
    return self.sequence_source.seqs_for(genes, self.scan_distances[seqtype])
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/organism.py", line 338, in seqs_for
    return {gene: unique_seqs[head] for gene, head in shifted_pairs}
  File "/nfs/nas21.ethz.ch/nas/fs2102/biol_ibt_usr_s1/bamir/Computation_on_Clusters/Virtual_env_Miniconda_euler/miniconda2/envs/cmonkey/lib/python2.7/site-packages/cmonkey/organism.py", line 338, in <dictcomp>
    return {gene: unique_seqs[head] for gene, head in shifted_pairs}
KeyError: 'ENST00000295971'

ENST00000295971, the transcript named in the error, corresponds to the first protein in the ratio list, and it appears in both feature.tab and feature_names.tab. Since the error reports the correct transcript ID, the ID translation itself seems to work (I also get the same error if I supply the ratio and STRING tables directly with transcript IDs that need no translation). The sequence contig files have the right names and are in the required lowercase format. They often start with long stretches of n’s, but that doesn’t seem to be the problem (the error remains the same if I replace the n’s with a’s).
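For reference, the last frame of the traceback is a plain dictionary lookup, so the KeyError means the ID was resolved but no sequence was stored under it. That is consistent with the "No sequences read for hsa!" line logged just before. Below is a minimal sketch of that failure mode with made-up data. Only the names `unique_seqs` and `shifted_pairs` come from the traceback; the helper `missing_heads`, the placeholder ID and the sequence are hypothetical and are not cmonkey2 code.

```python
# Hypothetical stand-ins for the two names that appear in the traceback.
# The contents are made up for illustration; they are not real cmonkey2 data.
unique_seqs = {"ENST00000000001": "acgtacgt"}  # id -> sequence that was actually read
shifted_pairs = [
    ("ENST00000000001", "ENST00000000001"),
    ("ENST00000295971", "ENST00000295971"),  # id known, but no sequence stored for it
]


def missing_heads(pairs, seqs):
    """Return, in order, the ids that a direct seqs[head] lookup would fail on."""
    return [head for _gene, head in pairs if head not in seqs]


# The same comprehension shape as in the traceback raises on the first missing id.
try:
    {gene: unique_seqs[head] for gene, head in shifted_pairs}
    raised = None
except KeyError as err:
    raised = err.args[0]

print("lookup failed on:", raised)
print("all missing ids:", missing_heads(shifted_pairs, unique_seqs))
```

If a check like this reported every ID as missing, the problem would more likely lie in how the sequences are read from the contig files than in any single transcript.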

Do you know where the error could come from?

Thank you so much for your help!

Best regards, Luzia

weiju commented 6 years ago

Hi Luzia, thanks for your report. From your description, my first guess is that the feature ID in the RSAT features file somehow does not match what was chosen as the left side in the synonyms file. BTW, I can't remember whether I have used rsat_dir and rsat_features in combination, so if you already have an rsat_dir option I would recommend using that and following the directory structure described in

http://baliga-lab.github.io/cmonkey2/input_format.html

That's a guess though; if that does not work, we might have to take a look at your RSAT files. Hopefully that gets us a bit further. Please let me know how it goes for you!
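One way to test the mismatch guess is to compare the first column of the features file with the first column of the synonyms file. The sketch below is only an illustration and makes several assumptions: both files are tab-separated, the ID is in the first column, and comment lines start with `--`. The helper names `first_column` and `compare_ids`, the versioned ID `ENST00000295971.1` and the accession `P12345` are made up for the example.

```python
import io


def first_column(handle, comment_prefix="--"):
    """Collect first-column values, skipping blank lines and comment lines.

    Assumes tab-separated rows and '--' comment lines; adjust if your files differ.
    """
    ids = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith(comment_prefix):
            continue
        ids.add(line.split("\t")[0].strip())
    return ids


def compare_ids(features_handle, names_handle):
    """Report ids that appear in only one of the two files."""
    feature_ids = first_column(features_handle)
    name_ids = first_column(names_handle)
    return {
        "only_in_features": sorted(feature_ids - name_ids),
        "only_in_names": sorted(name_ids - feature_ids),
    }


# Made-up file contents: the ids differ only by a version suffix.
features = io.StringIO("-- hypothetical header\nENST00000295971\tCDS\tP12345\n")
names = io.StringIO("ENST00000295971.1\tP12345\tprimary\n")
result = compare_ids(features, names)
print(result)
```

With real data, the two `io.StringIO` objects would be replaced by `open(...)` on feature.tab and feature_names.tab in the RSAT directory. Any ID listed under only one side would be a candidate for the mismatch.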