Hi, since you already provided the feature_names.tab file in the RSAT folder, I would recommend not explicitly specifying the synonym_file. The idea is that the RSAT synonyms are taken from the RSAT folder, whereas the synonym_file switch is meant for a dedicated comma-separated synonym file. Sorry that this is a bit confusing. Hope this helps.
Thank you for the tip. I have cMonkey2 running now! It took a fair amount of trial and error to get the input files recognized properly, so that it would run and match up genes with their appropriate sequences. In addition to the 'Input file formats' section of the wiki page for this project, here are some details that might be helpful to anyone doing something similar (i.e. providing all inputs directly to cMonkey, rather than downloading):
The feature file names must end in .tab for the feature details file, and _names.tab for the feature names file. BOTH of these files are specified with the --rsat_features parameter. If I named my feature details file myFeatures.tab and my feature names file myFeatures_names.tab, then I would specify --rsat_features myFeatures. Both of these must be tab-delimited.
If you use --rsat_dir as a home for genome sequence files and feature data, then everything can go in there, and you don't need to specify paths elsewhere. E.g. --rsat_organism is set with just the name of the file, e.g. organism.tab, rather than the path to the file.
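The naming rules above can be checked before launching a run. The following is a minimal sketch and not part of cMonkey2: check_rsat_layout is a hypothetical helper, and myFeatures is the illustrative prefix from this post. It only tests that the two feature files exist in the RSAT directory and that their lines contain tabs. Skipping lines that start with "--" is an assumption about RSAT-style comment lines.

```python
import os


def check_rsat_layout(rsat_dir, features_prefix):
    """Hypothetical helper (not a cMonkey2 function).

    features_prefix is the value one would pass to --rsat_features,
    e.g. 'myFeatures' for myFeatures.tab and myFeatures_names.tab.
    Returns a list of problem descriptions; an empty list means OK.
    """
    problems = []
    for suffix in (".tab", "_names.tab"):
        path = os.path.join(rsat_dir, features_prefix + suffix)
        if not os.path.isfile(path):
            problems.append("missing file: " + path)
            continue
        with open(path) as infile:
            for lineno, line in enumerate(infile, start=1):
                # Assumption: lines starting with '--' are comments.
                if not line.strip() or line.startswith("--"):
                    continue
                if "\t" not in line:
                    problems.append(
                        "%s line %d is not tab-delimited" % (path, lineno))
                    break
    return problems
```

For example, check_rsat_layout("rsat", "myFeatures") returns an empty list when both files are present and every data line contains a tab.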
Thanks for your efforts with this software!
Hello, I have data and genomic information from a fungus that I want to use directly in cMonkey2 without trying to download data. I tried to format the data exactly as specified on the wiki pages, but I'm still getting this error when I try to run:
cmonkey2 --out PPLresults --nostring --nooperons --num_iterations 250 --rsat_dir rsat --rsat_organism rsat/organism.tab --rsat_features feature.tab --organism ppl --synonym_file rsat/feature_names.tab rsat/forCmonkey2.tsv
2024-02-08 17:47:14 INFO checking MEME...
2024-02-08 17:47:15 INFO Input matrix has # rows: 22125, # columns: 101
2024-02-08 17:47:15 INFO # clusters/row: 2
2024-02-08 17:47:15 INFO # clusters/column: 1106
2024-02-08 17:47:15 INFO # CLUSTERS: 2212
2024-02-08 17:47:15 INFO use operons: 0
2024-02-08 17:47:15 INFO using MEME version 5.5.5
2024-02-08 17:47:25 INFO using RSAT files for 'rsat/organism.tab'
2024-02-08 17:47:25 INFO attempting automatic download of operons from Microbes Online
2024-02-08 17:47:25 INFO KEGG = 'Postia placenta Mad-698-R' -> RSAT = 'rsat/organism.tab'
Traceback (most recent call last):
File "/home/mitc633/.local/bin/cmonkey2", line 37, in
cmonkey_run.run()
File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 441, in run
self.prepare_run()
File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 407, in prepare_run
thesaurus = self.organism().thesaurus()
File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 156, in organism
self.__organism = self.make_organism()
File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 259, in make_organism
synonyms = thesaurus.create_from_delimited_file2(self.config_params['synonym_file'],
File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/thesaurus.py", line 31, in create_from_delimited_file2
for alternative in line[1].split(';'):
IndexError: list index out of range
It seems like the code that's generating this error is expecting a comma-delimited file. When I change the file accordingly I can get cMonkey2 to run, but it ultimately fails to match the gene names in my input data to the genomic info. It seems like there is something fundamentally wrong with the way I've set up my files, but I can't figure out what it is.
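The IndexError is consistent with that guess. As a sketch only (this is not the code in thesaurus.py; the comma separator is an assumption based on synonym_file being described as comma-separated, and parse_synonym_line is a hypothetical stand-in), splitting a tab-delimited row on commas leaves a single field, so index 1 does not exist:

```python
def parse_synonym_line(line, sep=","):
    """Hypothetical stand-in for a 'name<sep>alt1;alt2' synonym parser."""
    fields = line.rstrip("\n").split(sep)
    # fields[1] raises IndexError when sep never occurs in the line
    return fields[0], fields[1].split(";")


tab_row = "g1.1\tg1\tprimary"   # a row like those in feature_names.tab
comma_row = "g1.1,g1;gene1"     # illustrative comma-separated row

print(parse_synonym_line(comma_row))  # ('g1.1', ['g1', 'gene1'])
try:
    parse_synonym_line(tab_row)
except IndexError as err:
    print("IndexError:", err)         # IndexError: list index out of range
```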
Here is the top of my feature_names.tab file:
g1.1   g1   primary
g2.1   g2   primary
g3.1   g3   primary
g4.1   g4   primary
g5.1   g5   primary
g6.1   g6   primary
g7.1   g7   primary
g8.1   g8   primary
g9.1   g9   primary
g10.1  g10  primary
and here is the top of my feature.tab file:
id    type  name  contig         strand  start  end
g1.1  gene  g1    scaffold_728   +       1      1576
g2.1  gene  g2    scaffold_728   -       5767   6857
g3.1  gene  g3    scaffold_728   +       7249   8041
g4.1  gene  g4    scaffold_728   +       8432   8867
g5.1  gene  g5    scaffold_728   +       9196   9892
g6.1  gene  g6    scaffold_1638  -       1      466
g7.1  gene  g7    scaffold_1638  +       7105   8032
g8.1  gene  g8    scaffold_29    +       1      1746
g9.1  gene  g9    scaffold_29    -       2754   3898
and here is my organism.tab file:
561896 Eukaryota; Fungi; Agaricomycetes; Polyporales; Fomitopsidaceae; Rhodonia; placenta
All of these are in a folder named 'rsat'. The row names in my gene expression table match the 'name' column in feature.tab.
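That match can be checked mechanically. A minimal sketch, assuming the feature file is tab-delimited with the header row shown above (including a 'name' column) and that the expression table is tab-delimited with one header row and gene names in the first column. unmatched_genes is a hypothetical helper, not part of cMonkey2:

```python
import csv


def unmatched_genes(expression_path, features_path):
    """Return expression row names absent from the features 'name' column."""
    with open(features_path, newline="") as infile:
        reader = csv.DictReader(infile, delimiter="\t")
        feature_names = {row["name"] for row in reader}
    with open(expression_path, newline="") as infile:
        reader = csv.reader(infile, delimiter="\t")
        next(reader, None)  # skip the header row of condition names
        return [row[0] for row in reader
                if row and row[0] not in feature_names]
```

With the files from the command above, and assuming rsat/forCmonkey2.tsv is the expression table, unmatched_genes("rsat/forCmonkey2.tsv", "rsat/feature.tab") would return an empty list when every row name has a matching feature.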
Any help would be greatly appreciated!