baliga-lab / cmonkey2

Python port of cMonkey, a machine-learning based method for clustering
GNU Lesser General Public License v3.0

local rsat files can't be read correctly #80

Closed: hughit32 closed this issue 5 months ago

hughit32 commented 5 months ago

Hello, I have data and genomic information from a fungus that I want to use directly in cMonkey2 without trying to download data. I tried to format the data exactly as specified on the wiki pages, but I'm still getting this error when I try to run:

cmonkey2 --out PPLresults --nostring --nooperons --num_iterations 250 --rsat_dir rsat --rsat_organism rsat/organism.tab --rsat_features feature.tab --organism ppl --synonym_file rsat/feature_names.tab rsat/forCmonkey2.tsv

2024-02-08 17:47:14 INFO checking MEME...
2024-02-08 17:47:15 INFO Input matrix has # rows: 22125, # columns: 101
2024-02-08 17:47:15 INFO # clusters/row: 2
2024-02-08 17:47:15 INFO # clusters/column: 1106
2024-02-08 17:47:15 INFO # CLUSTERS: 2212
2024-02-08 17:47:15 INFO use operons: 0
2024-02-08 17:47:15 INFO using MEME version 5.5.5
2024-02-08 17:47:25 INFO using RSAT files for 'rsat/organism.tab'
2024-02-08 17:47:25 INFO attempting automatic download of operons from Microbes Online
2024-02-08 17:47:25 INFO KEGG = 'Postia placenta Mad-698-R' -> RSAT = 'rsat/organism.tab'
Traceback (most recent call last):
  File "/home/mitc633/.local/bin/cmonkey2", line 37, in <module>
    cmonkey_run.run()
  File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 441, in run
    self.prepare_run()
  File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 407, in prepare_run
    thesaurus = self.organism().thesaurus()
  File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 156, in organism
    self.__organism = self.make_organism()
  File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/cmonkey_run.py", line 259, in make_organism
    synonyms = thesaurus.create_from_delimited_file2(self.config_params['synonym_file'],
  File "/home/mitc633/.local/lib/python3.8/site-packages/cmonkey/thesaurus.py", line 31, in create_from_delimited_file2
    for alternative in line[1].split(';'):
IndexError: list index out of range

It seems like the code that's generating this error is expecting a comma-delimited file. When I change the file to be comma-delimited I can get cMonkey2 to run, but it ultimately fails to match the gene names in my input data to the genomic info. It seems like there is something fundamentally wrong with the way I've set up my files, but I can't figure out what it is.

Here is the top of my feature_names.tab file:

g1.1   g1   primary
g2.1   g2   primary
g3.1   g3   primary
g4.1   g4   primary
g5.1   g5   primary
g6.1   g6   primary
g7.1   g7   primary
g8.1   g8   primary
g9.1   g9   primary
g10.1  g10  primary

And here is the top of my feature.tab file:

id    type  name  contig         strand  start  end
g1.1  gene  g1    scaffold_728   +       1      1576
g2.1  gene  g2    scaffold_728   -       5767   6857
g3.1  gene  g3    scaffold_728   +       7249   8041
g4.1  gene  g4    scaffold_728   +       8432   8867
g5.1  gene  g5    scaffold_728   +       9196   9892
g6.1  gene  g6    scaffold_1638  -       1      466
g7.1  gene  g7    scaffold_1638  +       7105   8032
g8.1  gene  g8    scaffold_29    +       1      1746
g9.1  gene  g9    scaffold_29    -       2754   3898

And here is my organism.tab file:

561896 Eukaryota; Fungi; Agaricomycetes; Polyporales; Fomitopsidaceae; Rhodonia; placenta

All of these are in a folder named 'rsat'. The row names in my gene expression table match the 'name' column in feature.tab.

Any help would be greatly appreciated!
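One quick way to narrow down a name-matching failure like the one described above is to compare the expression row names against each candidate column of the feature table before launching a run. The following is a minimal sketch, not part of cMonkey2: the helper `unmatched_names` is hypothetical, and the sample rows are copied from the feature.tab excerpt in this post (assumed to be tab-delimited).

```python
def unmatched_names(expression_rows, feature_lines, column=0):
    """Return expression row names absent from the chosen feature-table column."""
    known = set()
    for line in feature_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:
            known.add(fields[column])
    return sorted(set(expression_rows) - known)


# Two rows from the feature.tab excerpt above (assumed tab-delimited).
feature_lines = [
    "g1.1\tgene\tg1\tscaffold_728\t+\t1\t1576\n",
    "g2.1\tgene\tg2\tscaffold_728\t-\t5767\t6857\n",
]
# Hypothetical row names from an expression matrix.
expression_rows = ["g1", "g2", "g3"]

# Compare against the 'name' column (index 2), then the 'id' column (index 0).
print(unmatched_names(expression_rows, feature_lines, column=2))
print(unmatched_names(expression_rows, feature_lines, column=0))
```

If the second call reports every row name as unmatched while the first does not, the expression labels line up with the 'name' column rather than the 'id' column, which is worth knowing when deciding which column the tool should be keyed on.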

weiju commented 5 months ago

Hi, since you already provided the feature_names.tab file in the RSAT folder, I would recommend not explicitly specifying the synonym_file. The idea is that the RSAT synonyms are taken from the RSAT folder instead, whereas the synonym_file switch is meant for a dedicated comma-separated synonym file. Sorry that it is a bit confusing. Hope this helps.
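The IndexError in the traceback is consistent with this explanation. The sketch below is an illustration only, under the assumption that the synonym reader splits each line on commas and then splits the second field on ';'. Only the `line[1].split(';')` expression comes from the traceback; the function `parse_synonyms`, the direction of the mapping, and the sample data are hypothetical and are not cMonkey2's code.

```python
import csv
import io


def parse_synonyms(text, delimiter=","):
    """Map each alternative name to its primary name.

    Hypothetical stand-in for a delimited synonym reader; not cMonkey2's code.
    """
    synonyms = {}
    for line in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not line:
            continue
        primary = line[0]
        # Same expression as in the traceback: raises IndexError when the
        # line was not split into at least two fields.
        for alternative in line[1].split(";"):
            synonyms[alternative] = primary
    return synonyms


# Hypothetical comma-separated synonym file: primary,alt1;alt2
comma_file = "g1,g1.1;g1_alt\ng2,g2.1\n"
# Tab-delimited content shaped like the feature_names.tab excerpt above.
tab_file = "g1.1\tg1\tprimary\ng2.1\tg2\tprimary\n"

print(parse_synonyms(comma_file))

try:
    parse_synonyms(tab_file)
except IndexError as err:
    print("IndexError:", err)
```

A tab-delimited line contains no commas, so it is read as a single field and `line[1]` does not exist, which matches the error reported above.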

hughit32 commented 5 months ago

Thank you for the tip. I have cMonkey2 running now! It took a fair amount of trial and error to get input files recognized properly so that it would run while matching up genes with their appropriate sequences. In addition to the 'Input file formats' section of the wiki page for this project, here are details that I found might be helpful to anyone doing something similar (i.e. providing all inputs directly to cMonkey, rather than downloading).

  1. cMonkey2 adds .tab for the feature details file, and _names.tab for the feature names file. BOTH of these files are specified with the --rsat_features parameter. If I named my feature details file myFeatures.tab and my feature names file myFeatures_names.tab, then I would specify --rsat_features myFeatures. Both of these must be tab-delimited.
  2. The feature details file (very similar to a gff file in content) needs to be specified this way: column1: gene name, column2: feature type, e.g. "gene", column3: alternative gene id, column4: contig name, column5: gene start, column6: gene end, column7: strand, with 'F' for forward strand and 'R' for reverse strand. It is not clear whether columns 2 and 3 are used.
  3. Using a genome-level fasta file does not work; I believe a gene-level fasta file is what is intended for that. It seems to be necessary to provide one file for each contig, as the input format page describes.
  4. For my organism, there is just one set of gene names. The way I got this to run was to include these gene names as BOTH column 1 and 2 in my feature names file (myFeatures_names.tab in the example above). If your genome includes alternative feature names (i.e. your data rows have different labels than feature names in your genome info) you could experiment with putting the original and alternative names in columns 1 and 2, or 2 and 1 respectively. Or if you can clarify this that would be helpful.
  5. Once you set --rsat_dir as a home for genome sequence files and feature data, then everything can go in there, and you don't need to specify paths elsewhere. E.g. --rsat_organism is set with just the name of the file, e.g. organism.tab, rather than the path to the file.

Thanks for your efforts with this software!
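The file layout described in items 1, 2 and 4 can be sketched as a small conversion script. This is a minimal sketch under stated assumptions, not a tested recipe: the helper names are hypothetical, the gene name is reused as the alternative gene id, the third "primary" column in the names file follows the feature_names.tab sample earlier in the thread, and no header row is written because the thread does not establish whether one is accepted.

```python
import csv
import io

# Item 2: strand is written as 'F' (forward) or 'R' (reverse).
STRAND_CODES = {"+": "F", "-": "R"}


def to_feature_row(gene, contig, strand, start, end, feature_type="gene"):
    """Order fields as in item 2: name, type, alt id, contig, start, end, strand."""
    return [gene, feature_type, gene, contig, str(start), str(end), STRAND_CODES[strand]]


def to_names_row(gene):
    """Item 4: same gene name in columns 1 and 2 ('primary' is an assumption)."""
    return [gene, gene, "primary"]


def write_tsv(rows):
    """Serialise rows as tab-delimited text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Sample genes taken from the feature table earlier in this thread.
genes = [
    ("g1", "scaffold_728", "+", 1, 1576),
    ("g2", "scaffold_728", "-", 5767, 6857),
]

features_text = write_tsv(to_feature_row(*g) for g in genes)
names_text = write_tsv(to_names_row(g[0]) for g in genes)

print(features_text)  # would be saved as myFeatures.tab
print(names_text)     # would be saved as myFeatures_names.tab
```

With both files saved inside the folder given to --rsat_dir, items 1 and 5 suggest they would be referenced as --rsat_features myFeatures, with --rsat_organism organism.tab given as a bare file name.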