baliga-lab / cmonkey2

Python port of cMonkey, a machine-learning based method for clustering
GNU Lesser General Public License v3.0

Errors in attempted mouse and human runs #47

Closed vectorborne5 closed 6 years ago

vectorborne5 commented 9 years ago

I recently attempted two separate runs using microarray data from human and mouse samples.

I collected all protein network data from STRING v9.1 for use as the string files for each species, formatting the files with three columns: Gene1, Gene2, Normalized Score. My ratios files are tab-delimited .tsv expression matrices that use matching Ensembl IDs.

1) After initiating a run using mouse data (with organism ID = mmu), all initial checks appeared to proceed normally, but I eventually received the following error:

cmonkey.util.DocumentNotFound: //rsat01.biologie.ens.fr/rsat//data/genomes/Mus_musculus_EnsEMBL/genome/feature_names.tab

This file indeed does not exist on the RSAT server, but a file 'gene_names.tab' does. Is this a substitute? Can I make any changes to the cmonkey files to account for this?

2) Following the suggestions for a run of human data presented in the readme.md file, and using input file formats similar to those used for the mouse data, again all initial tests seemed to run well until I received this error:

IndexError: list index out of range

I am baffled as to the meaning of this error. What changes must I make to resolve this issue?

weiju commented 9 years ago

Hi,

this section of the Wiki

https://github.com/baliga-lab/cmonkey2/wiki/Input-file-formats

("RSAT mockup directories") describes the files needed for RSAT. We have been running mouse and human, but we usually use customized data provided in an RSAT-like directory structure and specify that data directory with the --rsat_dir option.

The list index out of range message usually hints at missing gene names, but it might be easier to tell with the complete error output provided.
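A minimal sketch for spotting malformed rows in the edge file before a run. It assumes the three-column layout described above (two gene names and a score) with tab-separated fields; the function names are hypothetical and not part of cmonkey2.

```python
import gzip


def find_malformed_edges(lines, n_fields=3):
    """Return (line_number, reason) for rows that are not gene1<TAB>gene2<TAB>score."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        if not row:
            continue  # ignore blank lines
        fields = row.split("\t")
        if len(fields) != n_fields:
            # A stray or missing tab changes the field count
            problems.append(
                (lineno, "expected %d fields, got %d" % (n_fields, len(fields))))
        elif not fields[0].strip() or not fields[1].strip():
            problems.append((lineno, "empty gene name"))
        else:
            try:
                float(fields[2])
            except ValueError:
                problems.append((lineno, "score is not a number"))
    return problems


def check_edge_file(path):
    """Open a plain or gzip-compressed edge file and report malformed rows."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return find_malformed_edges(handle)
```

Calling `check_edge_file` on the edge file would list each suspicious line number with a reason. A header row, if the file has one, would also be reported because its score column is not numeric.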

vectorborne5 commented 9 years ago

I've seemingly fixed the "list index out of range" problem (there was apparently a stray tab somewhere in my string file). However, I now encounter a new error after the program processes a bit further in the run:

Nick3$./cmonkey.py --organism mmu --ratios /home/nick/Desktop/mmu_test.tsv --string /home/nick/Desktop/string/mmu.gz --rsat_dir /home/nick/Desktop/rsat/mmu/ --rsat_organism Mus_musculus_EnsEMBL --rsat_features protein_coding --nooperons
2015-03-10 16:24:00 INFO     checking MEME...
2015-03-10 16:24:01 INFO     Input matrix has # rows: 16634, # columns: 4
2015-03-10 16:24:01 INFO     # clusters/row: 2
2015-03-10 16:24:01 INFO     # clusters/column: 1109
2015-03-10 16:24:01 INFO     # CLUSTERS: 1663
2015-03-10 16:24:01 INFO     use operons: 0
2015-03-10 16:24:01 INFO     using MEME version 4.10.0
2015-03-10 16:24:02 INFO     using RSAT files for 'Mus_musculus_EnsEMBL'
2015-03-10 16:24:02 INFO     attempting automatic download of operons from Microbes Online
2015-03-10 16:24:02 INFO     Loading STRING file at '/home/nick/Desktop/string/mmu.gz'
2015-03-10 16:24:02 INFO     KEGG = 'Mus musculus (house mouse)' -> RSAT = 'Mus_musculus_EnsEMBL'
2015-03-10 16:24:02 INFO     Creating networks...
2015-03-10 16:24:02 INFO     stringdb.read_edges2()
2015-03-10 16:24:27 INFO     Finished loading /home/nick/Desktop/string/mmu.gz
2015-03-10 16:24:28 INFO     Processing network 5%
2015-03-10 16:24:29 INFO     Processing network 10%
2015-03-10 16:24:29 INFO     Processing network 15%
2015-03-10 16:24:30 INFO     Processing network 20%
2015-03-10 16:24:31 INFO     Processing network 25%
2015-03-10 16:24:31 INFO     Processing network 30%
2015-03-10 16:24:32 INFO     Processing network 35%
2015-03-10 16:24:32 INFO     Processing network 40%
2015-03-10 16:24:33 INFO     Processing network 45%
2015-03-10 16:24:33 INFO     Processing network 50%
2015-03-10 16:24:34 INFO     Processing network 55%
2015-03-10 16:24:34 INFO     Processing network 60%
2015-03-10 16:24:35 INFO     Processing network 65%
2015-03-10 16:24:35 INFO     Processing network 70%
2015-03-10 16:24:36 INFO     Processing network 75%
2015-03-10 16:24:36 INFO     Processing network 80%
2015-03-10 16:24:37 INFO     Processing network 85%
2015-03-10 16:24:37 INFO     Processing network 90%
2015-03-10 16:24:38 INFO     Processing network 95%
2015-03-10 16:24:38 INFO     Processing network 100%
2015-03-10 16:24:38 INFO     stringdb.read_edges2(), 94 edges read, 4850754 edges ignored
2015-03-10 16:24:39 INFO     Finished creating networks.
Traceback (most recent call last):
  File "./cmonkey.py", line 36, in <module>
    cmonkey_run.run()
  File "/home/nick/Desktop/cmonkey2/cmonkey/cmonkey_run.py", line 505, in run
    self.prepare_run()
  File "/home/nick/Desktop/cmonkey2/cmonkey/cmonkey_run.py", line 472, in prepare_run
    row_scoring, col_scoring = self.__setup_pipeline()
  File "/home/nick/Desktop/cmonkey2/cmonkey/cmonkey_run.py", line 425, in __setup_pipeline
    for fun in self['pipeline']['row-scoring']['args']['functions']]
  File "/home/nick/Desktop/cmonkey2/cmonkey/cmonkey_run.py", line 206, in membership
    self.__membership = self.__make_membership()
  File "/home/nick/Desktop/cmonkey2/cmonkey/cmonkey_run.py", line 200, in __make_membership
    self.config_params)
  File "/home/nick/Desktop/cmonkey2/cmonkey/membership.py", line 310, in create_membership
    config_params, matrix.row_indexes, matrix.column_indexes)
  File "/home/nick/Desktop/cmonkey2/cmonkey/membership.py", line 78, in __init__
    self.row_membs[self.rowidx[row]][i] = tmp[i]
IndexError: index 15226 is out of bounds for axis 0 with size 15059

I know this has to do with trying to call a non-existent element, but I'm not sure how precisely to modify any of my input or reference files to counteract this problem.
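Two things in the log above may be worth checking, though neither is confirmed as the cause: only 94 edges were read while 4850754 were ignored, and the matrix has 16634 rows while the error reports an axis of size 15059. The sketch below counts duplicated row identifiers in the ratios file and how many of them also appear in the edge file. It assumes both files are tab-delimited and that the ratios file has one header row; all function names are hypothetical, not cmonkey2 APIs.

```python
import csv


def read_ratio_ids(lines):
    """Row identifiers from a tab-delimited expression matrix (first column)."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row of condition names
    return [row[0] for row in reader if row]


def read_edge_ids(lines):
    """Set of gene identifiers found in the first two columns of an edge file."""
    ids = set()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) >= 2:
            ids.update(row[:2])
    return ids


def summarize_ids(ratio_ids, edge_ids):
    """Count rows, unique rows, duplicated IDs, and rows present in the network."""
    seen, duplicates = set(), set()
    for gene in ratio_ids:
        if gene in seen:
            duplicates.add(gene)
        else:
            seen.add(gene)
    return {
        "rows": len(ratio_ids),
        "unique_rows": len(seen),
        "duplicated_ids": sorted(duplicates),
        "rows_in_network": len(seen & edge_ids),
    }
```

Both readers accept any iterable of text lines, so an open file handle (or `gzip.open(path, "rt")` for a compressed edge file) can be passed directly. A very small `rows_in_network` would suggest that the two files use different identifiers.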

vectorborne5 commented 9 years ago

I've decided to go as rudimentary as possible, using input ratios of about 300 familiar genes, many of which share a common expression profile, and/or have common transcriptional motifs. I felt encouraged as the run finally proceeded into iterations, but received the following errors:

The first:

...
2015-03-17 01:51:55 INFO     running meme/mast on cluster 1470, # sequences: 7
2015-03-17 01:51:55 WARNING  there is an exception thrown in MAST: Errors from MEME text parser:
The pspm of motif 1 has an evalue value 1000 which does not match the existing value of 1000.
FATAL: No motifs.
2015-03-17 01:51:55 INFO     running meme/mast on cluster 1276, # sequences: 7
...

The second (which repeats many times over):

...
2015-03-17 04:13:03 ERROR    No sequences read for hsa!
2015-03-17 04:13:03 WARNING  Cluster 2 with 0 genes: no sequences!
...

My commands to initiate the run are as follows (just FYI):

./cmonkey.py --organism hsa --ratios hsa_test.tsv --string hsa.gz --rsat_organism Homo_sapiens_GRCh37 --rsat_base_url http://rsat.sb-roscoff.fr/ --rsat_features protein_coding --nooperons

I've been struggling with this program for days now. Is there a list of specific guidelines for dealing with human runs beyond that which is listed in the README.md? I seem to be stymied at every step.

weiju commented 9 years ago

Hi, sorry for the late reply. Could you send me a link to hsa_test.tsv and hsa.gz? Thanks, Wei-ju

vectorborne5 commented 9 years ago

Both files are available at this link: https://app.box.com/s/3phu6800og1re21ehbe7sq2rd2n0f3hc

weiju commented 9 years ago

Thank you very much. I will have a closer look at it.