brian-cleary / LatentStrainAnalysis

Partitioning and analysis methods for large, complex sequence datasets
MIT License

Error in run of kmer_lsi.py #7

Open spleonard1 opened 9 years ago

spleonard1 commented 9 years ago

I ran through the test data set without issue, and am using your scripts on a ~100 GB metagenomic data set. Our cluster uses SLURM job submission, so I'm trying to do a dry run on my desktop (Mac) before adapting the scripts. I haven't had any real issues until the "Calculating the SVD (streaming!)" step, when I got the following error/traceback. I don't have any experience with Pyro4, but it looks like it doesn't know how to "serialize" the numpy array. Any ideas? Let me know if I can provide any other helpful information.
python LSA/kmer_lsi.py -i ./hashed_reads/ -o ./cluster_vectors/
2015-11-04 00:31:32,643 : INFO : using distributed version with 5 workers
2015-11-04 00:31:32,643 : INFO : updating model with new documents
2015-11-04 00:31:32,643 : INFO : initializing 5 workers
2015-11-04 00:31:33,112 : INFO : preparing a new chunk of documents
Traceback (most recent call last):
  File "LSA/kmer_lsi.py", line 41, in <module>
    lsi = hashobject.train_kmer_lsi(corpus,num_dims=len(hashobject.path_dict)*4/5,single=singleInstance)
  File "/Users/seanleonard/Desktop/LatentStrainAnalysis/LSA/streaming_eigenhashes.py", line 82, in train_kmer_lsi
    return models.LsiModel(kmer_corpus,num_topics=num_dims,id2word=self.path_dict,distributed=True,chunksize=200000)
  File "/Library/Python/2.7/site-packages/gensim/models/lsimodel.py", line 329, in __init__
    self.add_documents(corpus)
  File "/Library/Python/2.7/site-packages/gensim/models/lsimodel.py", line 382, in add_documents
    self.dispatcher.putjob(job)  # put job into queue; this will eventually block, because the queue has a small finite size
  File "/Library/Python/2.7/site-packages/Pyro4/core.py", line 171, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/Library/Python/2.7/site-packages/Pyro4/core.py", line 394, in _pyroInvoke
    compress=Pyro4.config.COMPRESSION)
  File "/Library/Python/2.7/site-packages/Pyro4/util.py", line 167, in serializeCall
    data = self.dumpsCall(obj, method, vargs, kwargs)
  File "/Library/Python/2.7/site-packages/Pyro4/util.py", line 476, in dumpsCall
    return serpent.dumps((obj, method, vargs, kwargs), module_in_classname=True)
  File "/Library/Python/2.7/site-packages/serpent.py", line 78, in dumps
    return Serializer(indent, set_literals, module_in_classname).serialize(obj)
  File "/Library/Python/2.7/site-packages/serpent.py", line 250, in serialize
    self._serialize(obj, out, 0)
  File "/Library/Python/2.7/site-packages/serpent.py", line 271, in _serialize
    return self.dispatch[t](self, obj, out, level)
  File "/Library/Python/2.7/site-packages/serpent.py", line 369, in ser_builtins_tuple
    serialize(elt, out, level + 1)
  File "/Library/Python/2.7/site-packages/serpent.py", line 271, in _serialize
    return self.dispatch[t](self, obj, out, level)
  File "/Library/Python/2.7/site-packages/serpent.py", line 369, in ser_builtins_tuple
    serialize(elt, out, level + 1)
  File "/Library/Python/2.7/site-packages/serpent.py", line 291, in _serialize
    f(self, obj, out, level)
  File "/Library/Python/2.7/site-packages/serpent.py", line 552, in ser_default_class
    self._serialize(value, out, level)
  File "/Library/Python/2.7/site-packages/serpent.py", line 271, in _serialize
    return self.dispatch[t](self, obj, out, level)
  File "/Library/Python/2.7/site-packages/serpent.py", line 431, in ser_builtins_dict
    serialize(v, out, level + 1)
  File "/Library/Python/2.7/site-packages/serpent.py", line 291, in _serialize
    f(self, obj, out, level)
  File "/Library/Python/2.7/site-packages/serpent.py", line 551, in ser_default_class
    raise TypeError("don't know how to serialize class " + str(obj.__class__) + ". Give it vars() or an appropriate getstate")
TypeError: don't know how to serialize class <type 'numpy.ndarray'>. Give it vars() or an appropriate getstate
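For reference, the failure at the bottom of the traceback can be reproduced outside LSA entirely: serpent, the serializer Pyro4 is using here, has no handler for numpy arrays, while pickle round-trips them fine. A minimal sketch, assuming only numpy and serpent are installed:

import pickle

import numpy as np
import serpent  # the serializer Pyro4 falls back on in the traceback above

arr = np.ones(3)
try:
    serpent.dumps(arr)  # serpent has no handler for numpy.ndarray
except TypeError as err:
    print(err)  # "don't know how to serialize class ..."

# pickle handles numpy arrays natively, which is why forcing Pyro4 onto
# the pickle serializer (discussed later in this thread) works around it
print(pickle.loads(pickle.dumps(arr)))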

Thanks!

brian-cleary commented 8 years ago

Hi,

I apologize for the super slow response - somehow missed this.

Are you still having this problem? If you could send me the output of "ls -l" for hashed_reads/ and cluster_vectors/ that would help me to diagnose the problem. Sometimes a failure in an earlier step could lead to this problem.


brian-cleary commented 8 years ago

Also - you might want to try running the test data using the streaming SVD (locally), just to see if you have the correct environment.


nmb85 commented 8 years ago

Hi, I am having the same issue. The software worked fine with the test data that you provided. Here is the error message:

KmerLSI.err.txt

Here is the worker error log file:

worker1.log.txt

And here is the output of ls -l for hashed_reads/ and cluster_vectors/:

ls_-l_cluster_vectors.txt ls_-l_hashed_reads.txt

Does it have something to do with this? https://pythonhosted.org/Pyro4/tipstricks.html#pyro-and-numpy

Maybe it's something in my environment, like you said above? I had to edit the KmerLSI.py script for SGE:

KmerLSI_Job.q.txt

Thanks again, Brian!

brian-cleary commented 8 years ago

Hm. So it's still not clear to me if it's the environment.

Can you see if you are able to run the distributed version of the SVD on the test data? The "Getting started" walkthrough uses the single-instance version of the SVD, but you should be able to run the test data up to that point, and then try the SVD with a couple of different workers.

This will help us clarify if you can run that portion of the code at all, or if there is maybe something funky in the data that is fed into the SVD.
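For concreteness, here is roughly what a distributed gensim LSI run needs before LsiModel(..., distributed=True) is created: a Pyro4 nameserver, one or more lsi_worker processes, and a single lsi_dispatcher. A sketch only, following gensim's distributed-LSI setup; the sleeps are just to let each process come up:

import subprocess
import time

# Pyro4 nameserver that the gensim workers and dispatcher register with
procs = [subprocess.Popen(["python", "-m", "Pyro4.naming"])]
time.sleep(2)  # give the nameserver a moment to start

# "a couple of different workers" -- each is a separate lsi_worker process
for _ in range(2):
    procs.append(subprocess.Popen(["python", "-m", "gensim.models.lsi_worker"]))

# one dispatcher that hands chunks of the corpus to the workers
procs.append(subprocess.Popen(["python", "-m", "gensim.models.lsi_dispatcher"]))
time.sleep(2)

# with these running, kmer_lsi.py (which calls LsiModel with
# distributed=True) should be able to find the dispatcher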

Also, I note that you're setting up your PATH after starting all Pyro processes... not sure if this is significant (or intentional).


nmb85 commented 8 years ago

Okay, I'll check those: first, setting up the path before starting Pyro4, then running the test data with a few workers. Thanks, Brian!

nmb85 commented 8 years ago

Hi, Brian,

I killed two birds with one stone and exported my PATH before running Pyro4 on the test data (the exact same files that the single-instance version was able to process). Here is the output, which looks pretty much the same. It's still complaining about numpy:

KmerLSI_path1st.err.txt KmerLSI_path1st.out.txt

I am running my large dataset with the single-threaded version of KmerLSI and it is working fine. I'm pretty sure there is something wrong with the multi-processing. How much longer would the single-instance version take to run (5x?)?

nmb85 commented 8 years ago

No problem with the multithreading; that was a red herring. Just trouble with Pyro4.

nmb85 commented 8 years ago

Hi, Brian,

Okay, so I've been able to successfully run a 2nd large dataset (~240 GB) through kmer_lsi.py in single-thread mode, but it takes days. The problem with using Pyro4 still persists with this dataset, though.

Are you sure that this isn't the problem: https://pythonhosted.org/Pyro4/tipstricks.html#pyro-and-numpy

Is there any way I can edit kmer_lsi.py to test if this is the problem?
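The tips page linked above boils down to telling Pyro4 to use pickle instead of the default serpent. One way to test that hypothesis, using Pyro4's standard config API, would be a hypothetical two-line edit near the top of LSA/kmer_lsi.py, with the caveat noted in the comments:

# Hypothetical test edit for the top of LSA/kmer_lsi.py: have Pyro4 send
# and accept pickle, which can serialize numpy arrays (serpent cannot).
import Pyro4
Pyro4.config.SERIALIZER = "pickle"
Pyro4.config.SERIALIZERS_ACCEPTED.add("pickle")

# Caveat: the gensim lsi_dispatcher and lsi_worker run as separate
# processes and will not see this in-process setting; exporting
# PYRO_SERIALIZER and PYRO_SERIALIZERS_ACCEPTED in the job script
# reaches every process at once (see the workaround later in the thread).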

Thanks for your time!

sunitj commented 8 years ago

Just wondering if this was ever solved? It runs successfully on the test dataset provided with the repo, but it produces the same error on my own test dataset.

nmb85 commented 8 years ago

Hi, @sunitj, FWIW I'm afraid I wasn't able to solve it myself, and I haven't heard from Brian Cleary since December. It's pretty clear to me that the LSI script works fine in single-core mode (as run on the test dataset, and also on a large dataset, though that takes a long weekend), but I cannot get it to work on multiple cores/nodes with the Pyro4 library. Pyro4 complains about the numpy format, but I'm not sure exactly how to fix this in kmer_lsi.py.

jmeppley commented 8 years ago

I'm seeing the same issue. As @russianconcussion suggested, I'm simply running it single threaded for now.

jmeppley commented 8 years ago

I've gotten past the serialization error by inserting the following before launching the Pyro4 nameserver:

export PYRO_SERIALIZERS_ACCEPTED=serpent,json,marshal,pickle
export PYRO_SERIALIZER=pickle

I'm now getting this error:

Date: Wed Feb 10 08:47:29 HST 2016
2016-02-10 08:48:03,862 : ERROR : failed to initialize distributed LSI (unknown name: gensim.lsi_dispatcher)
Traceback (most recent call last):
  File "LSA/kmer_lsi.py", line 41, in <module>
    lsi = hashobject.train_kmer_lsi(corpus,num_dims=len(hashobject.path_dict)*4/5,single=singleInstance)
  File "/mnt/lysine/assemblies/CSHLII/eigenomes/LatentStrainAnalysis/LSA/streaming_eigenhashes.py", line 82, in train_kmer_lsi
    return models.LsiModel(kmer_corpus,num_topics=num_dims,id2word=self.path_dict,distributed=True,chunksize=200000)
  File "/opt/virtualenv/eigengenomes/lib/python2.7/site-packages/gensim/models/lsimodel.py", line 326, in __init__
    raise RuntimeError("failed to initialize distributed LSI (%s)" % err)
RuntimeError: failed to initialize distributed LSI (unknown name: gensim.lsi_dispatcher)
Date: Wed Feb 10 08:48:03 HST 2016

This looks like a collision of nameservers and may be unique to my setup. @russianconcussion and @sunitj, let me know if this gets you anywhere.
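For anyone hitting the same collision: the PYRO_NS_HOST/PYRO_NS_PORT variables shown in the next comment pin the nameserver to an explicit address, and the same thing can be done directly with Pyro4's naming API. A sketch, with an arbitrary free port:

# Start a dedicated Pyro4 nameserver on an explicit host and port so it
# cannot collide with a nameserver already running elsewhere on the box.
import Pyro4.naming

# blocks until killed, so run it in its own process; the port is arbitrary
Pyro4.naming.startNSloop(host="localhost", port=65431)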

jmeppley commented 8 years ago

I have it working.

I am not using a distributed cluster, just a monolithic multi-core machine with loads of RAM, but this does enable multithreading.

I generated the stock bash script with python LSFScripts/create_jobs.py -j KmerLSI -i ./ and added the following lines to the generated script LSFScripts/KmerLSI_Job.q just before the first python call:

export PYRO_SERIALIZERS_ACCEPTED=serpent,json,marshal,pickle
export PYRO_SERIALIZER=pickle
export PYRO_NS_HOST=localhost
export PYRO_NS_PORT=65431
export PYRO_HOST=localhost

...and executed the script with bash LSFScripts/KmerLSI_Job.q

I was having problems with address collisions on my server (hence the HOST and PORT settings) and with serialization, as @russianconcussion and @sunitj experienced.

sjspence commented 8 years ago

I encountered the same issue using @russianconcussion's SGE scripts on an Amazon Web Services starcluster. @jmeppley's workaround solved my multi-threading problem, and I also used screen and ran the bash command in the background for convenience. Thanks all!

wichne commented 7 years ago

Does anyone know if @jmeppley's solution will work on a distributed system? If not, how can it be adapted to do so?