BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Memory Error in Bi-CoPaM clustering method #4

Closed: phpeters closed this issue 7 years ago

phpeters commented 7 years ago

Hey Basel,

thanks for publishing clust! I gave it a try, and it worked just fine out of the box for the examples and for a gene set of 30,000 genes. Great!

However, when I tried it with a gene set of approximately 70,000 genes (11 replicates in total across 5 time points), clust threw an error in step 3, the Bi-CoPaM method. I have attached the error log below. Running the command with more than 1 CPU produces a longer but similar error.

Do you have an idea how to get rid of it?

Thanks and best regards!
Philipp

```
clust Data/ -n Normalisation.txt -r Replicates.txt -o results/

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.1.4 (2017) Basel Abu-Jamous            |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 11 May 2017 (14:40:58)                      |
| 1. Reading datasets                                                       |
| 2. Data pre-processing                                                    |
| 3. Seed clusters production (the Bi-CoPaM method)                         |
Traceback (most recent call last):
  File "/software/Clust/clust.py", line 6, in <module>
    main(args)
  File "/software/Clust/clust/main.py", line 100, in main
    args.q3s)
  File "/software/Clust/clust/clustpipeline.py", line 97, in clustpipeline
    ncores=ncores)
  File "/software/Clust/clust/scripts/uncles.py", line 380, in uncles
    (Xloc[l], Ks[ki], Ds[ki], methodsDetailedloc[l], GDMloc[:, l], Ng) for ki in range(NKs))
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/software/Clust/clust/scripts/uncles.py", line 313, in clustDataset
    tmpU = cl.clusterdataset(X, K, D, methods)  # Obtain U's
  File "/software/Clust/clust/scripts/clustering.py", line 24, in clusterdataset
    U[ms] = chc(X, K, methodsloc[ms][1:])
  File "/software/Clust/clust/scripts/clustering.py", line 73, in chc
    Z = sphc.linkage(X, method=linkage_method, metric=distance)
  File "/software/python/Python2.7/lib/python2.7/site-packages/scipy/cluster/hierarchy.py", line 669, in linkage
    int(_cpy_euclid_methods[method]))
  File "scipy/cluster/_hierarchy.pyx", line 740, in scipy.cluster._hierarchy.linkage (scipy/cluster/_hierarchy.c:9172)
  File "scipy/cluster/stringsource", line 1281, in View.MemoryView.memoryview_copy_contents (scipy/cluster/_hierarchy.c:23661)
  File "scipy/cluster/stringsource", line 1237, in View.MemoryView._err_extents (scipy/cluster/_hierarchy.c:23211)
ValueError: got differing extents in dimension 0 (got 336196312 and 2483679960)
```
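Side note: the two extents in that final ValueError look consistent with a signed 32-bit overflow rather than plain memory exhaustion. 2483679960 is exactly n*(n-1)/2 for n = 70480 rows (the length of the condensed pairwise-distance matrix scipy builds), and clearing its 32nd bit gives 336196312. A quick check of the arithmetic (illustrative only, not Clust code):

```python
n = 70480                      # close to the ~70,000 genes in this run
condensed = n * (n - 1) // 2   # length of scipy's condensed distance matrix
print(condensed)               # 2483679960 -> the larger extent in the error
print(condensed - 2**31)       # 336196312  -> the smaller extent in the error
```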

BaselAbujamous commented 7 years ago

Hi Philipp

Thank you very much for reporting this issue, which I have fixed! I have released the fixed version (v1.1.5), which you can install using pip or by downloading the source code. Hopefully it will work well now. Please don't hesitate to let me know if the problem persists.

**Some technical explanation of the problem**

Clust employs three base clustering methods (k-means, hierarchical clustering, and SOMs). Hierarchical clustering consumes a lot of memory and causes a memory error when applied to very large datasets. I have forced Clust to skip it for large datasets, as there is already enough input from k-means and SOMs. This should solve the problem.
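To give a rough sense of scale: hierarchical clustering in scipy works on a condensed matrix of all pairwise distances, which grows quadratically with the number of genes. A back-of-the-envelope estimate, assuming float64 distances (the true peak is higher, since linkage also allocates working arrays):

```python
# Memory needed just for the pairwise distances behind
# scipy.cluster.hierarchy.linkage() on n rows.
n = 70000
condensed = n * (n - 1) // 2        # number of pairwise distances
print(condensed)                    # 2449965000 (~2.45 billion)
print(condensed * 8 / 1024.0 ** 3)  # ~18.3 GiB just for the float64 distances
```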

In the future, I plan to incorporate more modern and sophisticated clustering methods (e.g. MCL) into the Clust framework to enrich its inputs and therefore enhance its results. I will make sure memory-efficient methods are chosen for this purpose.

Good luck,
Basel

phpeters commented 7 years ago

Hey Basel,

thanks a lot for the quick reply and fix!

I actually ran it on a machine with 1 TB of memory, and the system log didn't show any sign of memory overflow. But I will retry with both the new and the old versions, monitor it, and come back with some numbers.

Have a nice weekend!
Philipp

phpeters commented 7 years ago

Hey Basel,

Just to let you know: I monitored the memory usage in the v1.1.4 run, and at the time of the crash only ~200 GB of the 1 TB of RAM was in use.

Philipp

BaselAbujamous commented 7 years ago

Hey Philipp

Thank you for sharing this information with me. It seems the system somehow detects that it will not be able to handle the job before the memory is actually fully used.

Has v1.1.5 worked well? :)

Best wishes,
Basel

phpeters commented 7 years ago

Hey Basel,

thanks for this info; I had better try only subsets of my data, though. v1.1.5 worked just fine! I just changed the `largest_DS` variable a bit for my purposes.
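For anyone landing here later, here is a minimal sketch of how such a size guard plausibly works; the name `largest_DS` comes from the comment above, while the cutoff value and surrounding logic are assumptions for illustration, not Clust's actual source:

```python
largest_DS = 50000  # assumed cutoff; raising it re-enables HC for larger data

def base_methods_for(num_genes):
    """Pick base clustering methods, skipping hierarchical clustering (HC)
    when the dataset exceeds the cutoff (illustrative logic only)."""
    methods = ['kmeans', 'som']
    if num_genes <= largest_DS:
        methods.append('hc')  # HC's distance matrix grows as O(n^2)
    return methods

print(base_methods_for(30000))  # ['kmeans', 'som', 'hc']
print(base_methods_for(70000))  # ['kmeans', 'som']
```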

Thanks a lot!
Philipp