Closed · phpeters closed this issue 7 years ago
Hi Philipp
Thank you very much for reporting this issue, which I have fixed! I have released the fixed version (v1.1.5), which you may install using pip or by downloading the source code. Hopefully it will work well now. Please don't hesitate to let me know if the problem persists.
Some technical explanation of the problem: Clust employs three base clustering methods (k-means, hierarchical clustering, and SOMs). Hierarchical clustering consumes a lot of memory and causes a memory error when applied to very large datasets. I have forced Clust to skip it for large datasets, as there is enough input from k-means and SOMs already. This should solve the problem.
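(Editorial note, not Clust's actual code: a back-of-envelope sketch of why hierarchical clustering is the memory hog here. SciPy's `linkage` builds a condensed pairwise-distance vector of n·(n−1)/2 doubles before it even starts clustering, so memory grows quadratically with the number of genes. The function name below is hypothetical.)

```python
def condensed_distance_bytes(n_genes, bytes_per_float=8):
    """Memory needed just for the condensed pairwise-distance
    vector that scipy.cluster.hierarchy.linkage builds: one
    double per pair of genes."""
    n_pairs = n_genes * (n_genes - 1) // 2
    return n_pairs * bytes_per_float

# 30,000 genes are manageable; 70,000 genes need ~20 GB for the
# distances alone, before linkage's own working memory.
print(condensed_distance_bytes(30_000) / 1e9)  # → 3.59988  (GB)
print(condensed_distance_bytes(70_000) / 1e9)  # → 19.59972 (GB)
```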
In the future I plan to incorporate more modern and sophisticated clustering methods (e.g. MCL) within my Clust framework to enrich its inputs and therefore enhance its results. I will make sure memory-efficient methods are chosen for this purpose.
Good luck Basel
Hej Basel,
thanks a lot for the quick reply and fix!
I actually had it run on a machine with 1 TB of memory, and the system log didn't show any sign of memory overflow. But I'll retry with both the new and the old version, monitor it, and come back with some numbers.
Have a nice weekend! Philipp
Hej Basel,
Just to let you know: I monitored the memory usage in the v1.1.4 run, and at the time of the crash only ~200 GB of the 1 TB of RAM were in use.
Philipp
Hey Philipp
Thank you for sharing this information with me. It seems the system somehow detects that it will not be able to handle the job before the memory is actually fully used.
Has v1.1.5 worked well? :)
Best wishes Basel
Hej Basel,
thanks for this info; I'd better try only subsets of my data, though. v1.1.5 worked just fine! I just changed the "largest_DS" variable a bit for my purposes.
Thanks a lot! Philipp
Hej Basel,
thanks for publishing clust! I gave it a try, and for the examples and for a gene set of 30,000 genes it worked just fine out of the box. Great!
However, when I tried a gene set with approx. 70,000 genes (with 11 replicates in total and 5 time points), clust threw an error in step 3, the Bi-CoPaM method. I attached the error log below. Running the command with more than 1 CPU produces a longer but similar error.
Do you have an idea how to get rid of it?
Thanks and best regards! Philipp
```
clust Data/ -n Normalisation.txt -r Replicates.txt -o results/

/===========================================================================\
|                                   Clust                                   |
|   (Optimised consensus clustering of multiple heterogenous datasets)      |
|   Python package version 1.1.4 (2017) Basel Abu-Jamous                    |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 11 May 2017 (14:40:58)                      |
| 1. Reading datasets                                                       |
| 2. Data pre-processing                                                    |
| 3. Seed clusters production (the Bi-CoPaM method)                         |

Traceback (most recent call last):
  File "/software/Clust/clust.py", line 6, in <module>
    main(args)
  File "/software/Clust/clust/main.py", line 100, in main
    args.q3s)
  File "/software/Clust/clust/clustpipeline.py", line 97, in clustpipeline
    ncores=ncores)
  File "/software/Clust/clust/scripts/uncles.py", line 380, in uncles
    (Xloc[l], Ks[ki], Ds[ki], methodsDetailedloc[l], GDMloc[:, l], Ng) for ki in range(NKs))
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/software/Clust/clust/scripts/uncles.py", line 313, in clustDataset
    tmpU = cl.clusterdataset(X, K, D, methods)  # Obtain U's
  File "/software/Clust/clust/scripts/clustering.py", line 24, in clusterdataset
    U[ms] = chc(X, K, methodsloc[ms][1:])
  File "/software/Clust/clust/scripts/clustering.py", line 73, in chc
    Z = sphc.linkage(X, method=linkage_method, metric=distance)
  File "/software/python/Python2.7/lib/python2.7/site-packages/scipy/cluster/hierarchy.py", line 669, in linkage
    int(_cpy_euclid_methods[method]))
  File "scipy/cluster/_hierarchy.pyx", line 740, in scipy.cluster._hierarchy.linkage (scipy/cluster/_hierarchy.c:9172)
  File "scipy/cluster/stringsource", line 1281, in View.MemoryView.memoryview_copy_contents (scipy/cluster/_hierarchy.c:23661)
  File "scipy/cluster/stringsource", line 1237, in View.MemoryView._err_extents (scipy/cluster/_hierarchy.c:23211)
ValueError: got differing extents in dimension 0 (got 336196312 and 2483679960)
```
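(Editorial note, an inference rather than a confirmed SciPy diagnosis: the two "extents" in the `ValueError` differ by exactly 2³¹, which is consistent with the number of pairwise distances overflowing a signed 32-bit integer for roughly 70,480 genes, matching Philipp's "approx. 70,000 genes".)

```python
# The condensed distance array for n genes has n*(n-1)/2 entries.
n = 70_480
n_pairs = n * (n - 1) // 2

print(n_pairs)              # → 2483679960, the second extent in the error
print(n_pairs - 2**31)      # → 336196312, the first extent in the error
print(n_pairs > 2**31 - 1)  # → True: exceeds the signed 32-bit range
```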