GGiecold-zz / ECLAIR

Robust and scalable inference of cell lineages via consensus clustering. Features novel algorithms for the comparison of weighted graphs and unrooted trees.

MIT License

14 stars, 4 forks

Error on Toy data #3

Closed (falexwolf closed this issue 6 years ago)

falexwolf commented 7 years ago

Hi!

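I tried running ECLAIR on the toy data X_krumsiek11.txt (https://github.com/GGiecold/ECLAIR/files/1302335/X_krumsiek11.txt), which stores 640 observations of 11 features. The data forms a clearly tree-like manifold that I would expect ECLAIR to resolve.

[image: image] https://user-images.githubusercontent.com/16916678/30420410-c59bca4a-9939-11e7-8af1-8833c225e7af.png

ECLAIR, though, does not seem able to handle it. Any help with the terminal outputs below is appreciated.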

Cheers, Alex

Terminal output (1): Running the interactive interface produced the error at the very end of the terminal output below.

(py27) Alexs-MacBook-Pro:comparison_eclair alexwolf$ python -m ECLAIR.Build_instance
*****************************************
*****************************************
***             ECLAIR                ***
*****************************************
*****************************************

ECLAIR: provide the path to the file holding the data to be analyzed:
./X_krumsiek11.txt                          

ECLAIR: how may rows count as header in this file? Enter '0' if the file is not adorned by any header:
1

ECLAIR: which column of the data-file holds the names, tags or IDs of its samples? Enter '0' for the 1st column, '1' for the second, etc.:
0

ECLAIR: does this data-set include some time information? [Y/n] 
n

ECLAIR: you may choose to exclude some columns as features. If this option does not apply, simply press 'Enter'. Otherwise, provide a list of numbers:

ECLAIR: please give an estimate of the number of samples in this data-set:
600

ECLAIR: please enter the number of trees that will be bagged into a forest (a value of '50' is recommended):
50

ECLAIR: how many points do you want to sample from the dataset? Please provide a fraction of the total number of cells:
0.8

ECLAIR: choose the clustering algorithm to be applied to each of 50 subsamples from your data-set.
Available methods: affinity propagation (1), DBSCAN (2), hierarchical clustering (3) & k-means (4)

4

ECLAIR: how many centroids to generate for each run of k-means clustering?
n

ECLAIR: invalid entry; please correct by providing a positive integer:
5

ECLAIR: the total number of consensus clusters defaults to the highest number of clusters encountered in each of the 50 independent runs of subsamplings and clusterings. Do you want to provide a value instead? [Y/n]
n

ECLAIR   INFO    2017-09-14 10:33:50: ready to proceed!

ERROR: Density_Sampling: density_sampling: 'desired_samples' has been assigned a value of 512, larger than 494, the number of samples whose local densities are high enough (i.e. excluded are the local densities in the lowest 0.01 percentile).
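
A note on the numbers in this error message: 512 equals 0.8 × 640, that is, the sampling fraction entered above applied to the 640 rows actually in the file (not to the estimate of 600), while 494 is the number of samples that the density-sampling step treats as eligible. The check below only reproduces that arithmetic. That ECLAIR derives 'desired_samples' in exactly this way is an assumption inferred from the log, and the variable names are illustrative.

```python
# Values taken from the transcript above; the names are illustrative only.
n_observations = 640      # rows in X_krumsiek11.txt
sampling_fraction = 0.8   # answer given to the "fraction of cells" prompt
eligible_samples = 494    # limit reported in the error message

# Assumed relation: requested sample count = fraction * number of rows.
desired_samples = int(round(sampling_fraction * n_observations))

# The request is feasible only if it does not exceed the eligible count.
request_is_feasible = desired_samples <= eligible_samples

# Largest fraction that would have stayed within the reported limit.
largest_feasible_fraction = eligible_samples / float(n_observations)

print(desired_samples, request_is_feasible, largest_feasible_fraction)
```

Under that assumption, any fraction up to roughly 0.77 would have stayed within the reported limit for this particular run.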

Terminal output (2): Running the file through the command-line interface produced the following error, even though the file is correctly formatted:

(py27) Alexs-MacBook-Pro:comparison_eclair alexwolf$ python -m ECLAIR.Build_instance X_krumsiek11.txt 
Traceback (most recent call last):
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/site-packages/ECLAIR/Build_instance/__main__.py", line 624, in <module>
    main()
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/site-packages/ECLAIR/Build_instance/__main__.py", line 582, in main
    opts, args = parse_options()
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/site-packages/ECLAIR/Build_instance/__main__.py", line 452, in parse_options
    type = 'list', help = ("Specifies which features to remove, "
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/optparse.py", line 1013, in add_option
    option = self.option_class(*args, **kwargs)
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/optparse.py", line 578, in __init__
    checker(self)
  File "/Users/alexwolf/miniconda3/envs/py27/lib/python2.7/optparse.py", line 661, in _check_type
    raise OptionError("invalid option type: %r" % self.type, self)
optparse.OptionError: option -e/--excluded_columns: invalid option type: 'list'
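
For readers unfamiliar with this message: the standard-library optparse module accepts only a fixed set of option type names ('string', 'int', 'float', 'choice' and a few others), and 'list' is not among them unless a custom Option subclass registers it. The sketch below first reproduces the same exception with a stock parser, then shows one common way to accept a comma-separated list through a callback. The option name mirrors the traceback; `store_int_list` and `build_parser` are hypothetical helpers and not ECLAIR's code.

```python
from optparse import OptionParser, OptionError


def store_int_list(option, opt_str, value, parser):
    """Callback: turn a comma-separated string such as '1,3,5' into [1, 3, 5]."""
    columns = [int(item) for item in value.split(',') if item.strip()]
    setattr(parser.values, option.dest, columns)


def build_parser():
    """Hypothetical parser that accepts a list of column indices."""
    parser = OptionParser(usage='%prog [options] DATA_FILE')
    parser.add_option('-e', '--excluded_columns', dest='excluded_columns',
                      action='callback', callback=store_int_list,
                      type='string', default=None,
                      help='comma-separated indices of columns to exclude')
    return parser


# Reproduce the failure mode from the traceback with a stock parser.
try:
    OptionParser().add_option('-x', '--example', type='list')
    stock_parser_error = None
except OptionError as error:
    stock_parser_error = str(error)

# Parse a sample command line with the callback-based option.
opts, args = build_parser().parse_args(['-e', '1,3,5', 'X_krumsiek11.txt'])
print(stock_parser_error)
print(opts.excluded_columns, args)
```

With `type='string'` on a callback option, optparse consumes one argument and hands it to the callback as `value`, which is why the conversion to integers happens there.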

GGiecold-zz commented 7 years ago

miniconda3 is Python 3 based, which ECLAIR does not support.

Gregory


falexwolf commented 7 years ago

I'm using a Python 2.7 environment (/envs/py27/lib/python2.7/), which works for all other Python 2.7 software and is an established way of dealing with several Python versions and configurations.

Moreover, the first error I reported is a runtime error thrown after a pretty long time. It's evidently not related to an erroneous installation.

Help would be appreciated.

falexwolf commented 7 years ago

Instead of resolving the error, I'd also be happy if you could briefly make a prediction for X_krumsiek11_blobs.txt (https://github.com/GGiecold/ECLAIR/files/1305362/X_krumsiek11_blobs.txt), which is a bit more difficult than the toy data above but is still relatively simple simulated data.

I'd like to include a comparison with your method in a paper we want to submit soon. It would be a pity if this was not possible for presumably very simple reasons.

Thank you, Alex

GGiecold-zz commented 7 years ago

Hi Alex.

I'm willing to help with software-related issues or questions pertaining to consensus clustering, the method underlying ECLAIR. As for comparing with biological datasets, 3 of my co-authors are still in academia whereas I've moved on to a different career.

I noticed a density-sampling error message in the log file you provided yesterday. In my experience, this could usually be solved by a mean-variance normalization of the datasets we've been handling.

Gregory
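
The mean-variance normalization suggested above is usually understood as per-feature standardization: subtract each column's mean and divide by its standard deviation. The following is a minimal sketch of that idea in plain Python; the function name and the toy matrix are illustrative, and whether this matches the exact preprocessing later used in this thread is an assumption.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Return a copy of `rows` in which every column has mean 0 and unit variance.

    Constant columns (zero standard deviation) are only centered, so that
    no division by zero occurs.
    """
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) or 1.0 for column in columns]
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]


# Toy matrix: 3 observations (rows) of 3 features (columns).
toy = [[1.0, 10.0, 5.0],
       [2.0, 20.0, 5.0],
       [3.0, 30.0, 5.0]]
scaled = standardize_columns(toy)
for row in scaled:
    print(row)
```

With NumPy the same operation is commonly written as `(X - X.mean(axis=0)) / X.std(axis=0)`, with the same caveat about constant columns.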


falexwolf commented 7 years ago

Normalizing helps! Thank you! :)

Now it seems that running the code requires installing cairo and the python bindings of cairo from source as it requires the igraph plotting functions to work. As there are no precompiled binaries, this is quite a burden on a Mac and I refrained from going through all this.

Is there a way to run the algorithm without plotting? The interactive parameters don't seem to give me an option for this. I'm fine with getting a numeric adjacency matrix in the end, which I can plot myself.

Sorry to bother you with all this! Still, it's worth putting in some effort so that I can compare the robustness of your method with that of many competitors.

PS: If you ran the small matrix yourself, this would only take you a couple of minutes, I guess. Sara provided me with some parameter choices for running the code, but hasn't yet responded to the problem of producing a result with toy data.

GGiecold-zz commented 7 years ago

Hi Alex.

Glad to hear confirmation that a simple normalization solves the aforementioned issue.

I'll try to find time over the week-end to process your dataset and send you the resulting plots.

All the best,

Gregory


falexwolf commented 7 years ago

Hi Gregory,

OK, I commented out all plotting, commented out the memory measurement that opens the process info file (which is absent on a Mac and on Windows; psutil would give you a platform-independent way of doing this), and manually installed gpmetis. Then it ran through! :)

Results are not super good so far (https://github.com/theislab/graph_abstraction/tree/master/minimal_examples/comparisons), but I'm happy to rerun the code or post your results including a logfile; it would be nice to learn what causes your method trouble. I have an idea, but your explanation is probably better. :)

Kind regards, Alex
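
As an illustration of the platform-independent approach mentioned in this comment, the sketch below queries memory through psutil instead of reading '/proc/meminfo', which exists only on Linux. The helper name `memory_snapshot` is hypothetical, and nothing here is taken from ECLAIR or its dependencies; it only shows the kind of call that works on Linux, macOS and Windows alike.

```python
def memory_snapshot():
    """Return basic memory figures in bytes, or None if psutil is not installed."""
    try:
        import psutil  # third-party package; assumed to be installed separately
    except ImportError:
        return None

    system = psutil.virtual_memory()
    process = psutil.Process()  # the current Python process
    return {
        'total_bytes': system.total,
        'available_bytes': system.available,
        'process_rss_bytes': process.memory_info().rss,
    }


snapshot = memory_snapshot()
print(snapshot)
```

Returning None when psutil is missing keeps the caller in control of the fallback, for example skipping memory-based chunk sizing altogether.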

GGiecold-zz commented 7 years ago

Hi Alex,

Actually memory-management is done via psutil in the latest version of ECLAIR, as present on GitHub. I never could find time or an incentive to update the PyPI package though :-/

I have deployed consensus and ensemble clustering in a variety of contexts. Statistical generalization is usually thereby improved by an appreciable margin.

However, those methods don't address the problem of using an appropriate clustering method and distance metrics in the first place. As such, I have some reservations about the general relevance of k-means and the distributions of topological distances along minimum spanning trees, even though that worked rather well in our paper for the so-called Bendall dataset and comparison with SPADE as a baseline.

Have you tried using DBSCAN? I spent quite some time developing a faster and leaner implementation of it at our former lab, where we used it in a variety of contexts.

Best regards,

Gregory


falexwolf commented 7 years ago

Hi Gregory,

I'm using the latest version from GitHub; but I guess all the '/proc/meminfo' related errors are due to dependencies of eclair (e.g. DBSCAN_multiplex, etc.), which are retrieved from PyPI. I also fixed the bugs in them manually.

I've now tried it using DBSCAN and the results don't look better; maybe I misspecified some of the parameters (the interface asks many questions, see what I answered below). If you want to play around with it in order to produce a sensible result, I'm happy to reproduce it and add it to the discussion I linked above. Otherwise, I'll just leave it like that for now. ;)

Thank you for your help! Alex

PS: Here comes the output for the parameter choices.

*****************************************
*****************************************
***             ECLAIR                ***
*****************************************
*****************************************

ECLAIR: provide the path to the file holding the data to be analyzed:
./X_krumsiek11_scaled.txt

ECLAIR: how may rows count as header in this file? Enter '0' if the file is not adorned by any header:
1

ECLAIR: which column of the data-file holds the names, tags or IDs of its samples? Enter '0' for the 1st column, '1' for the second, etc.:
0

ECLAIR: does this data-set include some time information? [Y/n] 
n

ECLAIR: you may choose to exclude some columns as features. If this option does not apply, simply press 'Enter'. Otherwise, provide a list of numbers:

ECLAIR: please give an estimate of the number of samples in this data-set:
600

ECLAIR: please enter the number of trees that will be bagged into a forest (a value of '50' is recommended):
50

ECLAIR: how many points do you want to sample from the dataset? Please provide a fraction of the total number of cells:
0.8

ECLAIR: choose the clustering algorithm to be applied to each of 50 subsamples from your data-set.
Available methods: affinity propagation (1), DBSCAN (2), hierarchical clustering (3) & k-means (4)

2

ECLAIR: you have chosen to perform a Density-Based Spatial Clustering of Applications with Noise on each of 50 samples from the data-set. Unless you decide to manually enter a value for the radius 'epsilon', this parameter - which determining density reachability - will be determined automatically upon inspection of the distribution of pairwise distances for your data-set and based on a choice of metric you will be asked to provide.

ECLAIR: how many points are needed to form a dense region?

ECLAIR: sorry but 'minPts' must be a positive integer; try again:
50

ECLAIR: do you want to provide a value of the parameter 'epsilon' for DBSCAN? [Y/n] 
If not, as is recommended but might take some time, 'epsilon' will be determined in an adpative way from a 50-distance graph.

ECLAIR: I'll take this as a 'no'.

ECLAIR: do you want to specify epsilon as a particular quantile to a distribution of 50-nearest distances? Please answer by [Y/n]. If not, epsilon will default to the median of that distribution.
n

ECLAIR: metric for calculating the distance between instances in your data-set (default would be 'minkowski'):

ECLAIR: this choice is not recognized. Please pick one from:
['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'kulsinski', 'l1', 'l2', 'mahalanobis', 'manhattan', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sqeuclidean', 'sokalmichener', 'sokalsneath', 'wminkowski', 'yule']
minkowski

ECLAIR: the total number of consensus clusters defaults to the highest number of clusters encountered in each of the 50 independent runs of subsamplings and clusterings. Do you want to provide a value instead? [Y/n]
n
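
The prompts above describe choosing 'epsilon' from a k-distance graph: for every point, take the distance to its k-th nearest neighbour (here k = minPts = 50), then use the median of those distances unless another quantile is requested. A minimal brute-force sketch of that heuristic follows. It is not the code used by ECLAIR or DBSCAN_multiplex; the function names are illustrative and the metric is fixed to Euclidean distance for simplicity.

```python
import math
from statistics import median


def kth_nearest_distances(points, k):
    """For each point, return the Euclidean distance to its k-th nearest neighbour."""
    if not 1 <= k < len(points):
        raise ValueError('k must satisfy 1 <= k < number of points')
    result = []
    for i, point in enumerate(points):
        # Distances to all other points, smallest first (the point itself is skipped).
        distances = sorted(math.dist(point, other)
                           for j, other in enumerate(points) if j != i)
        result.append(distances[k - 1])
    return result


def epsilon_from_k_distances(points, k):
    """Pick epsilon as the median of the k-th nearest-neighbour distances."""
    return median(kth_nearest_distances(points, k))


# Four equally spaced points on a line.
line = [(0.0,), (1.0,), (2.0,), (3.0,)]
eps_k1 = epsilon_from_k_distances(line, 1)
eps_k2 = epsilon_from_k_distances(line, 2)
print(eps_k1, eps_k2)
```

The brute-force loop is quadratic in the number of points, which is acceptable for a few hundred samples such as the subsamples in this transcript; larger data sets would typically use a spatial index instead.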

GGiecold-zz commented 7 years ago

Hi Alex,

Thank you for letting me know.

I've had a look at scanpy's code, the corresponding preprint and the comparisons made on various datasets between existing lineage-reconstruction methods. Good and very thorough job! Feel free to include the comments exchanged in this thread.

With best regards,

Gregory


falexwolf commented 7 years ago

Thank you for your kind words! :) Alex