cmaclell / concept_formation

Python implementations of TRESTLE, COBWEB/3, and COBWEB
MIT License

Incomplete installation? #52

Closed ThomasHoppe closed 5 years ago

ThomasHoppe commented 7 years ago

After installation with 'pip install -U concept_formation' it seems that some parts are missing, i.e.

visualization-files/ visualize.py

Is this intended?

cmaclell commented 7 years ago

Good question. In what context are you trying to use these files?

@eharpste has been working on building out the visualization capabilities and I probably haven't pushed these changes into the latest pip build.

ThomasHoppe commented 7 years ago

I couldn't find a method for displaying the structure of the tree in a simple form. pretty_print outputs too much information, and I'd like to get a quick overview of the structure (something like a tree structure with the concept_id, number of children, and probability of each node would suffice). I downloaded the visualization manually and got it to run already (you may have noticed my pull request). I find it quite handy for inspecting the tree.

My goal/idea is to extend the software (which, by the way, looks quite good) for a research project with another attribute type (i.e., numerical vectors), and so I am currently evaluating it.

cmaclell commented 7 years ago

Ya, the pretty_print outputs all the attribute value information, which can definitely get verbose pretty quickly.

I'll take a look at the pull request when I get the chance.

Also, can you say a bit more about what a numerical vector attribute looks like? @eharpste and I have been trying to come up with a good set of primitive attribute types and I'd definitely be interested in learning more about attribute types you find useful.

ThomasHoppe commented 7 years ago

Ok, I would like to experiment with clustering in high-dimensional spaces (e.g. 200 or so dimensions). I can assume that each vector component is an i.i.d. normal (identically, independently, normally distributed) float. Of course, I could set up 200 or so numerical attributes, but their handling would be awkward. The math of handling those vectors incrementally corresponds to the usual numerics of Cobweb/3, and getting the probabilities and CU right is possible. I already implemented it with another processing approach and the numerics seem to work.
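
For example, roughly what I have in mind (purely illustrative; the attribute names are made up and this is not concept_formation code):

```python
import numpy as np

# Option 1: 200 separate numeric attributes -- awkward to name and manage.
flat_instance = {f"dim_{i}": float(x)
                 for i, x in enumerate(np.random.normal(size=200))}

# Option 2 (what I'm proposing): one attribute whose value is a whole numeric vector.
vector_instance = {"embedding": np.random.normal(size=200)}
```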

By the way, I have some experience with a classmate's Cobweb/3 implementation in Prolog from 20 years ago :-D (http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/learning/systems/learn_pl/cobweb/0.html)

cmaclell commented 7 years ago

Hmm.. Your case is still a bit unclear to me. How is it different from how Cobweb/3 handles numerics?

Our Cobweb3Tree will handle the numerical attributes incrementally using a modification of the Cobweb/3 approach. It uses Knuth's algorithm for incrementally updating a mean and std at each node (so the exact numerical values don't need to be stored in the nodes as in the CobwebTree). We've also removed the acuity measure by introducing a small amount of gaussian noise to the CU calculation for numerics (for details see http://christopia.net/blog/machine-learning-with-mixed-numeric-and-nominal-data-using-cobweb3). Finally, we introduced online normalization, so numerical attributes are scaled using the mean and std maintained at the root of the tree. This latter modification helps ensure that all attributes are treated equally in the CU calculation.
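
For reference, a minimal sketch of the kind of incremental mean/std update described above (Welford's/Knuth's online algorithm); this illustrates the idea, not the library's actual code:

```python
class IncrementalStats:
    """Running mean and sample std without storing the observed values."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # sample standard deviation; zero until at least two values are seen
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0
```

The online normalization then just scales each incoming value by the stats kept at the root, e.g. `(x - root_stats.mean) / root_stats.std()` (with `root_stats` a hypothetical instance of the class above maintained at the root).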

Also, very cool about your cobweb/3 implementation! I'll have to check it out :)

ThomasHoppe commented 7 years ago

Not different at all, but just more convenient.

Everything from the handling of numerical attributes carries over to vectors. So instead of incrementally updating single means and stds, whole vectors of means and stds are updated. Instead of plain numerical operations, numpy operations on vectors can be used (I suppose handling a large number of numerical attributes with numpy operations can be more efficient, but that depends on your implementation and needs to be measured).
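
Concretely, I imagine something like this vectorized version (a rough sketch under my assumption of fixed-length vectors; the names are made up):

```python
import numpy as np

class IncrementalVectorStats:
    """Welford-style update applied component-wise to a fixed-length vector attribute."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)

    def update(self, vec):
        vec = np.asarray(vec, dtype=float)
        self.n += 1
        delta = vec - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (vec - self.mean)

    def std(self):
        # component-wise sample std; zeros until at least two vectors are seen
        if self.n > 1:
            return np.sqrt(self.m2 / (self.n - 1))
        return np.zeros_like(self.mean)
```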

The point is that instead of handling n numerical attributes and values, just one attribute with n numerical values is used (easier naming). And I think it could make some difference for determining the best update operation (I haven't thought about it in detail yet).

And I still have the idea of maybe moving away from the assumption that the numeric values are normally distributed and allowing a more general t-distribution instead, but that's the future of the future.

cmaclell commented 7 years ago

Ah I see! It would definitely make things more efficient to use vector representations. I've considered trying to store numpy matrices for the probability tables in each node. Then, incorporating an entire instance into a node would consist of a single matrix operation (e.g., adding two matrices of counts) rather than iterating through each attribute value. However, this is one of those things that I just never got around to exploring. Also, things get really tricky in situations where attribute values are missing or where it is unclear what all the attribute values are (you have to wait to observe each attribute-value pair).
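
To sketch what I mean (hypothetical code, assuming the attribute and value vocabularies are fixed and known up front, which is exactly the part that gets tricky):

```python
import numpy as np

attributes = ["color", "shape"]
values = ["red", "blue", "circle", "square"]
a_idx = {a: i for i, a in enumerate(attributes)}
v_idx = {v: j for j, v in enumerate(values)}

def encode(instance):
    """One-hot encode an instance as an attribute-by-value count matrix."""
    counts = np.zeros((len(attributes), len(values)))
    for attr, val in instance.items():
        counts[a_idx[attr], v_idx[val]] = 1
    return counts

node_counts = np.zeros((len(attributes), len(values)))
# Incorporating an instance into a node becomes a single matrix addition.
node_counts += encode({"color": "red", "shape": "circle"})
```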

I do think it would be interesting to use other distributions like t-distributions rather than normal. We currently use an unbiased normal std calculation (https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation), which corrects the std for sample size according to a t-distribution. However, this is still assuming a normal distribution and we don't have any kind of support for other distributions.

cmaclell commented 7 years ago

Oh well. I looked through that wiki link and apparently it doesn't update according to a t-distribution, my bad. Still, we do try to correct for bias due to sample size.
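
For reference, the correction described on that Wikipedia page divides the sample std by the c4 factor; a small sketch (not necessarily exactly how concept_formation implements it):

```python
from math import exp, lgamma, sqrt

def c4(n):
    """c4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), so E[s] = c4(n) * sigma."""
    return sqrt(2.0 / (n - 1)) * exp(lgamma(n / 2.0) - lgamma((n - 1) / 2.0))

def unbiased_std(sample_std, n):
    # dividing the sample std by c4(n) removes the small-sample bias
    return sample_std / c4(n)
```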

ThomasHoppe commented 7 years ago

Well, I wouldn't go so far as to turn all attributes into probability tables (that seems to be a kind of CPD, a conditional probability distribution as described by D. Koller and N. Friedman in Probabilistic Graphical Models), because of the reasons you mentioned. As long as the datatype "numerical vectors" is used, I can be sure all vector components have a value and their number is known in advance.

Concerning the t-distribution, I am inspired by John Kruschke's "Bayesian estimation supersedes the t test", where he argues that a t-distribution is more general if we don't know the distribution in advance and it could be heavy-tailed (actually I assume that's often the case with measurements in certain domains). Since in general we can't assume what the real distribution will look like, assuming a t-distribution will be on the safe side. Of course, we need to figure out whether and how the third parameter $\nu$ (the degrees of freedom, which control how close the t-distribution is to a gaussian) can be derived and managed incrementally. That's the reason why it's the future of the future.
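
Just to illustrate the role of $\nu$ (a throwaway scipy snippet, not part of the library):

```python
import numpy as np
from scipy import stats

x = np.linspace(-5, 5, 11)
for nu in (1, 5, 30):
    # as nu grows, the tails thin out and the t-distribution approaches a normal
    print(f"t(nu={nu}):", np.round(stats.t.pdf(x, df=nu), 4))
print("normal:   ", np.round(stats.norm.pdf(x), 4))
```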

eharpste commented 7 years ago

RE: the original question on this issue, I think it's worth looking at getting a new, cleaned-up version up on pip. @cmaclell and I have been doing a lot of internal development off the master branch for a while and haven't kept up releases to pip as much as we probably should.

cmaclell commented 5 years ago

Created issue #56 to address @ThomasHoppe's suggestion of adding numpy vectors. Also, I think the issue of not having the latest version on PyPI may have already been addressed, but maybe not. I went ahead and created issue #79 to make sure we push another version to PyPI.

With these new issues that pull out the specific changes we want to make, I'm going to close this issue.