SheffieldML / GPy

Gaussian processes framework in python
BSD 3-Clause "New" or "Revised" License

Deal with non-float inputs (for R-convolution kernels) #65

Closed: beckdaniel closed this issue 12 months ago

beckdaniel commented 11 years ago

Hi all,

I'm trying to implement R-convolution kernels (Haussler 1999) in GPy. These kernels are not defined on [R x R] but rather on sets of discrete structures. They don't expect a pair of reals (or real vectors) as input values but rather a pair of arbitrary structures (strings, trees, or graphs, for example), and they measure the similarity between them.

I'm having trouble implementing them because there is code that expects real values (or vectors) before the kernel is actually applied. For example, the predict function in gp.py normalizes input values. However, no such normalization is defined for strings or trees.

I was able to run a toy version of Tree Kernels (Collins & Duffy 2001, a type of R-convolution kernel) by just commenting out the normalization line in the predict function. However, it only works when I use it "alone", without combining it with other kernels. My input data is composed of trees and also real vectors, so my main goal is to use a kernel that combines (adds) a Tree Kernel (applied to the "tree" feature of my data) and an RBF Kernel (applied to the "real vector" features).

It's not possible to add the two kernels in the regular (non-tensor) way because they are defined on different sets. So, one possible solution would be to tensor add them (the resulting kernel would be defined on something like [R x Tree] x [R x Tree]). But the problem is that if I define a kernel this way, my input data cannot be represented as a regular Numpy array anymore.

One possible solution is to change the code so it can cope with NumPy structured arrays. This is what I'm trying to do at the moment, but I'm having a bit of a hard time adapting all the code (and it is quite hacky so far...). I noticed the "input_slices" parameter and thought I could use it to split my input data into the "tree" part and the "real vector" part, but I can't get slices to work with structured arrays.
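For concreteness, the kind of structured array I have in mind would look something like this (the field names and the bracketed tree encoding are just placeholders, not anything GPy currently supports):

```python
import numpy as np

# Hypothetical mixed input: a real-vector field plus a tree stored as a Python object.
dt = np.dtype([('vec', np.float64, (3,)), ('tree', object)])
X = np.empty(2, dtype=dt)
X[0] = (np.array([0.1, 0.2, 0.3]), '(S (NP dogs) (VP bark))')
X[1] = (np.array([0.4, 0.5, 0.6]), '(S (NP cats) (VP sleep))')

# Positional slices such as X[:, 0:3] no longer apply; fields are accessed by name:
X['vec']   # the "real vector" part, shape (2, 3)
X['tree']  # the "tree" part, an object array
```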

Another point is that R-convolution kernels still output real values as results. So, in theory, you should not need to tensor add the kernels to define one big kernel on [R x Tree] x [R x Tree]; you could instead calculate the "tree part" with a [Tree x Tree] kernel, the "real vector part" with an [R x R] kernel, and sum the results at the end. So, another solution would be to define a way to have "multiple inputs" (X vectors), one for each kernel part.
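A minimal sketch of this "multiple inputs" idea, assuming each sub-kernel exposes a GPy-style .K method over its own input type (this is not existing GPy API for non-float inputs):

```python
def combined_gram(X_vec, X_tree, rbf_kernel, tree_kernel):
    # X_vec: a float array for the RBF part; X_tree: a list of trees for the tree part.
    # Each kernel builds its own Gram matrix and the results are summed at the end.
    return rbf_kernel.K(X_vec) + tree_kernel.K(X_tree)
```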

I'm not sure which one is the best solution, though. And each of them seems to require a large amount of change to the code base (maybe large enough to justify a decision not to implement them?).

mzwiessele commented 9 years ago

We are still aware of this and are looking into including pandas objects in the kernel structure. This has not been done yet, though.

We are aiming at GPy 0.10.0 for this, but no promises!

mzwiessele commented 9 years ago

Did you try putting the tree part in its own dimension and having your kernel work only on that dimension (active_dims=[<TreeDim>])? Then only this kernel should see the tree dimension. What does your tree dimension look like, anyway? What format do you use?
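For instance, something like this (TreeKernel here is a placeholder for a user-defined GPy.kern.Kern subclass; active_dims itself is real GPy API):

```python
import GPy

# Suppose columns 0-1 hold the real-valued features and column 2 holds the tree encoding.
k_rbf = GPy.kern.RBF(input_dim=2, active_dims=[0, 1])
k_tree = TreeKernel(input_dim=1, active_dims=[2])  # hypothetical kernel class
k = k_rbf + k_tree                                 # each part sees only its own columns
```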

beckdaniel commented 9 years ago

Hi @mzwiessele

I roughly remember using active_dims to ensure my kernel would work only on the corresponding dimensions. But even doing that was not enough: I think there was some scaling/normalization code that was called before the kernel was applied, and this code expected floats.

But this was a long time ago; maybe the new master version does not have that code anymore. In fact, I merged the latest master into my tree kernel branch, but I can't recall if this was fixed or not. I will try to check that.

beckdaniel commented 9 years ago

Ok, apparently there is a new master version. I will try to merge with my tree kernel branch and see what happens.

beckdaniel commented 9 years ago

I managed to merge the latest master version and yes, it seems to be working (using active_dims). I haven't had time to do more thorough testing on my branch, but for my part I think this issue can be closed (at least for now).

mzwiessele commented 9 years ago

It would be nice to have the result in GPy, so when you finish the kernel and some tests, please put in a pull request so we can merge it in : )

beckdaniel commented 8 years ago

The kernel is already finished, tests included, but the code is far from elegant and readable...

My main issue is parallelisation. Tree kernels can be quite expensive to calculate, to the point that they surpass the Gram matrix inversion as the speed bottleneck. This can easily be alleviated by calculating the Gram matrix in parallel, but I had a lot of trouble implementing this...

The current implementation is in Cython and uses a "prange" to do things in parallel. The main problem is that the code inside a prange cannot deal with Python objects (since it releases the GIL). So my current solution is to copy all the data structures I need into STL containers before running the kernel calculation. It works, but it is far from the most elegant solution: in fact, the kernel calculation looks much more like C++ than Python/Cython code... There are probably cleverer solutions, but I need to put some thought into it. An alternative I am also considering is to forget about Cython and go straight for pure C++ code plus a Python wrapper.
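For comparison, a pure-Python sketch of the parallel Gram computation using multiprocessing instead of prange (tree_kernel_value is a stand-in for the real pairwise kernel) would look something like this:

```python
import numpy as np
from multiprocessing import Pool

def tree_kernel_value(job):
    # Placeholder pairwise kernel; a real one would compare the two trees.
    i, j, t1, t2 = job
    return i, j, float(len(t1) == len(t2))

def parallel_gram(trees, n_workers=4):
    n = len(trees)
    # Only the upper triangle is computed, since the Gram matrix is symmetric.
    jobs = [(i, j, trees[i], trees[j]) for i in range(n) for j in range(i, n)]
    pool = Pool(n_workers)
    results = pool.map(tree_kernel_value, jobs)
    pool.close()
    pool.join()
    K = np.zeros((n, n))
    for i, j, value in results:
        K[i, j] = K[j, i] = value
    return K
```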

If there is still interest in this, I can do some cleaning and put in a pull request. But I strongly advise keeping this in a separate branch instead of merging it into the master branch (by the way, is that possible on GitHub? A pull request against a new branch?).

Another point worth some thought is how well Tree Kernels (and other structural kernels) actually fit into the toolkit. Even if you manage to merge this into GPy, there is a high risk that it turns into orphan code in the future, especially if none of the core developers knows the code in depth. I remember seeing a discussion about this on the scikit-learn mailing list. A nice alternative is for someone (say, me) to write a separate toolkit (like a suite of Tree Kernels) with GPy wrappers, with the GPy webpage then pointing to that toolkit. This enables potential users of new kernels while not adding any maintenance burden on the core developers.

mikecroucher commented 8 years ago

Hi Daniel

Something I'd like to see in all GPy code is that every accelerated piece of code also has an unaccelerated, pure-python implementation included.

The reason for this is that the pure-Python versions may well be slow, but they tend to be more understandable. I think they have several uses:

- They serve as a useful starting point for other people to try alternative acceleration techniques.
- They serve as a fallback for those who experience compilation problems.
- They can be used to test the accelerated code.
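For example, a test in this last spirit could compare the two implementations directly (all function names here are hypothetical):

```python
import numpy as np

def test_accelerated_matches_pure_python():
    trees = load_example_trees()              # hypothetical test fixture
    K_fast = gram_matrix_cython(trees)        # accelerated implementation
    K_slow = gram_matrix_pure_python(trees)   # slow but readable reference
    assert np.allclose(K_fast, K_slow)
```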

However your code ends up being included in GPy (or otherwise), would you mind including such implementations, please?

Cheers, Mike


beckdaniel commented 8 years ago

Hi Mike

100% agreed. Yes, my branch not only has pure python implementations but also versions using different algorithms =). And yes, some of my tests actually rely on comparing results from the pure python versions to the accelerated ones.

mikecroucher commented 8 years ago

Perfect!

beckdaniel commented 8 years ago

Hi all,

Sorry to bring this back, but it seems that I am again having trouble with non-float inputs... The issue seems to be in the new "paramz" package, which enforces X to be an array of floats:

```
Traceback (most recent call last):
  File "string_kernel_tests.py", line 75, in test_profiling_1
    m = GPy.models.GPRegression(X, labels, kernel=k)
  File "/home/daniel/.virtualenvs/gpy-env/local/lib/python2.7/site-packages/paramz/parameterized.py", line 48, in __call__
    self = super(ParametersChangedMeta, self).__call__(*args, **kw)
  File "/home/daniel/GPy/GPy/models/gp_regression.py", line 36, in __init__
    super(GPRegression, self).__init__(X, Y, kernel, likelihood, name='GP regression', Y_metadata=Y_metadata, normalizer=normalizer, mean_function=mean_function)
  File "/home/daniel/GPy/GPy/core/gp.py", line 44, in __init__
    else: self.X = ObsAr(X)
  File "/home/daniel/.virtualenvs/gpy-env/local/lib/python2.7/site-packages/paramz/core/observable_array.py", line 59, in __new__
    obj = np.atleast_1d(np.require(input_array, dtype=np.float64, requirements=['W', 'C'])).view(cls)
  File "/home/daniel/.virtualenvs/gpy-env/local/lib/python2.7/site-packages/numpy/core/numeric.py", line 686, in require
    arr = array(a, dtype=dtype, order=order, copy=False, subok=subok)
ValueError: could not convert string to float: Sears Famous Kenmore Completely Automatic Washer.It's like magic—food-It, set it and forget it. Washes all kinds of clothes amazingly clean, automatically. Rinses ail clothes 7 times, automatically.
```

Is there any reason for this enforcement? Can't it just inherit the dtype from the original array? Should I open an issue in the paramz repository?
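A minimal reproduction of the failing call (this mirrors the np.require line in the traceback; the dtype-inheriting variant is just what I'm suggesting, not current paramz behaviour):

```python
import numpy as np

X = np.array([['some text input'], ['another string']], dtype=object)

# paramz's ObsAr.__new__ does roughly this, which fails for non-float inputs:
try:
    np.require(X, dtype=np.float64, requirements=['W', 'C'])
except ValueError as e:
    print(e)  # could not convert string to float: ...

# Inheriting the dtype of the incoming array instead leaves it untouched:
X_obs = np.require(X, dtype=X.dtype, requirements=['W', 'C'])
```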

mzwiessele commented 8 years ago

You are perfectly right! It should just inherit the dtype!


beckdaniel commented 8 years ago

I did a pull request on paramz. Although I'm not entirely happy with the solution, it seems to work.

mzwiessele commented 8 years ago

It has been fixed in paramz. Can we get a pull request in, so that we can have this beautiful idea in GPy? Structured inputs are a big step towards natural language processing, graph mining, gene sequence analysis, etc. using GPs.

beckdaniel commented 8 years ago

I appreciate the interest, but I haven't touched the code since my comment last September... All the issues I mention in that comment are still there... It would also need to be merged with the latest devel, which will require some modifications as well.

If you absolutely want this regardless of its state, I can open the pull request. It can be auto-merged with master but not with devel. Ideally, though, it should go on a separate branch, in my opinion.

lawrennd commented 8 years ago

@mzwiessele is this something we can get Thomas to look at too? I'm guessing it may be a bit too involved for him. @beckdaniel, we really appreciate the work you've done on it, we'll see if we can find a smooth way to integrate.

mzwiessele commented 8 years ago

We fixed the issues in paramz, so it should be fine now? I would love to allow for different kinds of inputs in GPy, as it greatly improves usability for data scientists.

Could you please check which problems actually persist? I thought we had handled a lot of them since September.


beckdaniel commented 8 years ago

The issues are mainly on my side. Just to be clearer, there are two types of structural kernels I am working on: the Tree Kernels discussed earlier in this thread and the String Kernels from the traceback above.

A general thing about structural kernels is that they tend to be quite expensive to compute. With these kernels, calculating the Gram matrix can quickly become more expensive than inverting it. On the other hand, the Gram matrix calculation can easily be done in parallel. Also, parts of these kernels can be vectorized, but doing so is not trivial because of their recursive definition.
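To make the recursive definition concrete, here is a rough pure-Python sketch of the Collins & Duffy (2001) counting recursion (nodes with .production and .children attributes are an assumed representation; lam is the usual decay parameter):

```python
def delta(n1, n2, lam=0.5):
    # Weighted count of common subtrees rooted at n1 and n2.
    if n1.production != n2.production:
        return 0.0
    if not n1.children:  # base case: preterminal nodes with the same production
        return lam
    result = lam
    # Matching productions imply the same number of children, compared position-wise.
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1.0 + delta(c1, c2, lam)
    return result

def tree_kernel(nodes1, nodes2, lam=0.5):
    # K(T1, T2) sums delta over all node pairs, hence the quadratic cost per entry.
    return sum(delta(n1, n2, lam) for n1 in nodes1 for n2 in nodes2)
```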

My view is that there is not much point in using these kernels if we cannot make their Gram matrix calculation at least as fast as the inversion, and I don't see a way to achieve that without parallelization. This is what I tried to do for the Tree Kernels using Cython: it worked, but the code turned into a big mess because it required converting Python data structures into C++ containers. I am not happy with this solution at all, but I am not sure which alternative would result in cleaner and more maintainable code.

mzwiessele commented 8 years ago

Thanks for the update, I will put it into a later release : )

mathDR commented 8 years ago

Is there a status update on this? I am currently looking at doing GP regression with mixed (categorical and continuous) inputs and would love to see this functionality!

darthdeus commented 5 years ago

@mathDR Just ran into this. I realize that it is a few years later, but how did you handle mixed data? I'm interested in trying this approach https://arxiv.org/abs/1706.03673 which basically rounds the discrete dimensions before they are passed to the kernel, but I'm really not sure how to implement such a thing in GPy.
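The closest I can come up with is a thin subclass that rounds the discrete columns before delegating to a standard kernel. This is just my sketch of the paper's transformation, not an existing GPy feature (GPy.kern.RBF and its K/Kdiag methods are real API; gradient computations would need the same treatment and are omitted):

```python
import numpy as np
import GPy

class RoundedRBF(GPy.kern.RBF):
    # Rounds the given input columns before the RBF computation,
    # in the spirit of arXiv:1706.03673 (sketch only).
    def __init__(self, input_dim, discrete_dims, **kwargs):
        super(RoundedRBF, self).__init__(input_dim, **kwargs)
        self.discrete_dims = discrete_dims

    def _round(self, X):
        if X is None:
            return None
        X = X.copy()
        X[:, self.discrete_dims] = np.round(X[:, self.discrete_dims])
        return X

    def K(self, X, X2=None):
        return super(RoundedRBF, self).K(self._round(X), self._round(X2))

    def Kdiag(self, X):
        return super(RoundedRBF, self).Kdiag(self._round(X))
```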

ekalosak commented 12 months ago

Stale.