IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Proposal for streamsx.stargate, for interacting with HBase via its REST API #71

Closed hildrum closed 8 years ago

hildrum commented 8 years ago

I'm proposing streamsx.stargate, a repository of functions for doing puts and gets into HBase via the REST API. The REST server bundled with HBase is called Stargate, hence the name. More info on the HBase REST server here.

An initial version of the streamsx.stargate repository is under my account. If IBMStreams decides to accept this proposal, it can fork that toolkit rather than creating a new repository.

Inside are two namespaces:

These have been tested and work against HBase on Bluemix.

Why not in the hbase toolkit? Here's why my initial version isn't in the hbase toolkit:

ddebrunner commented 8 years ago

It's unusual that the namespace in stargate is not com.ibm.streamsx.stargate, is it an issue that it shares a common root with a different project?

ddebrunner commented 8 years ago

Should the com.ibm.streamsx.base64 functionality be in streamsx.transform? It is not specific to the HBase REST API (stargate), and if stargate already depends on another toolkit, making it depend on two isn't much of a difference.

hildrum commented 8 years ago

Naming the repository streamsx.hbase.rest would be awkward, I thought. I considered making the namespace com.ibm.streamsx.stargate, to match the current repository name, but that's less descriptive. So, I decided on keeping the single-word repository name, but to use a descriptive toolkit name and namespace.

On the base64, I had originally thought of putting com.ibm.streamsx.base64 into the inet toolkit since base64 is sometimes used in internet protocols, but it could go in transform instead. I ended up putting it here because I don't know of any other applications that would use it, and I figured it could be moved if another application appeared.

chanskw commented 8 years ago

In my understanding, this is a contribution that contains an SPL file that uses the Inet toolkit's HTTP* native functions. We are wrapping the calls to make it easier to access HBase using the REST APIs.

I think the code here is useful, but unsure if we should spin off a new repository for this.

If this depends on the inet toolkit, would it be more appropriate to show this as a utility or sample in the Inet toolkit... and demonstrate how we can use the Inet toolkit to access HBase via Stargate?

When we see a need for a new repository, or as stargate matures, we can spin off a repository at that time?

Similarly, we can do the same thing to demonstrate how to access HDFS using the Inet toolkit.

hildrum commented 8 years ago

I understand where you're coming from, and I had the same thoughts.

But the decision about whether these belong in their own toolkit should be made without regard to how they are implemented. It should not matter whether there's 1000 lines of .cgt (or java or cpp, or splmm, or spl). The decision should be made on whether they fit and how they are intended to be used.

I don't believe that functions for accessing the HBase (or Hdfs) belong in the inet toolkit proper. I could move the base64 stuff into streamsx.inet because encoding in base64 is something you do for internet transfer in a general context. But if HBase-specific stuff belongs in the Inet toolkit itself, then anything using a REST API does.

I also don't intend them as samples. I expect them to be used as-is, not as a basis for people to write their own, similar functions. Packaging them as samples would make them much harder to use that way. If these are samples, someone who wants to use these functions would have to point SPLPATH to the samples directory of the Inet toolkit (-t streamsx.inet/samples/com.ibm.streamsx.hbase.rest), or they'd have to copy the Stargate.spl file into their application.

I'm not at all opposed to putting this into a broader repository; I considered proposing a streamsx.bluemix repository, but in the end I didn't because (1) there's nothing else I'd put in it at the moment, and (2) this could be used by someone outside the context of Bluemix.

chanskw commented 8 years ago

If I think about this in the broader scope, HDFS also supports webhdfs. But Webhdfs support is part of the HDFS operator, and not a completely separate toolkit. I am still not sure we should have a separate toolkit / repository for this. If this is for HBase support, then it should be part of the HBase toolkit.

Can we use these operators to access HDFS, or is this HBase specific?

ddebrunner commented 8 years ago

I think people looking for how to get a Streams app to talk to HBase would naturally look in the hbase toolkit; therefore, if this is just an API to talk to HBase, it should be in the hbase toolkit.

hildrum commented 8 years ago

If I put them in the HBase toolkit and do nothing else, then the HBase toolkit in the product will no longer work out of the box, since it would require the non-product inet toolkit.

If we put them in the HBase toolkit, then we probably also need to do one of these things:

Of those options, my least favorite is the last option. I think we should avoid introducing the dual-maintenance scenario for the HBase toolkit.

mikespicer commented 8 years ago

I agree that the HBase toolkit is the natural place for this functionality and would like to avoid Kris's option 1 above, which makes 2 & 3 the current favorites.

ddebrunner commented 8 years ago

Not sure the layout should be influenced by a short term situation, that the product is not shipping the latest inet toolkit.

chanskw commented 8 years ago

Also, is this really something that we want to promote in the long term? Do we really want to provide utilities to "encourage" people to access HBase via the stargate? This is one way to access HBase for Bluemix, but is this really the optimal way and is this something we want to encourage? Can the jetty server handle the speed and volume from Streams?

I am inclined to think that the underlying connection / access mechanism to HBase and HDFS should be transparent to the end-user, whether stargate is in place or not. Ideally, the HBase operators should somehow detect that they have to go through the stargate gateway to access HBase, and the customer can simply just use the same set of operators.

mikespicer commented 8 years ago

" Ideally, the HBase operators should somehow detect that they have to go through the stargate gateway to access HBase, and the customer can simply just use the same set of operators." Agree with @chanskw on this, but realize we don't live in an ideal world. Would this be doable?

hildrum commented 8 years ago

@chanskw I assume you're talking about jetty being a potential problem because knox is built on jetty? I didn't think there was an alternative to the REST APIs via the knox gateway for interacting with HBase on Bluemix. What's the alternative?

@mikespicer What would be ideal is if we could pick up some java package that wraps the REST calls so that we can use the same API that our HBase operators already use, but it would behind the scenes go through the REST API. There is something like this for webhdfs/hdfs, but I haven't turned up something that does this for HBase.

This means we are stuck doing a lot of work ourselves. We need to take streams objects to strings suitable for the http call, then take the http call result and translate it back into streams objects. The HBase toolkit is the product of some 200 commits over some 1 1/2 years (and it wasn't the first HBase toolkit), so I figure building the REST API support for everything the HBase toolkit does is likely to be a substantial effort.

Given that we don't have a pile o' developers to assign to the task (or even one developer--this is not my "assignment"), I think we should do this work in a way that facilitates staging in features as needed, and hopefully makes it possible for non-experts to map directions (like these curl command line directions) to Streams functions or operators to build what they need into their streaming application.

I think building them as spl functions does that. We could instead shove the REST API logic (build String, do call, parse String) into the current Java operators instead of building it into spl functions. This would be great from the user's perspective, until the day the user realized that what was delivered in the first pass didn't quite work, and went in to try to modify it and found they have to know the difference between a MetaType and a Type to navigate the code.
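To make the "build String, do call, parse String" steps concrete, here is a minimal sketch in Python rather than SPL. It assumes the JSON representation commonly used by the HBase REST server, in which row keys, column names, and cell values are base64-encoded. The function names (`build_put_body`, `parse_get_response`) are hypothetical and are not taken from the toolkit; the HTTP call itself is omitted.

```python
import base64
import json


def b64(text: str) -> str:
    """Base64-encode a UTF-8 string for use in a REST JSON body."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def unb64(encoded: str) -> str:
    """Decode a base64 field from a REST JSON body back to text."""
    return base64.b64decode(encoded).decode("utf-8")


def build_put_body(row: str, column: str, value: str) -> str:
    """Step 1 (build String): JSON body for a single-cell put.

    Assumes the {"Row": [{"key": ..., "Cell": [{"column": ..., "$": ...}]}]}
    layout; verify against the REST server you are targeting.
    """
    cell = {"column": b64(column), "$": b64(value)}
    return json.dumps({"Row": [{"key": b64(row), "Cell": [cell]}]})


def parse_get_response(body: str) -> dict:
    """Step 3 (parse String): turn a JSON response into {row: {column: value}}."""
    result = {}
    for row in json.loads(body).get("Row", []):
        cells = result.setdefault(unb64(row["key"]), {})
        for cell in row.get("Cell", []):
            cells[unb64(cell["column"])] = unb64(cell["$"])
    return result


if __name__ == "__main__":
    body = build_put_body("row1", "cf:name", "streams")
    print(body)
    print(parse_get_response(body))
```

Even in this small form, most of the work is translating between typed values and encoded strings, which is the effort being described above.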

ddebrunner commented 8 years ago

"This is one way to access HBase for Bluemix, but is this really the optimal way and is this something we want to encourage?"

I'm not sure providing a mechanism is necessarily encouragement, this is open source, someone had an itch to provide access to Hbase through stargate, and wanted to share it. Seems a useful addition to the toolkit.

chanskw commented 8 years ago

I am not against accessing HBase for Bluemix via Stargate. I was wondering about performance impact and I have the following two main concerns with this approach:

1) Creating a new repository for this - Currently, the only way to access HBase for Bluemix is via Stargate. I am hoping this may be a temporary situation. In the long run, perhaps we may be able to access it via the normal HBase client. In that case, I do not see long-term viability for this repository. I also think it will confuse users as to where to get this HBase support.

2) I am concerned with the approach of creating new operators to access HBase just because there is a Stargate gateway in between. This is not a very user-friendly approach. In addition, once we introduce these operators into the field, we have to live with them for a long time, and we cannot remove them. So, I want to make sure that we are going in the right direction, and that this is something we want to do in the long term. Having a new set of operators poses additional maintenance work, and any new feature we add to the HBase operators, we may end up having to do twice, because the operators are separate.

Like I said, I think accessing HBase via Stargate is useful. I am mainly concerned as to how we are doing it, and what the user experience is like.

Having said that, I understand that we do not live in the ideal world, and it is a lot more work to try to combine the two sets of operators. I agree with Kris that we need to be able to stage this support and make incremental progress.

So, I would like to focus on that, how do we make progress and allow support for HBase for Bluemix. Can we do this?

1) I think we agree that this should go into the HBase toolkit. The fact that it requires the master branch from Inet toolkit is tricky. Can I propose that we create a feature branch for these operators in the HBase toolkit? Kris can contribute it there.
2) Build an Alpha release with the new operators in the HBase toolkit, so people can get to these features.
3) We get feedback from the community about these operators. It may be missing some features that are available in the other HBase operators. And we can gather more requirements there to see what's important.
4) Slowly work towards combining the two sets of operators in the future. Or hopefully we do not need these operators when we can access HBase for Bluemix some other way.
5) When we gather enough feedback about this approach, or when we are able to work on these operators to combine the two sets, then we can talk about integrating this into the master branch. This also allows us to work on the Inet toolkit, and potentially adding the required native functions into the product release.

Please let me know what you think about this plan. Thanks...

hildrum commented 8 years ago

It sounds like the consensus is that these belong in streamsx.hbase. I'm okay with that.

Given that, my preference would be to make the master branch of the inet toolkit be the one included in the product.

My second preference would be to deliver the http* functions to the 2.0 branch of the inet toolkit. This introduces some versioning problems. Adding functionality means incrementing the second number, so logically the version on that branch would then be 2.1. But if I declare a dependency on com.ibm.streamsx.inet v2.1, it would seem like the operators would work with com.ibm.streamsx.inet v2.5, and they won't, since they need 2.7 if you're looking at what's in the master branch.
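The versioning problem can be illustrated with a short sketch. It is in Python, not the toolkit's actual dependency mechanism, and it assumes a simple "minimum version" check; the constant names are hypothetical, while the version numbers 2.1, 2.5, and 2.7 come from the paragraph above.

```python
def parse_version(text: str) -> tuple:
    """Split a dotted version such as '2.5' into a comparable tuple (2, 5)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(available: str, minimum: str) -> bool:
    """Return True if `available` is at least `minimum` (a naive range check)."""
    return parse_version(available) >= parse_version(minimum)


# Hypothetical labels for the versions discussed above.
BACKPORT_ON_2_0_BRANCH = "2.1"   # 2.0 branch after the http* functions are added
MASTER_BEFORE_HTTP = "2.5"       # a master version that lacks the functions
MASTER_WITH_HTTP = "2.7"         # first master version that has them

if __name__ == "__main__":
    # A dependency declared as ">= 2.1" accepts 2.5, even though 2.5 on
    # master does not contain the functions the caller needs.
    print(satisfies_minimum(MASTER_BEFORE_HTTP, BACKPORT_ON_2_0_BRANCH))
    print(satisfies_minimum(MASTER_WITH_HTTP, BACKPORT_ON_2_0_BRANCH))
```

Both checks pass, which is the issue: the version number alone no longer tells you whether the functions are present once two branches add features independently.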

I am not okay with putting them in a feature branch in streamsx.hbase. I like the feature branch pattern, and I find it very useful for adding functionality. However, what you're proposing is that these functions stay in a feature branch for an undetermined length of time, essentially splitting the streamsx.hbase toolkit. I don't think that's a good pattern. I certainly haven't found it comfortable in dealing with the streamsx.inet toolkit. If the only way they can go under IBMStreams is in what amounts to a permanent feature branch on streamsx.hbase, then I'll keep them out of IBMStreams and under my own account.

ddebrunner commented 8 years ago

"I think we agree that this should go into the HBase toolkit. The fact that it requires the master branch from Inet toolkit is tricky."

Maybe I'm missing something, but why is this tricky?

chanskw commented 8 years ago

Only saying it's tricky from the product perspective. We need to think more about what to do with the Inet toolkit.

chanskw commented 8 years ago

A feature branch is a common pattern for incubating features until they are ready to be integrated into the master branch. When we integrate depends on when we have time to add the new native functions into the product inet toolkit. It also gives us time to gather feedback and harden support for HBase for Bluemix.

I think having it there is a much better approach than leaving these operators in your private clone. It also allows us to build an alpha release so people can try this out.

mikespicer commented 8 years ago

Do we have an estimate of how soon we would be able to pull the necessary elements of the inet toolkit into a product release? And could we agree that a branch approach is not intended as a way to orphan these, and that the branch would be pulled into the product version of the hbase toolkit once the inet toolkit has the necessary functions? A caveat would be if we had managed to implement the ideal single set of operators that could handle both cases, but I suspect that won't happen by then.

mikespicer commented 8 years ago

One comment on the base64 encoding. It occurred to me that the streamsx.bytes project was created for these kinds of byte manipulation functions, and it does have decodeBase64 and encodeBase64 functions. We should either use those as is or contribute to the streamsx.bytes toolkit to include the functionality you need.

hildrum commented 8 years ago

The hbase functions can use the streamsx.bytes as-is--whatever the final version of this, I'll make sure it does that. Thanks Mike, I didn't know the functions were already available.

hildrum commented 8 years ago

Updated as Mike suggested to use streamsx.bytes. It seems to work fine; the only problem is that the streamsx.bytes decodeBase64 doesn't have a way to signal an error on the decode. Also, I suspect it may not work on Power. I entered issues in streamsx.bytes for each of those items.
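As an illustration of what "signal an error on the decode" could look like, here is a minimal sketch in Python, not the streamsx.bytes implementation. The function name `decode_base64_checked` is hypothetical; it returns `None` for malformed input instead of silently producing bytes.

```python
import base64
import binascii
from typing import Optional


def decode_base64_checked(encoded: str) -> Optional[bytes]:
    """Decode base64 text, returning None when the input is not valid base64.

    validate=True makes the standard library reject characters outside the
    base64 alphabet rather than skipping them.
    """
    try:
        return base64.b64decode(encoded, validate=True)
    except (binascii.Error, ValueError):
        return None


if __name__ == "__main__":
    print(decode_base64_checked("c3RyZWFtcw=="))  # well-formed input
    print(decode_base64_checked("not base64!"))   # malformed input
```

A caller can then tell a bad payload from a legitimately empty value, which matters when the decoded bytes come from a remote REST response.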

chanskw commented 8 years ago

Closing; we merged this into the hbase toolkit.