Add support for Hive LLAP

pavgra commented 5 years ago

Replacement patterns + tests

schuemie commented 5 years ago

You're welcome to add the replacement patterns and tests, as long as you're also willing to provide support on those.

In my humble opinion, I would prefer supporting a few SQL dialects well, rather than trying to support everything but not being able to do so well. Does Hive LLAP really bring something to the table that the nine platforms we already support don't?

gklebanov commented 5 years ago

hi Martijn - all valid points! And no simple answer, of course.

Yes, Odysseus will support Hive LLAP implementation going forward, at least for the time while it is being utilized by the requesting organization.

While I do totally understand - and agree - with a desire to minimize the number of dialects that we should support as OHDSI, but new additions are being driven by ATLAS success and adoption and internal company standards. Most of the organizations are being driven by the Enterprise IT standards and thus have very little choice (or none) of forcing a specific type of the database. Some use RedShift, some Cloudera Impala and some Google BigQuery etc... While we - in OHDSI - can make a decision to support only core standards (and we need to decide on what those are), but should allows others to pick up support of the standards that are required by them.

So, the Hive LLAP is being driven by a request from an organization that is migrating from Netezza to HortonWorks - and there is not other option for them. Now, the good news is that once Hive LLAP is implemented, we can start deprecating Netezza.

In the future, especially for ATLAS 3.0 - we should think of adopting a more modular approach to the database support where different implementations are more decoupled and do not affect each other.

It is an important question which we discussed with Patrick multiple times as well. We should definitely go back to that one again in the near future all together.

schuemie commented 5 years ago

My two cents:

The core platforms are the three we supported from the start: PostgreSQL, Oracle, and SQL Server. Throughout OHDSI we should commit to supporting those.

I've taken ownership over PDW and RedShift, because those are used in my company. I will provide support on those.

Others have added other platforms over time, although it is not always clear who provides support for those. If Odysseus wants to own LLAP, then that is fine by me. Any pull request that will come in I will only check on whether it breaks anything outside of LLAP. This is already similar to how things work for for example BigQuery, although we have hit some grey areas there. For example, someone wanted to throw away all code previously contributed by Google for BigQuery, and I prevented that although I don't see Google coming back to provide support.

How deprecation works I don't know.For example, I'm sure there are many more people using Netezza than we are aware of.

gklebanov commented 5 years ago

The core platforms are the three we supported from the start: PostgreSQL, Oracle, and SQL Server. Throughout OHDSI we should commit to supporting those.

So, I would ask how it was decided to pick those 2 commercial platform as "core" ? Probably because the initial OHDSI contributors were heavy users of them. But frankly, except for one institution who use OMOP CDM - I do know anyone else who is currently using Oracle. Still lot of SQL Server. But the difficulty of committing to support these though OHDSI (and I consider us as OHDSI) is that in OHDSI we do not even have a license for these or even an available SQL Server or Oracle instance to test against. So, the reality is that those platforms are also being driven by some users who happened to be active in OHDSI. So, I would say that open source should support open source databases and the rest - we should just explicitly name the organizations within OHDSI that use and pledge to support of specific commercial products. This is why I proposed we talk more to better define the model.

As of now, Odysseus will provide support for Google BQ (we worked with Google to fix it up and continue to work closely), Impala, Hive LLAP and we are also very heavy users of RedShift and PostgreS. Now, we also had to fix a few things to get Oracle working at one of the institutions that happened to be an Oracle user. And whoever removed working BQ code - definitely should not have happened like that.

How deprecation works I don't know.For example, I'm sure there are many more people using Netezza than we are aware of.

Personally, we know where Netezza code came from and helped to clean it up and integrate it. That same organization is the one moving to Hive LLAP. Plus one more company where I happen to previously work in and I doubt they use it now. So, good question - how do we know if there is someone else. The easiest it so us to reach out to these first and then ask on forums. Might not actually be a need to deprecate - we will stop supporting it since we do not have any users and see if anyone screams and wants to pick it up.

ideally, we should have some stats somewhere on what platforms are being used across OHDSI - so, yes, a very valid question!

pavgra commented 5 years ago

For example, someone wanted to throw away all code previously contributed by Google for BigQuery, and I prevented that although I don't see Google coming back to provide support.

The someone was me. The someone attempted to do that since BQ was the only dialect for which you had special classes and special processing differing from any other DB. Therefore, having the best wishes for clean code structure I tried to clean that up. And prevented that is only partially applicable since I've clearly shown why a significant part of that code doesn't make sense, you agreed and the code was dropped.

schuemie commented 5 years ago

@gklebanov : we do have testing instances for Oracle and SQL Server in OHDSI, and for example FeatureExtraction and Achilles test against these in continuous integration.

Yes, we may revisit whether to support Oracle in the future. But again, I think you will be surprised to learn how many people run OHDSI tools against Oracle. Having user statistics would be very helpful indeed. Maybe we could set up a survey?

@pavgra : I didn't mean to reopen that discussion, just wanted to point out that things are not always black and white, and decisions are sometimes made on a case-by-case basis.

pavgra commented 5 years ago

Why cannot we just use the same strategy as we decided to apply for Atlas? - enable plugin approach, allow others to extend the library / load custom translation patterns / etc. In that way, anyone could add whatever he/she wants and do not face these "standard vs non-standard" debates as those additions could be stored as a separate SqlRender module outside of the repo. @schuemie , do you have any concerns with me proposing and implementing some hooks and functions to allow better SqlRender extensibility?

chrisknoll commented 5 years ago

Why cannot we just use the same strategy as we decided to apply for Atlas? - enable plugin approach, allow others to extend the library / load custom translation patterns / etc.

Because this could become a support nightmare. People would come to the OHDSI forums asking questions about some third-party plugin where we don't know the ownership and there's no support obligations from anyone and it leads to a lot of tensions and uncertainty of the tools.

I do not want to hijack this thread into debating the pros-cons of the plugin architecture, or establish some governance rules surrounding the ownership of these plugins. I'm just pointing out that adding a plugin mechanism isn't a silver bullet for this particular Issue (to add support for Hive LLAP), and @schuemie 's point about ownership and support still stands: whether this is a plugin architecture or people collaborate together on a common codebase, we still need to have clear ownership and support structure for these sub-modules. If we do want to approach this as a 'plugin problem', then there are certain fundamental decisions that are required (what is the set of official plugins, who supports those plugins, where do people go for support of the plugins, how are these plugins documented).

pavgra commented 5 years ago

Because this could become a support nightmare

First, you express concerns on whether Hive LLAP should be supported by OHDSI. Second, I say: ok, let us then code this in our separate workspace, do not throw not required code into the main repo - we just need extension points for this. Then you answer: no, extension points are not welcomed since we need to support plugin. First and last sound a bit contradictive to me. Do not treat this message as an offense, I'm just trying to figure out a way for development which will make everyone happy.

chrisknoll commented 5 years ago

@pavgra , i only made one post in this whole thread, so I don't see how you can present me arguing with you that I said..and then you said..and then I said which is some kind of contradiction.

Weighing pros and cons is all about opposites: on one hand it lets third party extensions at random. On the other hand, people only go to one place (OHDSI) for support on these tools. So, some of the decision here should be bearing in mind how these things are supported. Today, with taking PRs into one repo, all this gets managed and accepted through a single channel, and if there's any problems, we go to one place to discuss it. With a more 'distributed' approach, we bypass the 'single point of entry' but now there's concerns about how this is managed and supported. I'm not offering a solution here, I'm asking you to just consider this perspective. At a high level, the notion of plugging in SQLRender providers to support different platforms sounds good. The challenge is indicating what we support out of the box, how to make these plugins available, how people submit new plugins, etc. So it's just not 'hey lets go to a plugin architecture'. There's a lot of benefits to the current approach from a governance and support perspective that is a negative from a 'third-party extensible' perspective. Let's not make snap decisions on jumping on a particular bandwagon without considering all the angles.

schuemie commented 5 years ago

As far as I can see, SqlRender is already pretty 'modular': translation rules only impact their respective dialects. However, as package owner I still need to make sure any proposed changes fit with what was already there. I don't see a way around that.

The idea that we can make an architecture that makes every aspect completely and absolutely independent is I think wishful thinking.

Right now this focus on supporting every possible platform is driving attention away from providing good functionality that is easy to use.

schuemie commented 5 years ago

For comparison, I2B2 only seems to support PostgreSQL, Oracle, and SQL Server: https://community.i2b2.org/wiki/display/getstarted/2.2+Database+Requirements

gklebanov commented 5 years ago

@pavgra - Chris and Martijn are making a valid point in regard that in any design - current or more modular - the first question is who is going to support each dialect.

@schuemie - there are definitely ways to build a more modular design but you are correct it is not that black and white.

The database requests are being driven by the existing organizational architecture standards and in most - if not all - cases where we are involved the initial OHDSI "standard" database choices are not available and thus led to requests to add more so that people can use an existing and supported infrastructure. Buying licenses for Oracle or SQL Server for most is not an option, never mind they are very expensive (and never mind that with a scale of data that is being hosted today - they are probably not the best choices when it comes down to scalability). We have a choice - to do it and thus drive a wider ATLAS/OHDSI adoption, or not do it and thus not to many support ATLAS/OHDSI opportunities. Yes, it is more effort to add new support for more dialects - and I do share a similar concerns - but if organizations like Odysseus are ready to provide support for that additional dialect - similarly to how Martijn currently provides support for RedShift and PWD outside of core OHDSI standards - that sounds like a reasonable proposition?

we do have testing instances for Oracle and SQL Server in OHDSI, and for example FeatureExtraction and Achilles test against these in continuous integration.

Great! Is it possible to share that info? That would help us to ensure we are doing better test coverage when releasing new ATLAS and going through a test process.

Having user statistics would be very helpful indeed. Maybe we could set up a survey?

I think something like this would definitely be useful.

guys, this is a great conversation to have - including standards databases to be supported by OHDSI, support process for non-standard and modular design etc.. - I propose we take the key points shared here and revisit at some point when we are discussing the new OHDSI ATLAS 3.0 design and OHDSI architecture in general?

OHDSI / SqlRender

Add support for Hive LLAP #173