OHDSI / SqlRender

This is an R package and Java library for rendering parameterized SQL, and translating it to different SQL dialects.
https://ohdsi.github.io/SqlRender
Other
81 stars 77 forks source link

Interested in SqlRender Algorithm #37

Closed hajarhomayouni closed 7 years ago

hajarhomayouni commented 7 years ago

Dear developers,

I am really interested in the algorithm you used for translating from Sql server to other dbms' formats. Do you have any publications/ documents describing the logic behind this translations.

I mean what kind of parser algorithm is used ?

schuemie commented 7 years ago

Hi @hajarhomayouni ,

I'm sorry, we don't have anything documenting the algorithm. It doesn't perform a full parse of the input SQL. Rather, it really is just a search and replace for a pattern, with some extra code to make sure the pattern stays at the same level of nesting. For example, when searching for the pattern SELECT @a FROM @b WHERE, it will not match the nested WHERE with the first SELECT in SELECT a FROM (SELECT a FROM b WHERE a = 1) temp WHERE a != 2;

The meat of the algorithm is here.

We are running into the limitations of this approach (e.g. a search pattern cannot begin or end with a wildcard). Probably version 2.0 will use full parsing, and depart from SQL Server SQL as starting dialect. Instead, the input dialect will be some SQL dialect specifically designed for SqlRender.

Cheers, Martijn

hajarhomayouni commented 7 years ago

Thank you Martjin. Could you please tell me what are the possible values for a parameter starting with @? Actually, I am trying to use SqlRender to translate my queries into GBQ which does not support + for string concatenations and it just supports CONCAT function. On the other hand, as you mentioned I cannot add '@a'+@b to the replacement pattern table directly because patterns cannot start or end with a wildcard (@b). So I decided to consider all possible endings for this pattern (e.g. '@a'+@b, ) and add them to the replacement patterns. But I know the precedence of considering this ending points matters for SqlRender. And I know that @b can include many things, including other symbols. So if I add a pattern like '@a'+@b, in higher level than '@a'+@b+ in the replacement patterns table, the following would be translated incorrectly:

Select 'a'+b+c,d from table; translation: Select CONCAT('a',b+c),d from table.

I mean I wanted b to be my right hand side operand while the whole b+c is considered the right hand side operand because it matches the first pattern '@a'+b, To overcome this problem I added all 2 permutation of n possible symbols to this table. for example I added both patterns '@a'+@b+@c, which considers [+ ,] and also added '@a'+@b,@c+ which considers [, +]. Although I did not consider the order of these combinations to each other which should matter for SqlRender (first put [, +] or [+ ,] ?), but it is working for any combinations I test!!! I was wondering why it is working if it is just search for the first match in the patterns? for example in the table I put [ ) +] in higher level than [+ ) ], but it is translating the follwing pattern correctly!! ('a'+b+c)+d translation: CONCAT('a',b,c,d)

Thanks, Hajar

chrisknoll commented 7 years ago

What happens if you do ( (@a + @b) + @c)? I'm not sure how SQL render handles nesting of expressions, but i do know it does handle things like:

select a, b, c from T1 where T1.a in (select i from T2 where T2.col = @someVal)

Where the nested query requires it's own translation sweep.

If sql render sees the following parts of the first expression:

First find (@a +@b) and translate that (i think) into concat(a,b) then above that we have ( {expr} + @c) would become concat(concat(a,b),c).

I'm just not sure if the SQL render nested parsing works this way.

-Chris

Sigfried commented 7 years ago

I've also been wondering about a couple sqlRender issues. The VA, I believe, is "hardening" the code, which I think means they are getting @variables to be passed through the jdbc connection (with prepared SQL statements) to prevent SQL injection attacks. So that would, presumably, get in the way of SQLRender using @variables to do query text replacement prior to submitting (preparing) the query. Is there any plan to allow the community at large to take advantage of this and have access to prepared query parameters while still using SQLRender?

The second issue I'm wondering about is related to Hadoop and other DBMSs that use a version of SQL too distant from standard SQL to allow for reasonable use of SQLRender translation (at least with certain features). I understand that SQLRender already allows for individual queries to be hand-translated to specific SQL flavors rather than undergoing SQLRender translation. I'm wondering if someone who understands the ins and outs might produce a guide or a little documentation of what a developer wanting to add support for one of these DBMSs would need to do to get (say) PostgreSQL translations by default but then to override translation for specific queries.

Thanks!

schuemie commented 7 years ago

So many questions! Let me start with Sigfried's first:

By definition, SqlRender has parameterized SQL, that is the whole point of SqlRender (next to the translation bit). I'm curious to see how the VA is getting it to work with prepared statements instead. Could you share this?

The SqlRender R package indeed allows manual translations (I'm not sure about how WebApi, which also uses SqlRender, handles this). The idea is pretty simple: in the inst/sql folder of a package, we currently only have a sql_server subfolder, as you can see for example here in the FeatureExtraction package. Just create an extra folder with the name of your dialect (e.g. 'hadoop'), and make sure SQL files with the same names as in the sql_server folder exist in your folder, with the correct translation. Obviously, the preferred way is to use automatic translation instead. Or consider using one of the five DBMSs already supported instead ;-)

schuemie commented 7 years ago

Responding to @hajarhomayouni next:

I'm not sure I fully understand your question. One slightly dirty (but powerful) thing SqlRender allows for is recursive patterns. If my search pattern is '@a' + '@b', and my replacement pattern is CONCAT '@a', '@b' , if I apply it to this SQL:

SELECT 'a' + 'b' + 'c', x FROM y;

First, the first match will be replaced:

SELECT CONCAT 'a' , 'b' + 'c', x FROM y;

then, the pattern will be searched and replaced again:

SELECT CONCAT 'a' , CONCAT 'b' , 'c', x FROM y;

In this case probably not what you want to achieve....

If you were wondering about the first five patterns in replacementPatterns.csv, these use this recursiveness to translate something like this

INSERT INTO my_table (key,value) 
VALUES (1,0),(2,0),(3,1);

into this

INSERT ALL 
  INTO  my_table (key,value) VALUES (1,0) 
  INTO  my_table (key,value) VALUES (2,0)
  INTO  my_table (key,value) VALUES (3,1) 
SELECT * FROM dual

Hope this helps

Sigfried commented 7 years ago

Hi Martijn,

By definition, SqlRender has parameterized SQL, that is the whole point of SqlRender (next to the translation bit). I'm curious to see how the VA is getting it to work with prepared statements instead. Could you share this?

I'm afraid I'm kind of ignorant about what SQLRender actually does. I thought that when it came across @variables in a query, it did a text substitution with the parameter value into the query itself. Is that wrong? Does it leave the @variables in place, prepare the query, and pass the param values through JDBC to the prepared query?

I don't really know what the VA is doing, but I thought (based on hearsay) that their "strengthening" had to do with this issue.

Thanks.

schuemie commented 7 years ago

Hi Sigfried,

No, you're right about what SqlRender does. That is what I would call parameterization of SQL. So in theory, if the application SQL looks like

SELECT * FROM @cdmSchema.table;

I could in theory abuse that through code injection by running the application with cdmSchema = '; TRUNCATE TABLE blah; DROP TABLE blah;' or something similar, and SqlRender would create SQL that could damage the data. (Note to worried readers: you can only do this if your admin had already given you the rights to do the damage in the first place).

You are right: normally the rendered SQL would no longer contain @variables.

Sigfried commented 7 years ago

Thanks for the clarification Martijn. I do wonder what the VA folks are doing and if they'll share with the larger community.

It does seem like it might have been an unfortunate choice for SQLRender to use @ signs, making it not easy to combine SQLRender parameters and JDBC query parameters in the same query. If both were available, in many circumstances it might be hard to decide which to use, but it would be nice in places where either could work to leave it up to the programmer to decide:

chrisknoll commented 7 years ago

@schuemie , To answer your question bout how they are paramaterizing:

I believe that in the code parts they have changed, they look for the places where we substitute schemas and parameters, and they split it into 2 passes. The first pass does the replace of the schemas and the if {} blocks, and the second pass looks for any remaining parameters and replaces it with '?'. They keep track of which parameters were replaced with ? and use the prepared statement syntax to set the values in the query. So, for example:

select col1, col2, col3 from @schema.table where col4 = @value

becomes:

select col1, col2, col3 from my_schema.table where col4 =?

And they code in the logic that sets the values of the injected '?' to the correct values:

// assume query has the value of the prepared statement
// assume there is a set of values stored to put into the prepared statement
jdbcTemplate.execute(query, new PreparedStatementCallback<Boolean>() {
 @Override
 public Boolaean doInPreparedStatement(PreparedStatement ps)
 {
  ps.setString(1, col4Value);
  return ps.execute();
 }
});

Basically this works in simple, static scenarios pretty easily, and my understanding is that they modified the places where the queries are called in a very non-obtrusive way so it's an easy pattern to follow (the above probably isnt' what it looks like I think they have a 1 or 2 liner to replace the existing code).

Also, they have only done this in the very simple, static places we make queries (in other words, places where the WebAPI end points directly call out to the database. So, not in any background jobs or places like cohort definitions. Apparently this is enough to get through their code checks.

-Chris