cudbg / sql2pandas

Turn SQL into pandas statements
https://cudbg.github.io/sql2pandas
MIT License

Sort as one-liner #6

Open sirrice opened 2 years ago

sirrice commented 2 years ago
# SELECT a+1, d*3, g FROM data ORDER BY b
df = data.sort_values(["b"], ascending=[1])
df = df.assign(a0=(df.iloc[:,0]) + (1.0),a1=(df.iloc[:,3]) * (3.0),g=df.iloc[:,6])[['a0', 'a1', 'g']]

Generate instead

df = (data
  .sort_values(["b"], ascending=[1])
  .assign(a0=lambda df: (df.iloc[:,0]) + (1.0),a1=lambda df: (df.iloc[:,3]) * (3.0),g=lambda df: df.iloc[:,6])[['a0', 'a1', 'g']]
)
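A quick check of the chained form against the two-statement form, as a sketch on made-up data (the column values and the names `two_step` and `chained` are only for illustration). One caveat: inside a chain there is no intermediate `df` variable yet, so the column expressions in `assign` are written as lambdas, which pandas calls on the sorted intermediate frame.

```python
import pandas as pd

# Hypothetical 7-column table so that positions 0, 3 and 6 are a, d and g.
data = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [30, 10, 20],
    "c": [0, 0, 0],
    "d": [4, 5, 6],
    "e": [0, 0, 0],
    "f": [0, 0, 0],
    "g": ["x", "y", "z"],
})

# Two-statement form: the sorted frame is named, then referenced directly.
df = data.sort_values(["b"], ascending=[True])
two_step = df.assign(
    a0=df.iloc[:, 0] + 1.0,
    a1=df.iloc[:, 3] * 3.0,
    g=df.iloc[:, 6],
)[["a0", "a1", "g"]]

# Chained form: each lambda receives the sorted frame produced by the
# previous step, so no intermediate variable is needed.
chained = (data
  .sort_values(["b"], ascending=[True])
  .assign(a0=lambda d: d.iloc[:, 0] + 1.0,
          a1=lambda d: d.iloc[:, 3] * 3.0,
          g=lambda d: d.iloc[:, 6])
  [["a0", "a1", "g"]]
)

print(chained.equals(two_step))
```

New columns from `assign` are appended at the end, so the positional `iloc` lookups for columns 0, 3 and 6 still hit the original columns.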

pandastutor: I manually rewrote it as a one-liner.

sirrice commented 2 years ago

Make the ascending argument to sort_values use the booleans True/False rather than 1/0. Figure out a general way to translate Python lists/dicts into printable strings.
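One possible approach, as a sketch rather than what sql2pandas does: for lists and dicts of plain literals, Python's built-in `repr()` already yields a string that is valid source code. The helper `render_call` below is a hypothetical name, not an existing function in the codebase.

```python
def render_call(method, *args, **kwargs):
    """Render a method call as source text, using repr() for every argument.

    Hypothetical helper: this only round-trips for values whose repr() is
    valid Python (str, int, float, bool, None, and lists/dicts/tuples of them).
    """
    parts = [repr(a) for a in args]
    parts += [f"{name}={value!r}" for name, value in kwargs.items()]
    return f"{method}({', '.join(parts)})"


# Convert 1/0 sort flags to real booleans before printing.
flags = [1]
ascending = [bool(f) for f in flags]

line = render_call("sort_values", ["b"], ascending=ascending)
print(line)  # sort_values(['b'], ascending=[True])
```

Note that `repr()` prints strings with single quotes, so the generated code would read `['b']` rather than `["b"]`.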

sirrice commented 2 years ago

This requires identifying "pandas pipeline breakers" -- sequences of operators in a query plan pipeline that can be inlined into a single statement. Could do this by augmenting the compiler with a chain-type method. Something like:

ctx.add_line("df = ...")
tmp = ctx.chain("df2", "df")
tmp.call("foo(a,b)")
tmp.call("bar(c)")
tmp2 = ctx.chain("df3", "df2")
tmp2.call("baaz(d)")

would generate:

df = ...
df2 = (df
  .foo(a,b)
  .bar(c))
df3 = (df2
  .baaz(d))

We then need a way to pass the chained tmp or tmp2 from child to parent. Could pass chain objects instead of variable names in ctx['df']?
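A minimal sketch of what the chain idea could look like. `Context`, `Chain`, `render` and `compile` are hypothetical names chosen to match the calls above, not the existing compiler API. Each chain is rendered lazily, so a parent operator that receives the child's chain object can keep appending calls to it before any code is emitted.

```python
class Chain:
    """A pending `target = (source.call1().call2()...)` statement."""

    def __init__(self, target, source):
        self.target = target
        self.source = source
        self.calls = []

    def call(self, expr):
        # Append one method call; return self so a parent can keep chaining.
        self.calls.append(expr)
        return self

    def render(self):
        if not self.calls:
            return [f"{self.target} = {self.source}"]
        lines = [f"{self.target} = ({self.source}"]
        lines += [f"  .{c}" for c in self.calls]
        lines[-1] += ")"  # close the parenthesized chain on the last call
        return lines


class Context:
    """Collects plain lines and chains in emission order."""

    def __init__(self):
        self.items = []

    def add_line(self, line):
        self.items.append(line)

    def chain(self, target, source):
        c = Chain(target, source)
        self.items.append(c)
        return c

    def compile(self):
        out = []
        for item in self.items:
            out.extend(item.render() if isinstance(item, Chain) else [item])
        return "\n".join(out)


ctx = Context()
ctx.add_line("df = ...")
tmp = ctx.chain("df2", "df")
tmp.call("foo(a,b)")
tmp.call("bar(c)")
tmp2 = ctx.chain("df3", "df2")
tmp2.call("baaz(d)")
code = ctx.compile()
print(code)
```

Under this sketch, passing `tmp` itself up to the parent (rather than the string "df2") lets the parent call `tmp.call(...)` to stay in the same statement, or start a new chain when it hits a pipeline breaker.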