cube2222 / octosql

OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
Mozilla Public License 2.0
4.74k stars 201 forks source link

Relaxed mode? #298

Open chapmanjacobd opened 1 year ago

chapmanjacobd commented 1 year ago

I really like how fast octosql is and it seems like a great tool to bring unstructured data into a strict hierarchical, columnar format.

Unfortunately, wrangling unstructured data usually requires some flexibility and interpretation. I have two main issues so far:

1) Selecting variables which are in some files but not others:

⭐ It would be really nice to have a --relaxed flag where unknown variables are silently replaced with null

unzstd --memory=2048MB --stdout RS_2008-02.zst | octosql "SELECT overridden_url, valid_column FROM stdin.json" -o csv

Error: typecheck error: unknown variable: 'overridden_url'

Ideally, that would:

2) No way to cast booleans to int?

octosql "SELECT cast(false as int)"
Error: typecheck error: typecast of non-union type expression 'Boolean'

octosql "SELECT int(false)"
Error: typecheck error: unknown function: int(Boolean)

For example, in sqlite:

sqlite-utils memory -t 'select cast(false as int) as hello'
  hello
-------
      0
chapmanjacobd commented 1 year ago

also, what's going on with len() ?

octosql "SELECT * FROM docs.function_signatures"
...
| 'len'            | []                       | 'NULL'        | true   | false            |
| 'len'            | []                       | 'NULL'        | true   | false            |
| 'len'            | []                       | 'NULL'        | true   | false            |
| 'len'            | ['String']               | 'Int'         | true   | true             |
...

octosql "SELECT len((0,1,2,3))"
+-------+
| col_0 |
+-------+
|     4 |
+-------+

seems to work fine...

octosql -v
octosql version dev

also, you might also want to store/print the git commit during compile time or something, for example:

'octosql version dev' + git describe --match=NeVeRmAtCh --always --abbrev=8 --dirty && git status --short

cube2222 commented 1 year ago

Hey!

So, multiple things you mentioned:

relaxed flag

The use case makes sense, I'll think about it.

No way to cast booleans to int

Good point, happy to add this.

what's going on with len()

When the simple_signature is false, then the arg types / return types have no meaning. But it's a good point that this way it's completely unreadable right now what's supported and what isn't. I'll see if I can add "synthetic" signatures there.

len works with strings, lists, tuples, and objects right now.

cube2222 commented 1 year ago

No way to cast booleans to int

This is now added in https://github.com/cube2222/octosql/commit/cc43f257bfe318a5f93ca336728c13443aba6c7f and released in 0.11.1.