Cool things in Malloy - Githubissues

max-sixty commented 2 years ago

I've been spending some time looking at Malloy, a language from Looker that looks promising. I've been a fan of @lloydtabb's for a while, and briefly chatted to him about Malloy a couple of weeks ago.

As ever with these "better SQL" languages — our fates are positively correlated. So I hope it's collegial to explore the things Malloy does particularly well at.

Without a full review of the differences — Malloy seems more focused on a broader vision of data modeling with structure on top of existing tables — like explores — blocks that combines tables — and dimensions & measures (these are things I always liked in Looker). The language integrates an approach for defining charts. There's less emphasis on pipelines — ordering isn't semantically important — though there is a way to pipe queries into other queries. It's more typing to get started.

Some things that look compelling at first glance:

Inline filters:
```
ny_flights is count() { where: origin.state = 'NY' }
```
compiles to
```
COUNT( CASE WHEN origin_0.state='CA' THEN 1 END)  as ca_flights
```
I'm not sure how general these can be; presumably they can't be full where clauses like Malloy's normal where clauses.
Small one — or syntax: where: carrier: 'UA' | 'AA'

Nesting — similar to the discussion in our readme, this is executed quite well, with a nest block giving an array. So:

query: airports -> {
group_by: state
aggregate: airport_count
nest: by_facility is {
  group_by: fac_type
  aggregate: airport_count
}
}

...produces a nested array (SQL is mangled so showing the result)

[
{
  "state": "TX",
  "airport_count": 1845,
  "by_facility": [
    {
      "fac_type": "AIRPORT",
      "airport_count": 1389
    },
    {
      "fac_type": "HELIPORT",
      "airport_count": 435
    },
    ...

Joins — https://looker-open-source.github.io/malloy/documentation/language/join.html I rather like that it limits SQL's complexity — most joins in practice are left joins using primary keys. So this is nice:
```
explore: flights is table('malloy-data.faa.flights') {
join_one: origin_airport is airports with origin
join_one: destination_airport is airports with destination
join_one: aircraft with tail_num
}
```
This works because the primary keys has been defined above (similar to what we discussed on discord around defining relationships separately, reducing verbosity for inline joins):
```
explore: airports is table('malloy-data.faa.airports') {
primary_key: code
}
```
It is possible to do less standard joins as an escape hatch.
Timestamps — small one — they prepend timestamps with @, like { where: dep_time: @2003 to @2005 }, and have attribute access with functions (which seem to be both prefix and suffix?) dep_time.month / day_of_year(some_date)

That's all for now. I think it'd be good to keep in touch with the Malloy team and would welcome any cross-pollination.

max-sixty commented 2 years ago

Malloy have continued to add some great features, check out https://looker-open-source.github.io/malloy/documentation/language/expressions.html

brycefisher commented 2 years ago

@max-sixty I've spent a lot of time writing Kusto queries for Azure Data Explorer, and love the ability to tack on | chart .... to get a visualization in supported gui clients. I also love that the visualization information is kept separate as metadata returned from Azure Data Explorer servers, so each client can do their own rendering. This kind of eliminates the need for some amount of Jupyter notebooks because you can easily visualize results inline with your query using standardized syntax.

Are you consider some kind of results visualization for PRQL or is that outside the scope of this project, given the architecture?

max-sixty commented 2 years ago

Thanks for the comment @brycefisher ! I agree that's really useful for some workflows. As well as Malloy, Juttle does this well.

I'm definitely open to including something like this. The ideal structure at the moment might be to have some way of extending PRQL, and let people build on top of it — maybe with importable functions, or a view transform that was fairly unopinionated within PRQL.

To set expectations — at the moment I don't have a clear vision for what something good would look like, so it probably won't come from me without lots of people saying that PRQL's success is dependent on it — but that doesn't mean I think it's a bad idea!

How do you think about this?

brycefisher commented 2 years ago

Agreed! I wouldn't take one random persons request as an automatically good roadmap item.

I think time series analysis and easy ways of manipulating time in particular benefit from easy visualization. It probably makes sense to consider visualization related features as you mature time functions and time series.

max-sixty commented 2 years ago

Check this out: https://docs.google.com/presentation/d/18KUl_rrz2K-hbsiKJYS3rtTcYxZMXKklyPllLmTtIYY/mobilepresent#slide=id.g1269816dcbe_0_179

Some really cool stuff — it's has some excellent features around exploring data — in particular:

the indexing of dimension values, so browsing the dataset shows info like top values
the interaction from the results to the query — clicking a row of values can add those values as a filter in the query
having a View — e.g. chart / json — within the query (similar to @brycefisher 's suggestion). I'm sure people will be divided on this, but some sort of basic charting can be v helpful when exploring, even if you don't want to always couple these
while I found the interface for nesting confusing (requires "duplicating" the item, in the video), Malloy does lean heavily on nesting, and it seems good. As discussed elsewhere, we can do this easily with an as | array or to | array (or maaaybe a nest rather than a group, if we want to make the shape of the data obvious to people reading)
Malloy's SQL generation seems to have improved a lot

I'm admittedly much more of a fan of the pipeline concept from a language POV — here I get confused as to whether the order of the clauses matters. Indeed, I'm not sure it could be explained simply, and e.g. adding a filter after an aggregate doesn't seem possible (because it doesn't yet support HAVING as a different language item!).

But the browsing & exploring here is really great.

aljazerzen commented 2 years ago

That is really great for data exploration.

It would be awesome if we could join forces and create a language agnostic tool for such exploration. Relational/dataframe model has been around for years and we are just inventing new ways for querying it, so I don't see why we couldn't reuse the UI that works with relational models and uses any of the 3 query languages.

max-sixty commented 2 years ago

Very much agree @aljazerzen .

I don't want to fall into the mindset of looking at different ideas and saying "nah PRQL better" — but in this case, the benefit in constructing the query arguably come from arbitrarily adding transforms — and this is the thing that PRQL is great at given its pipeline, and Malloy is less strong at. e.g. the having example above. (Though SQL is terrible at, because adding an arbitrary transform can require a new CTE and query restructure.)

My vision for PRQL is that we would probably not build these sorts of things ourselves, but because we're an open and compatible language, we'd be the best tool for others building these sorts of interfaces...

The thing that Malloy starts on that PRQL doesn't is a semantic model for data — how the information maps to reality — e.g. what's a Dimension and what's a Measure, how is data related, how are metrics defined, etc. And that allows some cool features that contribute to Composer's strengths, like indexing dimensions and easier joins.

PRQL / prql

Cool things in Malloy #71