elixir-ecto / ecto

A toolkit for data mapping and language integrated query.
https://hexdocs.pm/ecto
Apache License 2.0
6.19k stars 1.44k forks source link

New Feature: Adding default `where` clauses to schema #2395

Closed zachdaniel closed 6 years ago

zachdaniel commented 6 years ago

The start of the discussion is in the google group here: https://groups.google.com/forum/#!topic/elixir-ecto/dLsumjvNJ2I

The proposal for the interface:

    schema "foo" |> where(deleted: true) do
      ...
    end

Updates would need to be made to joins, from, and other places to ensure that these where clauses are always added.

I think the way forward there is to first modify everything to use __schema__(:query), and then set it up such that this change alters that query.

zachdaniel commented 6 years ago

Some alternative syntax perhaps:

schema "foo" do
  where :deleted_at, nil

  condition :deleted_at, nil

  where [deleted_at: nil]
end
michalmuskala commented 6 years ago

To be honest, I'm not so sure about this. I still have nightmares about default_scope from rails.

Querying itself is just one part of the story - what about creating structs and especially associations? Should they be created in a way that satisfies the query?

zachdaniel commented 6 years ago

I wouldn't want it to affect any stateful operations on the schema, but I do know what you mean. Perhaps this is a slippery slope.

zachdaniel commented 6 years ago

Perhaps we could just name it slightly differently? Something like:

schema "thing" do
  from_query [delete: false]
end

to make it clear that it only alters queries made against the resource?

We also would have to make some interesting decisions. Like what about Repo.delete_all(query)? Does that honor the clauses? Or not?

michalmuskala commented 6 years ago

I'm sorry, if I sounded too harsh.

The feature itself does seem valuable, and I often wanted something similar myself. What I'm worried about is exactly the interaction between this feature and all the other features - especially so, since that touches on something that might potentially affect almost everything else in ecto.

Maybe having some conventions around this first could be useful? Something like a Schema.query() function or something similar?

josevalim commented 6 years ago

@michalmuskala one option is to have a read-only schema - i.e. a view. Coincidentally, that's how I have been telling folks to implement this feature so far: create a database view where it shows only non-deleted stuff.

zachdaniel commented 6 years ago

We actually had a very extensive system set up for views, but it became very difficult to manage operationally (we have quite a bit of different API resources scoped in this way) and so we switched to having them all provide a query/0 function and using it everywhere. The views worked okay because you can actually insert into views as of recent versions of Postgres. But it became a pretty huge pain to manage them all.

zachdaniel commented 6 years ago

There are also some limitations when it comes to using a table vs a view. With a table, postgres knows about primary keys and allows you to only group by the primary key, even when you want additional rows that come with the record.

e.g

SELECT 
  posts.id, posts.name, posts.author_id, COUNT(comments.id) 
AS comment_count 
FROM posts
JOIN comments ON comments.post_id = posts.id
GROUP BY posts.id

That above query would work with a table, but it would not work with a view, even if the definition for that view was just SELECT * FROM posts

ericmj commented 6 years ago

The general Ecto philosophy has been to prefer explicit over implicit and avoid surprises. Because of this I think it's better to solve this with a function on the schema module:

def query() do
  where(__MODULE__, [deleted: false])
end
michalmuskala commented 6 years ago

I guess one problem with this is associations and preloads, which by default only load with the base schema. Maybe a way to specify constraints there would be more useful? Something like:

has_many :comments, Comment, query_with: &Comment.query/1

# in comment.ex
defp query(base) do
  where(base, deleted: false)
end
zachdaniel commented 6 years ago

In our case (not that I necessarily have a problem with this) we would by default include that query_with: &AssociatedModule.query/1 on every relationship in our system. It could lead to some pretty strange bug hunting if you add something to the query function but only certain associations honor it.

I still think that it might be the best way though.

Keep in mind also that the ability to change the query of an association will be used for a lot more than this, though, and should be considered in that context. I imagine people trying to use it to add polymorphic associations, and all sorts of other things.

OvermindDL1 commented 6 years ago

Honestly I'm -1 about this whole feature. It's not hard to build a query helper function and use that for all queries with the schema (it's what I do, and it becomes SO much easier to make once we get named joins! ^.^).

zachdaniel commented 6 years ago

@OvermindDL1 While I partly agree with what you're saying, we also face a problem of scale. We have 70k lines of code, and like 150 schemas, all of which express this .query() function. Remembering to call .query() every time you use from or join is easy (thats how we've set it up), sure, but bring that out to the dozens of times a day new queries are written, and then we have really difficult to find bugs because one was left out somewhere. Also the fact that we can't use the assoc helpers make our queries much less readable.

All over we have

from row in SomeSchema.query(),
  join: join_table_row in SomeSchemaJoinOtherSchema,
  on: join_table_row.left_id == row.id,
  join: other_row in ^OtherSchema.query(),
  on: join_table_row.right_id = other_row.id

and we'd really love to be able to use

from row in SomeSchema,
  join: other in assoc(row, :others)
OvermindDL1 commented 6 years ago

@OvermindDL1 While I partly agree with what you're saying, we also face a problem of scale. We have 70k lines of code, and like 150 schemas, all of which express this .query() function. Remembering to call .query() every time you use from or join is easy (thats how we've set it up), sure, but bring that out to the dozens of times a day new queries are written, and then we have really difficult to find bugs because one was left out somewhere. Also the fact that we can't use the assoc helpers make our queries much less readable.

I've made it a point to not use unwrapped from/join/etc... just for such reasons (nor do I use the assoc helpers as most of the time they just don't help yeah...). Most of my queries are of the form of (actual copy/paste):

profiles =
  Banner.query_profile(:processed, [employee: "F", pidms: pidms]
  |> Repo.all()

I have customized functions that sanitize input, accept arguments, etc... Very much like functions in SQL in that they are limited in what they accept but can build a very complex query (I'd prefer now to show one of those functions now as they are on the order of 'bloody huge!' currently, but that will super simplify once named joins lands). There are query's that I can mix and match over time as well (and that will be the form I probably fall back to quite a bit once named joins lands, but without named joins it is almost impossible to write those well without carrying along a state of what has been joined or not and branch based on that and the possible length and order... which is a lot more work currently, but again, won't be an issue with named joins).

But yeah, I just decompose everything to functions. Even inside those big ones are just functions that start with something like query_spriden(...).

Do note, every-single-table in the entire database (it's an ~15 year old oracle database, not under my control, or it would have a lot less stringly data like dates...) has no primary keys (a bit of a wtf I know, but they use the internal oracle id as the primary key and of course use it absolutely no where), has the data and there are columns for a (stringly) date where if it is null then it is the valid value, but if it is not null then it has been superseded by a non-null field or is just entirely deactivated. There is never a time a row is deleted, they are only ever added and a change is only ever adding that removed field to a row. Consequently I have to check that, that field is null on every access (to 100's of tables), and to make it better, only about 80% of the tables follow that format, the others use a 'code' to indicate active or not, a single-char string field where if it is 'A' it is active and if it is anything else it is inactive (except for about 3 tables that use a different 'active' character). Thus every single access of a table right down to it's base most queries of getting it as a root query and the base of adding it as a named join (using a complex set of macro's currently, detailed on the forums, that will be replaced with named queries once it lands as these macro's have factorial time compiling complexity) are all function calls. I don't forget to add a query for a given table because each table has accessor functions for how to access it and thus I just don't use ecto's query syntax except inside those functions and via where(...)'s and so forth (I.E. I never ever use a naked from or join that is unwrapped by functions).

As well as sometimes I still need to get the older versions of the rows (hence why the functions take optional arguments to be able to control how the queries get set up, that way I can acquire older versions like the next oldest, all, etc...).

It is a pain yes, but I don't think putting functions on the schema is the way to do it, rather just putting functions in your query'ing modules (I hate magic calls)...

zachdaniel commented 6 years ago

It sounds like our access patterns are pretty different. All of our resources are standardized and behave the same, and ecto's assoc helper maps very well to our conceptual model of them. The only thing that our resources expose about them that matters is their where clauses and I'd rather stay as "close to the metal" so to speak, as possible and just build standard ecto queries.

I'd also like named joins, but I don't think that they help us with this specific problem.

OvermindDL1 commented 6 years ago

I'd also like named joins, but I don't think that they help us with this specific problem.

Entirely true, it was a bit of side consequence of what I have to work with... ^.^;

For the problem though I'm just saying that standard wrapping functions is the way to go in a functional language. Adding magical callbacks that get called magically and unexpectedly starts reeking of ruby's weird injections all over the place though (and how would you override it, would you have to use a schema without it?), which makes code start to be hard to reason about ("wait what? where is that 'that' coming from in the SQL query, I didn't put it there!").

zachdaniel commented 6 years ago

Yeah, I know what you mean about the danger of magic, but for us really comes down to wanting to automate a global rule in our system. If we can't get it in a fashion like this, I'll probably write a custom from macro that expands assoc inline and alters the source of all of the schemas or something. I think its a pretty easy decision to not use a feature like this if it doesn't fit your design pattern/tastes. I also think its totally reasonable to not implement a feature because it might contribute to poor design in the projects that use Ecto. At the same time though, I can think of at least one perfectly good use case for it (my use case 😛).

OvermindDL1 commented 6 years ago

I think its a pretty easy decision to not use a feature like this if it doesn't fit your design pattern/tastes.

Quite true, but the same cannot be said of libraries that you may have to use or projects that you may inherit, those are what I'm worried about. :sweat_smile:

(or even from me 6 months in the future without touching it for that period...)

michalmuskala commented 6 years ago

What we're looking after here is something similar to database views, but in the application layer instead of the database layer. I don't thing you'll say database views are a "magic" feature. Why not use database views directly? As it was mentioned before in this thread, database views have some limitations: grouping is harder and versioning is painful.

zachdaniel commented 6 years ago

@michalmuskala I like that way of framing the problem. That perfectly sums up how we arrived at wanting this feature as well.

OvermindDL1 commented 6 years ago

@michalmuskala Isn't that precisely what a custom query function is? :-)

For the example given prior:

def query_join_view_only_valid(query) do
  query
  |> join(:left, [row: row],
      join_table_row in SomeSchemaJoinOtherSchema,
      join_table_row.left_id == row.id)
  |> query_join_view_other_schema()
end

Such functions are indeed views into the schema. :-)

michalmuskala commented 6 years ago

Not really, the problem is that we can't, as of right now, set up associations to those custom query functions or use it in other places where we use schema names directly.

zachdaniel commented 6 years ago

@OvermindDL1 its about simplifying the access pattern. If we can use assoc helper, and bake in these queries as configuration, then the access becomes very simple, and impossible to forget when building queries.

OvermindDL1 commented 6 years ago

@OvermindDL1 its about simplifying the access pattern. If we can use assoc helper, and bake in these queries as configuration, then the access becomes very simple, and impossible to forget when building queries.

That still brings up the question, what then when you need to see the 'deleted' ones, say for an admin log or so? How do you cancel that magic call, or do you have to make a whole copy of the schema without that call?

zachdaniel commented 6 years ago

@OvermindDL1 Hard to say. I personally think that, like views, it would be an anti pattern to bypass their conditions. You would only use this if it would make no sense at all to access resources on that schema without those conditions.

OvermindDL1 commented 6 years ago

You would only use this if it would make no sense at all to access resources on that schema without those conditions.

In that case I would probably use a MATERIALIZED VIEW in the database then, you get a full caching table with your view (just tell it to update when writing the backing tables) so you get full access on it again (though this is also a bit prohibitive if it is a ton of data).

Or just use a function to access the schema instead of using it by name directly, this is a functional language after all. :-)

zachdaniel commented 6 years ago

For us, for example, we have multiple schemas over the same table. We use them to represent slices and/or differing views over an API. For instance, we expose administrative endpoints that would access an /admin_posts whereas our users access /posts. By altering the base query of the schema under the endpoint, we're very protected from nonsensical data getting to that endpoint, like deleted posts, and also it allows the rest of our system to adapt to those conditions. For instance, if we include a calculated field on users which is the number of posts they've made, we do a join to the schema underlying posts, and just a simple count, and it adapts to not include deleted posts. Because those resources make zero sense in user-space outside of that context, its safe to apply it as a blanket clause.

zachdaniel commented 6 years ago

@OvermindDL1 if I said "every time you call foo, make sure you wrap it in a call to bar because it doesn't make sense without it", would you not ask me to redefine foo to include the call to bar? Being functional doesn't make this an invalid use case.

As for materialized views, just a regular view suffices, but has the drawbacks we discussed earlier.

OvermindDL1 commented 6 years ago

For us, for example, we have multiple schemas over the same table. We use them to represent slices and/or differing views over an API. For instance, we expose administrative endpoints that would access an /admin_posts whereas our users access /posts. By altering the base query of the schema under the endpoint, we're very protected from nonsensical data getting to that endpoint, like deleted posts, and also it allows the rest of our system to adapt to those conditions. For instance, if we include a calculated field on users which is the number of posts they've made, we do a join to the schema underlying posts, and just a simple count, and it adapts to not include deleted posts. Because those resources make zero sense in user-space outside of that context, its safe to apply it as a blanket clause.

Ooo, I'd use different schema's there (along with different functions) if there is that much different between those (especially in what is returned).

@OvermindDL1 if I said "every time you call foo, make sure you wrap it in a call to bar because it doesn't make sense without it", would you not ask me to redefine foo to include the call to bar? Being functional doesn't make this an invalid use case.

Actually I'd shuffle the foo away into a hidden namespace (I put hidden things in a :_detail_ namespace) and just expose the bar that returns the properly wrapped foo already. Or add an enforcement (credo is useful for this) that makes it so that a given schema cannot be used outside of a specific module if it still needs to be public. :-)

As for materialized views, just a regular view suffices, but has the drawbacks we discussed earlier.

As I recall, materialized views don't have the drawbacks that regular views suffer though (for reading)?

zachdaniel commented 6 years ago

Ooo, I'd use different schema's there (along with different functions) if there is that much different between those (especially in what is returned).

We do use different schemas. But when accessing them in queries, I call ThatSchema.query() every time.

Actually I'd shuffle the foo away into a hidden namespace (I put hidden things in a :detail namespace) and just expose the bar that returns the properly wrapped foo already. Or add an enforcement (credo is useful for this) that makes it so that a given schema cannot be used outside of a specific module if it still needs to be public. :-)

All I mean is that just because a language is functional doesn't mean that its against the rules to invert the relationship. If the language is functional, then its just functions on both ends. So if we have from call some designated function on schema's passed in, or the caller call that function, where is the difference? Its not magic as long as its explicitly configured and not derived behavior.

As I recall, materialized views don't have the drawbacks that regular views suffer though?

The real issue with views is not in the queries, but in the operational costs of managing and versioning them.

zachdaniel commented 6 years ago

I wonder if perhaps there is a way around the debate here, if we were to put the onus/configuration of this sort of thing into the Ecto.Queryable protocol? For my use cases, if I could tap into that, I could return a query if the module in question is one of my modules. That way we aren't really setting a standard so much as creating an escape valve for those who want it.

OvermindDL1 commented 6 years ago

Or an extension to the join syntax?

zachdaniel commented 6 years ago

@OvermindDL1 I think a critical aspect of this feature for me is being able to configure it in a central place, and for the calling syntax to be plain ecto.

zachdaniel commented 6 years ago

I'm going to get started on some implementation ideas here. An interesting thing is that we could actually support joins + selects in these queries, because of subquerying. We would want to make sure that they only used select_merge or map update syntax for selecting, and that the query's from is always the schema this is defined for. Then they could define fields w/ virtual: true and select more complex statements into them. This is something that I'd also like to get out of this feature, but that I can work around pretty easily in our system. More than willing to let that go, just an idea.

michalmuskala commented 6 years ago

Today the Ecto.Queryable protocol, whenever it needs to convert a module to a query, calls the mod.__schema__(:query) callback (which is implicitly generated by the Ecto.Schema.schema/2 macro).

If you wanted to hack it and abuse the internals, you could do something like:

defmodule Foo do
  use Ecto.Schema

  schama "foo" do
    field :x, :string
    # ...
  end

  defoverridable [__schema__: 1]
  def __schema__(:query), do: where(__MODULE__, x: "foo")
  def __schema__(arg), do: super(arg)
end

This should "just work" everywhere and requires no changes to ecto, but I wouldn't recommend it as a production grade solution.

Probably the simplest way to support it in a less hacky way in ecto directly, would be to expect a __query__/0 function instead, that would be defined as defoverridable out of the box. I think this also would make it pretty clear, it's only used for querying.

That said, I'm still torn on this and I can't say right now if a PR with this feature would be accepted or not.

zachdaniel commented 6 years ago

@michalmuskala There are actually a lot of places where queries are manually constructed, and __schema__(:query) is not called. I know this because what you put up there was pretty much exactly my first attempt at implementing this in our system. I also don't mind if I have to hack it to make it work, as we actually have a higher level DSL that defines an entire API endpoint, which we then use to define the schema and other relevant configurations at compile time (this might go against yours/others design sensibilities, given w/ the way Ecto was designed w/ separation in mind, but it works pretty well for us). With that in mind, its only hacky for us in one place.

So if a middle ground is just fixing the places that don't call __schema__(:query) to get the underlying query, and then me just hacking it like that on my end, I won't really be disappointed.

No worries if it doesn't end up getting approved, we'll definitely figure something out no matter what, and I understand that its important to keep Ecto well designed and healthy, given how central it is in the elixir community.

michalmuskala commented 6 years ago

A PR that fixes all the places that don't use __schema__(:query) will definitely be accepted. I would have expected it to be used everywhere. Do you remember off the hand some particular examples?

zachdaniel commented 6 years ago

Sure! The most apparent one is when calling from you can say from row in MyModule and it manually constructs the Ecto.Query struct, as opposed to calling the underlying __schema__(:query)

https://github.com/elixir-ecto/ecto/blob/master/lib/ecto/query/builder/from.ex#L58

zachdaniel commented 6 years ago

@jos I knew that probably wasn't going to end up being the way to go, but I figured I'd try to highlight the parts that currently don't use __schema__(:query). I'm wondering, what exactly is __schema__(:query) for?

josevalim commented 6 years ago

@zachdaniel it is still invoked when you call Ecto.Queryable directly. It is just the query compiler bypasses it in some cases for performance.

zachdaniel commented 6 years ago

@josevalim Got it. Hopefully we get a few more ideas on ways to integrate/achieve something like this! At this point, as inadvisable as it might be, I'm considering writing my own from macro that just sanitizes the expressions that we use when building queries, and then calls back down into Ecto's from macro. Its not too hard to find and scrub a few target expressions (like assoc and the schemas from join) and would get the job done.

josevalim commented 6 years ago

@zachdaniel maybe this is the best way to go:

defmodule MyApp.VisiblePost do
  def __schema__(:query) do
    from p in MyApp.Post, where: [deleted: false]
  end
end

But again this is not much different from having a function that you call.

zachdaniel commented 6 years ago

@josevalim yeah, having to call Ecto.Queryable.to_query(MyApp.VisiblePost) or MyApp.VisiblePost.__schema__(:query) doesn't do much for us since we're already defining a .query() on all of these modules.

josevalim commented 6 years ago

Ah yes, it wouldn't work because of the optimizations in the other PR.

zachdaniel commented 6 years ago

@josevalim so I had originally wanted to introduce the modifications of the underlying query when the query is built, in order to make them transparent, e.g showing up w/ an IO.inspect, and allowing introspection of those changes after constructing a query. However, it seems like the feature could be implemented when planning the query instead. I worry that it would be weird though. It lets people get into a bad state by screwing up the base query (or whatever we'd call it) and not finding out until they run the query.

zachdaniel commented 6 years ago

It would be a hassle to get assoc working doing something like this, but this is at least a functional proof of concept of wrapping Ecto's from macro to get the behavior that I'm looking for. https://gist.github.com/zachdaniel/6a05803f22ee0270fe4338c3e76a87a5

Not necessarily saying its a good idea, just looking for ways to meet my requirements :D

zachdaniel commented 6 years ago

Just wanted to see how hacky it would have to be on my end, so what I came up w/ is combining the above w/ this change in the schema modules:

      defoverridable [__schema__: 2]

      @spec __schema__(any, any) :: any
      def __schema__(:association, name) do
        association = super(:association, name)
        related = association.related

        if related in api_configs() do
          associated_query = related.query()
          %{association | queryable: associated_query}

        else
          association
        end
      end
      def __schema__(arg1, arg2) do
        super(arg1, arg2)
      end

To be clear though, I don't really want to do things in a hacky way like the above, but I wanted to see what my worst case implementation scenario would look like.

zachdaniel commented 6 years ago

Hey all! Any thoughts on the feasibility of this? Or even some tips on a direction I could maybe take it that has chances of getting approved?

josevalim commented 6 years ago

Hi @zachdaniel, I don't have anything new to share. I don't have a conclusion either. We have discussed this in the past too and we always reached a point we couldn't quite agree on the best to move forward (or even if there is one).

zachdaniel commented 6 years ago

Maybe we could come up with a solution if we thought a long the lines of providing an escape valve. For instance, a way to provide a hook/module to ecto's query planner, to define custom behavior if necessary. Something that wouldn't be recommended, but would be the place people get pointed to when they need to do weird query things. So like "we can't help you if you hook in there, we can't guarantee that you don't break your own queries", but that at least defines the cleanest form of that hook. I'm thinking of like being able to provide a function that takes a table, or takes a schema, and returns the query/new schema that we want it to use. Sounds like something that could be worked into the planner?