doctrine / orm

Doctrine Object Relational Mapper (ORM)
https://www.doctrine-project.org/projects/orm.html
MIT License
9.92k stars 2.51k forks source link

Deprecate JOIN ... WITH keyword for join associations #10978

Open beberlei opened 11 months ago

beberlei commented 11 months ago

There are two ways to join entities and both can be restricted further using the WITH keyword:

  1. using join assocations.

    SELECT u, a FROM App\Entity\User u JOIN u.addresses a WITH a.country = 'DE'
  2. using range variable declaration:

    SELECT u, a FROM App\Entity\User u JOIN App\Entity\Address a WITH u.id = a.user AND a.country = 'DE'

The first one will fetch all addresses into the collection User::$addresses, the second one will cause a mixed result to emit both entities at the same time.

Case 1 will causes problems, because looking at a collection, you never know if its fully or only partially loaded. And reloading the collection is also not possible in a meaningful way. See this comment https://github.com/doctrine/orm/issues/2878#issuecomment-162361011

As such, we are looking to deprecate and remove case 1, and also deprecate restricting by an association that is fetched joined, say:

SELECT u, a FROM App\Entity\User u JOIN u.addresses a WHERE a.country = 'DE'

What the users mean to do is probably this:

SELECT u, a FROM App\Entity\User u JOIN u.addresses a WHERE u.id IN (SELECT a2.user FROM App\Entity\Address a2 WHERE a2.country = 'DE')
mpdude commented 11 months ago

Thoughts:

The very last DQL example given, SELECT ... WHERE ... IN (SELECT ...) is not problematic performance-wise or hit limits (like number of IDs possible), right?

Instead of

SELECT u, a FROM App\Entity\User u JOIN u.addresses a WHERE a.country = 'DE'

can I also write this?

SELECT u, address_fetch FROM App\Entity\User u JOIN u.addresses address_select JOIN u.addresses address_fetch WHERE address_select.country = 'DE'
mpdude commented 11 months ago

Considering the following use case:

A User has many Pictures, and of those only a very few have a featured property.

I'd like to display a gallery of many/all users with their "featured" pictures.

SELECT u, p FROM User u JOIN u.pictures p WITH p.featured = 1 will take one query, and I can iterate over the users and get the pictures I need.

Not fetch-joining the Picture but using collection filtering on each User's picture collection will take one extra query per user displayed.

I do see the first approach is brittle, since when iterating over the users and accessing their pictures, nothing in the code shows that the list of pictures is actually limited. It only is due to the way the collection was first initialized, and only works as long as the collection was uninitialized before.

SELECT u, p FROM User u JOIN Picture p WITH u.id = p.user AND p.featured = 1 gives me a different result structure, and I could no longer use my model's "natural" API to access the data, juggling with a mixed-mode result array instead.

sips-richard commented 10 months ago

Last week the change in https://github.com/doctrine/orm/pull/8391 broke one of our queries and I'm trying to get a better sense of the scope of these changes and whether there is more functionality that we rely on that will disappear in future.

The query in question uses the ORM with a one-to-many which we're left joining twice using WITH rel.priority = 1 and WITH rel.priority = 2 respectively (where rel may have 0, 1, or 2 records). Normally in the application this relationship is eager loaded as part of querying the parent entity. This particular query is for a larger data export though. We try to stick to the ORM where possible to avoid re-declaring loads of stuff as pure SQL all over the place.

For now I've redefined the relationships as lazy loaded. Using $query->setFetchMode() doesn't seem to override the defaults -- I'm guessing because it doesn't take the query hints into account during the SQL walking from where the exception is thrown? Is this intentional?

Is WITH going to disappear completely, and should I just ditch trying to re-use ORM for anything remotely edge case? If it's going to be better to declare stuff as pure SQL for problems like these then fair enough.

CabanesThibault commented 10 months ago

Hi, I also have some questions about thoses evolutions around the use of the 'WITH' keyword in queries.

It seems that all the encountered issues are related to the Collections, but what about the ManyToOne and OneToOne relationships?

Sometimes (It may be a bad practice?), to filter a query, I use a Dql innerJoin with a 'WITH' statement, so the request only returns the entities with a related entity that met the condition.

Let's say I have an App\Entity\Address with a MtO or OtO user and App\Entity\User with no 'address' field, the relation is only on the address entity with no Collections involved :

SELECT u FROM App\Entity\User u INNER JOIN App\Entity\Address a WITH a.user = u.id AND a.country = 'DE'

Isn't it a bit radical to totaly deprecate the WITH statement?

Should I change all of my queries that use it? (And was it a bad practice? ^^')

(Initially, I assumed that filtering on the relation would be more efficient than filtering in the where clause)

bakugo commented 10 months ago

Please, no more deprecations of essential features I rely on, at this rate I'm never going to upgrade to 3.x.

In addition to the use case mentioned by @CabanesThibault where you simply don't SELECT the joined association and only use it for filtering (which does not result in any partially loaded collections), I am making use of WITH with KnpLabs' Translatable to only load the translations for the language(s) I actually need during a given request instead of all of them, and if WITH is removed, I don't see any other way to avoid loading all languages at once (which would result in a noticeable performance hit in my case).

sips-richard commented 10 months ago

@bakugo I migrated my query that was using WITH to the DBAL query builder, but in my case it was a single report query and it didn't really require the ORM (it was just simpler to re-use all the ORM relationships than redeclare all the joins). You could do similar and set up a result set mapping to do the ORM hydration I think. But generally I think the DBAL query builder might be the way to go with any more complex use cases.

A greater source of frustration for me right now is that OneToMany with eager loading is broken as of 2.17 if you're using custom data types for your primary keys (e.g. binary UUID). I'm guessing is an unintended side effect of the other changes to the WITH logic (ref: https://github.com/doctrine/orm/issues/11088). We've had to pin at 2.16 for now.

beberlei commented 10 months ago

@sips-richard WITH is not going away generally, just maybe for some use-cases. We removed it from the 2.17 milestone because we didn't get it to work reliably yet.

@CabanesThibault your posted use case is exactly what the WITH keyword is for (point 2 in my post, range variable declarations). That is not going away.

@bakugo why not use an extra lazy collection in your case with indexBy? That would just fetch a single translation when you do for example $article->translations['en']->getTitle().

bakugo commented 10 months ago

@beberlei

why not use an extra lazy collection in your case with indexBy? That would just fetch a single translation when you do for example $article->translations['en']->getTitle().

Because that would run a query for each translatable, and I often render several hundreds of them per page so doing a query for each of them is out of the question. It's a waste of processing power regardless when WITH allows me to batch load the exact translations I need for a given request in a single query. Removing features like this might make sense when the the only use case you have in mind is a basic blog or whatever, but more demanding use cases exist.

beberlei commented 9 months ago

@bakugo there are still more than enough ways available for you to select the right translations with a single query, just this way is not supported anymore.

bakugo commented 9 months ago

@beberlei Could you describe some of them, then?

RobertMe commented 8 months ago

@beberlei could it be that https://github.com/doctrine/orm/commit/76fd34f76607b1b96f381377c1c51df292c759aa already partially removes support for JOIN ... WITH?

I'm currently running into an issue where I have a ManyToOne with fetch: eager and I'm getting the exception "Associations with fetch-mode=EAGER may not be using WITH conditions in ...".

And while I do understand your reasoning for collections as you mention in this tickets description (I already came to the same conclusion while reasoning about the error, that applying a WITH will ("eager") load a collection only partially) I think this reasoning doesn't really apply to ManyToOne associations?

I.e.: my use case is like:

#[ORM\Entity]
class Parent
{
    #[ORM\Column(type: 'string')]
    private $name;
}

#[ORM\Entity]
class Child
{
    #[ORM\ManyToOne(targetEntity: Parent::class, fetch: 'EAGER')]
    private Parent $parent;
}

where I would then like to run a query like SELECT c FROM Child c JOIN c.parent p WITH p.name = 'foo'. This fails because of the WITH + fetch: 'EAGER'. But as said, the issue you mention for which this is being deprecated doesn't apply to this case? As the WITH clause will not limit any collection (i.e.: the c.parent association to which the WITH is applied is not a collection), it will only limit the result set as a whole. So while as you mentioned it can be rewritten to a plain WHERE clause instead of a WITH (as I have currently done) I still think this is a BC break which also isn't mentioned in the UPGRADE.md. And furthermore changing it into a WHERE is only possible for INNER JOINs, but when the query would be a LEFT/RIGHT JOIN there obviously is a difference between applying a filter in the ON clause vs the WHERE clause.

So I think the following check should take the relation type into account, and exclude ToOne associations (/only include ToMany associations):

if ($relation['fetch'] === ClassMetadata::FETCH_EAGER && $condExpr !== null) {

This is by the way both happening on 2.17.1 and 2.17.2 (latest). While it still worked on 2.16.2.

ger86 commented 7 months ago

@bakugo , do you find one of the ways to solve your problem?

bakugo commented 7 months ago

@ger86 No, I have not, and judging by the lack of response to my previous question, I assume no other elegant solution exists.

fishbone1 commented 5 months ago

WHAT? Do I understand correctly? Please, no! This is a really really important feature you are removing here. I still hope that I misunderstood something here.

Case 1 will causes problems, because looking at a collection, you never know if its fully or only partially loaded

What does this even mean? Your entities are always just a subset of the whole database. Will you disallow the WHERE statement in the future because one would not know if a collection named $users does contain ALL users?

I do see the first approach is brittle, since when iterating over the users and accessing their pictures, nothing in the code shows that the list of pictures is actually limited.

Well it's the exact same argument like my $users case above where no one tells you that it doesn't contain ALL users. Just properly name your symbols. Don't call it pictures, but featuredPictures instead.

This will break our whole application. It's a big and complex application and JOIN...WITH is everywhere. We have languages, tenants and users who are only allowed to see the data they are authorized for. And we have great performance and always loved Doctrine for features like that. In the past, devs used for loops to filter out the invalid entries. For at least 10 years we taught them to use JOIN instead, in combination with WITH if needed.

I see your problem with it as well. What I don't like about JOIN...WITH is that behavior depends on whether you have a join or not. Yes, above I stated you could just rename picturesto featuredPictures. But this is not enough: If you access the User entity in another context (without JOIN) then suddenly featuredPictures will contain ALL pictures, because they are either lazily or eagerly loaded. Now think about the other use case, where users see only the datasets they are owning (or authorized for). It would be a big security issue if users can access foreign/unauthorized related datasets, just because a dev forgot a join.

So this is my proposal to solve that:

Please introduce another fetch type called MANUAL. If an association has fetch type MANUAL it's only allowed to access it if it's fetch joined. It means

Don't fetch it automatically. Neither eagerly nor lazily. I want to define manually how it's fetched

In the example above with featuredPictures - well you are not allowed to access it in this other context. If you really need ALL pictures, then you have to use another association (with LAZY/EXTRA LAZY/EAGER). On the other hand, WITH will still be allowed, but only together with fetch type MANUAL.

What do you think about it?

Wiktor6 commented 3 months ago

@bakugo how do you handle this problem? I'm also using Knp/Translatable, and now I'm updating Doctrine. I have a lot of Join::WITH to fetch tanslations only for one language, but now It's not possible using join with.

bakugo commented 3 months ago

@Wiktor6 I don't, I'm still on 2.x and will not be upgrading any time soon due to multiple removals of features that my application relies on without any replacements.

b3k commented 1 month ago

I just tried to upgrade a rather large application from Doctrine 2.x to 3.x, and it stopped working due to the "Associations with fetch-mode=EAGER may not be using WITH conditions in" error. That's how I found this post, and I'm shocked because using WITH in join conditions is a basic tool for optimal relation joining, and now it stops working?

Did anyone make some working fork of 3.x without this changes ?

oojacoboo commented 1 month ago

Case 1 will causes problems, because looking at a collection, you never know if its fully or only partially loaded. And reloading the collection is also not possible in a meaningful way.

@beberlei I get the issue here - fine. But the importance of WITH conditions is far greater IMO, than reloading some relationship collection. What is even the practical use case for reloading collections? Or, what's the practical use case for needing to know if a collection is fully or partially loaded? I'm sure there are some. But are they more important than WITH conditions on join statements...?

Why not just throw an exception if you try to reload or determine if a collection is fully loaded when it's ambiguous due to a JOIN WITH? This would be far better than just removing this functionality in a crude attempt to preserve this integrity.

Using JOIN WITH is a highly performant way of doing joins. If you're just pure joining on a large relationship, you could be joining on thousands or even millions of records (hopefully not, but certainly possible). You then have to filter those through a WHERE clause, which is less performant, or even worse, in PHP.

An EAGER fetch ensures that a JOIN is performed on relationships - sometimes this is required for data integrity reasons (INNER JOIN). You cannot always rely on LAZY fetching. This is especially the case if you've implemented filters and multi-tenancy.

What can be done about this? Is the expectation that everyone is to refactor out of DQL and into SQL to get around this mess? Are you really suggesting WHERE IN sub-select as the solution to this issue? That won't resolve the issue of a collection refresh in any "meaningful way", as the refresh would result in a population of all records of that relationship.

beberlei commented 1 month ago

@oojacoboo You are leading with the statement that you understand the issue and then follow up with a bunch of reasons showing me that you don't.

The problem here is simple:

  1. Code makes DQL query fetch joining association using WITH, loading an entity A with Collection A::$b - only filled with a subset of entities.
  2. Other part of code makes a EntityManager::find (or really any other query) for A and expecting A::$b to be the full collection.

It makes no difference to me if we have the 2.x behavior (not having the full collection) or your suggestion (throw an exception), they are both highly problematic, because a programmer cannot look at the code in (2) in any way to understand that it breaks.

WITH and collection fetching breaks the contract of the object relational mapper and thus is removed to reach better consistency. I agree that its sad this removes a performance optimization.

A few workarounds for now are:

  1. Using index-by on the collection and fetch extra lazy. Would fetch a single entity, making a query though.
  2. Using Collection::matching to get a subset of entities from the collection.

A fix for the future, but this would require a lot of planning, could be using JOIN WITH fetch collection queries to pre-fill a collection that is returned from ->matching() given the same conditions.

$a = $em->createQuery("SELECT p, c FROM Post p JOIN p.comments c WITH c.published = true")->getSingleResult();

$comments = $a->comments->toArray(); // makes a query
$comments = $a->comments->matching(Criteria::create()->where(Criteria::expr()->eq("published", true))); // makes no query, DQL already created this collection
beberlei commented 1 month ago

Again here, Doctrine 2.x is maintained for another 2 years at least, and we are still working on providing more forward compatible ways to migrate for these BC breaks, so I recommend people running into this to just stay on 2.x longer.

fishbone1 commented 1 month ago

@beberlei Could you please respond to my arguments as well? I guess I still don't understand the point. In my eyes you would have to forbid lots of other things such as WHERE according to your arguments as well. And also nested INNER JOINs, because they also could lead to incomplete subsets:

/* Select posts that have published comments with those published comments => Incomplete comments */
SELECT p
FROM Post p
JOIN p.comments
WHERE c.published = true
/* comment subsets completeness depends on whether there are comment responses: */
SELECT p, c, r
FROM Post p
LEFT JOIN p.comments c
INNER JOIN c.responses r

So you suggest as a fix to load everything into RAM and then filter it out aferwards? That's not a scalable solution at all. I'm starting to wonder if Doctrine is just not the right tool for our big applications (anymore). Is it just meant for small blog-like applications? Scalability always has been one of the most important factors for me. If you scale the applications bigger, it's not just a little "performance optimization" that you are removing here. On the other hand, why even bother using a high performance scalable relational database with such small blog applications in mind? In big business applications you won't get past the problem of subsets. @bakugo provided a good example with translations.

Did you think about my suggestion for an additional fetch type (MANUAL)?