doctrine / phpcr-odm

Doctrine PHPCR ODM
http://www.doctrine-project.org/projects/phpcr-odm.html
MIT License
183 stars 97 forks source link

Query builder seems unable to produce a fully working left join #658

Open sarcher opened 9 years ago

sarcher commented 9 years ago

I am attempting to use a LEFT JOIN behavior on child documents, but it seems to be impossible based on what I've seen.

I have the following situation:

Take the following query builder as an example:

$searchTerm = '12345';

$query = $queryBuilder
    ->fromDocument('Path\To\Product', 'p')
    ->addJoinLeftOuter()
        ->right()->document('Path\To\ProductDetail', 'pd')->end()
        ->condition()->child('pd', 'p')->end()
    ->end();
    ->andWhere()
        ->orX()
            ->like()->localName('p')->literal('%' . $searchTerm . '%')->end();
            ->andX()
                ->eq()->field('pd.type')->literal('title')->end()
                ->fullTextSearch('pd.content', $searchTerm)
            ->end()
        ->end()
    ->end()
    ->getQuery();

Now, PHPCR-ODM produces a JCR-SQL2 query like:

SELECT * 
FROM [nt:unstructured] AS p 
LEFT OUTER JOIN [nt:unstructured] AS pd ON ISCHILDNODE(pd, p) 
WHERE (((LOCALNAME(p) LIKE '%12345%' OR (pd.type = 'title' AND CONTAINS(pd.content, '12345'))) 
  AND (p.[phpcr:class] = 'Path\To\Product' OR p.[phpcr:classparents] = 'Path\To\Product'))
  AND (pd.[phpcr:class] = 'Path\To\ProductDetail' OR pd.[phpcr:classparents] = 'Path\To\ProductDetail'))

This all works as expected, if and only if the Product nodes contain at least one ProductDetail child -- in other words, it seems to behave as an INNER JOIN. Any Product node that has no such child will not be included in the results of the query, because of two things:

  1. The query is specifying SELECT * rather than SELECT p.*, which is requiring a result for the pd alias
  2. The query is appending an AND clause at the end, which will only match if the alias pd is present and of the appropriate class

The query that I actually want is this:

SELECT p.* 
FROM [nt:unstructured] AS p 
LEFT OUTER JOIN [nt:unstructured] AS pd ON ISCHILDNODE(pd, p) 
WHERE (((LOCALNAME(p) LIKE '%12345%' OR (pd.type = 'title' AND CONTAINS(pd.content, '12345'))) 
  AND (p.[phpcr:class] = 'Path\To\Product' OR p.[phpcr:classparents] = 'Path\To\Product')))

This works as one would expect a LEFT JOIN to, delivering results whose node names match "12345" regardless of whether or not they contain one or more ProductDetail children.

What I cannot figure out is if it is possible to get to my desired query; is it? This seems to be a bug, but perhaps I am misunderstanding the intent here.

If this is intended, is there a workaround to achieve what I am trying to do? I realize I could just pass the JCR-SQL2 query to the session (my environment uses Jackrabbit) and then manually invoke hydration, but that seems a bit ugly. :)

dantleech commented 9 years ago

From what you describe this would indeed seem to be a bug in the QueryConverter class. A first step would be to add a functional test for this scenario and get the fix in before the upcoming 1.3 release.

sarcher commented 9 years ago

I added a very simple example of the scenario. It uses the existing joins functional test to demonstrate that we have a CmsUser "anonymous" with no associated CmsAuditItems. Querying for that user while performing a left join to the CmsAuditItem document still produces no results (the companion test to this, which does not exist, would be an inner join that correctly produces no results).

The actual query generated is currently:

SELECT * FROM [nt:unstructured] AS a 
LEFT OUTER JOIN [nt:unstructured] AS b 
ON a.username=b.username 
WHERE ((a.username = 'anonymous' 
  AND (a.[phpcr:class] = 'Doctrine\Tests\Models\CMS\CmsUser' OR a.[phpcr:classparents] = 'Doctrine\Tests\Models\CMS\CmsUser')) 
  AND (b.[phpcr:class] = 'Doctrine\Tests\Models\CMS\CmsAuditItem' OR b.[phpcr:classparents] = 'Doctrine\Tests\Models\CMS\CmsAuditItem')) 
  ORDER BY a.username ASC

I think it should be something like:

SELECT a.* FROM [nt:unstructured] AS a 
LEFT OUTER JOIN [nt:unstructured] AS b 
ON a.username=b.username 
WHERE ((a.username = 'anonymous' 
  AND (a.[phpcr:class] = 'Doctrine\Tests\Models\CMS\CmsUser' OR a.[phpcr:classparents] = 'Doctrine\Tests\Models\CMS\CmsUser'))   
  AND (b.[phpcr:class] = 'Doctrine\Tests\Models\CMS\CmsAuditItem' OR b.[phpcr:classparents] = 'Doctrine\Tests\Models\CMS\CmsAuditItem' OR b.[phpcr:class] IS NULL)) 
  ORDER BY a.username ASC

This one would actually work with a SELECT * rather than a SELECT a.* because we aren't returning a result set that includes any of the aliased b documents. However, when the result set includes, for example, two of a, one of which includes a b, an exception will be thrown on that dual result:

  [PHPCR\RepositoryException]                                 
  Attempting to get the path for a non existent selector: b

That of course should be another test, but I wanted to illustrate this with something simple to confirm it is actually a bug.

sarcher commented 9 years ago

I was able to make a few changes to the ConverterBase and ConverterPhpcr classes that allowed my proposed query to be generated. Unfortunately, though, even with the addition of OR b.[phpcr:class] IS NULL] it still does not work, and after a lot of research I can't determine if this is something that JCR-SQL2 can even do.

The obvious solution is to use a jcr:primaryType other than nt:unstructured, but that would prevent any change made here from being a more general solution, as the PHPCR-ODM default is to use un-typed nodes. As far as I can tell right now, a fully functioning left join between nt:unstructured nodes does not seem possible, though I am not sure if this is a feature of JCR-SQL2, a bug in Jackrabbit, or just something I am doing wrong.

Let me know if this could be more clear; I also posed the JCR-SQL2 question on StackOverflow.

dantleech commented 9 years ago

Thanks for the effort. Using the jcr:primaryType would, imo, also be the best thing to do ideally. Unfortunately the Jackrabbit implementation the Java side) does not allow the jcr:primaryType to be modified easily (it requires that you basically manually delete the repositotry iirc). Doctrine-DBAL does not suffer from this limitation (nor does Jackrabbit OAK, but we don't have a client for that yet).

But I will also try and look at this and see if we can't find a workaround here when I have some spare time.

sarcher commented 9 years ago

If you could review my most recent commit above, I'd be curious to hear if this could be a valid strategy that could be a candidate for a PR.

What this does is add a third optional parameter to the document() method called $stronglyTyped, which if true will allow the query to omit the class and classparents portion of the query for that document only. That means that a real outer join is possible if a document is mapped to its own type, which provides a real workaround to this problem for anybody who encounters it.

This solves the problem I was struggling with in my own code base by allowing such a join, provided I migrate the affected nodes to a new type.

This could still use a few more tests and a documentation update, which I'm happy to provide if this is an acceptable idea.

dbu commented 9 years ago

thanks for working on this @sarcher ! i do think this is a bug and not intended.

does the "The query that I actually want is this:" work as it should?

if so, to solve the original problem, we would need the query converter to handle left (and right?) join specifically by adding the prefix to SELECT and make the pd.class pd.classparent in a way that its also allowed to not exist. something like pd.[phpcr:class] IS NULL OR pd.[phpcr:class] = .... if pd does exist, you do want it to be of the right class i think. i see lots of tests that check "IS NOT NULL" but none that does "IS NULL". there might indeed be a bug with jackalope-doctrine-dbal on that. does it work with jackalope-jackrabbit?

if we can manage this, it probably makes more sense than adding a new type of query. though using types could be an interesting thing regardless of this bug.

sarcher commented 9 years ago

That is actually the problem; if the query includes the pd.[phpcr:class] IS NULL OR ... it never matches. I have not been able to discover if this is a bug in Jackalope or a problem with Jackrabbit itself. When I run the query directly from the command line (e.g. ./bin/phpcrodm phpcr:workspace:query "...") it fails.

That was definitely my first approach though, just to see if we could add the IS NULL to the class/classparents check, which would seem the most reasonable and least invasive solution. In the absence of that though, if (and only if) you have a unique type mapped to a document, you can omit the class/classparents check, which is what I have done for now.

I was motivated to make something work in the short term to unblock my project, which is now working with the commit above, though I am certainly interested in a long term/non-fork solution. Maybe if it is a Jackalope problem, we can go back to simply adding the IS NULL to the check.

dbu commented 9 years ago

well, then lets wrap up the case for the select by type first if that is already working for you.

dantleech commented 9 years ago

Instead of adding a parameter, could we not automatically use the PHPCR primary type IF it has been defined? I can't really think of any reason why that wouldn't be the optimal strategy in most if not all cases.

sarcher commented 9 years ago

My thought around that was that there are two problems:

As it relates to my use case, I am fine with either direction and could implement either one.

Thanks for helping me find a solution here!

dantleech commented 9 years ago

Reading metadata will/should not incur a significant overhead, metadata is cached and we read from it hundreds of times per-request. Also we could cache queries. For the second point, the current behavior is wrong, so it's a bug fix - I would be OK with this /cc @dbu

On 19 September 2015 03:57:17 BST, Shane Archer notifications@github.com wrote:

My thought around that was that there are two problems:

  • In order to do this, we would need to look at all mapped documents each time we build a query in order to determine whether or not one is mapped to a unique type. It seemed as though this might introduce a performance concern, but, maybe that is not a big deal (I am uncertain of the real world cost of reading in the class metadata). Of course, we could also just do this for any type that isn't nt:unstructured, but I feel like this could introduce very confusing problems for more casual developers who do not understand the mechanics of the JCR protocol. Maybe this is fine though.
  • It is feasible to me that somebody out there has implemented one of the existing outer joins and is relying on this incorrect behavior. So I thought that by changing the default behavior, there is an increased risk of an unintentional BC break by including results that would not have previously been included. Instead, by adding a parameter we guarantee that all existing implementations are unaffected. Maybe this is a fine thing to fix, though, and we don't have to worry about this case.

As it relates to my use case, I am fine with either direction and could implement either one.

Thanks for helping me find a solution here!


Reply to this email directly or view it on GitHub: https://github.com/doctrine/phpcr-odm/issues/658#issuecomment-141611785

Sent from my Android device with K-9 Mail. Please excuse my brevity.

dbu commented 9 years ago

i am not sure if the overhead is that small. reading metadata of a loaded document is easy. but finding the metadata of all mapped documents could be more work and might not even be possible, i am not sure. doing it for any specific type sounds dangerous. one option instead of an argument could be to add something to the metadata instead of on query. then you only do this once and gain performance with it in queries. (it could be used not only for join but for all selects)

regarding BC breaks: every bugfix of something that was behaving incorrectly has the potential to screw up somebody who relied on the incorrect behaviour. doing such a fix in a minor version, with a note in the changelog seems good to me. if they wanted a full join they should not have used a left join, sorry.

dantleech commented 9 years ago

Why would we need to load all the metadatas? We only need the metadata for documents being selected from in the query (if I have understood), and this metadata is already loaded for each joined source (I believe from a quick scan of the code).

My opinion here is that adding "workarounds" to the interface (API or metadata) is bad, but would prefer the explicit metadata approach to the API change and would prefer the implicit "workaround" with the metadata to the explicit metadata approach (if it has no significant performance penalty). And finally I would prefer that it just worked to all of the above :)

dbu commented 9 years ago

Why would we need to load all the metadatas?

if we want to know whether a phpcr type is unique, we need to check all documents to see if any other uses the same type.

sarcher commented 9 years ago

Yea, that is correct. If we just looked to see "does this have a type other than unstructured" then it would lead to some very weird and bad situations if somebody happened to map two documents to the same type (which is otherwise a very possible scenario).

Perhaps the best solution would be to add a new class metadata property, calculated at the time class metadata is cached, which indicates whether or not a type is unique in the space of all known, mapped documents. I am not entirely sure how to do this, though, since it looks like metadata is calculated and cached on a class-by-class basis (and that is one area of Doctrine I have not really played around with).

That would certainly make the user experience better, since all you need to do is map something to a unique type and then everything just works as expected.

dbu commented 9 years ago

in production mode, there is a cache warmer that does all the work upfront and looks in all known places for documents. i don't know exactly how, but here it could work.

in debug mode, the metadata and proxies are done on the fly, when a specific class is needed. i fear that it could be really expensive as each time any class changed we would have to re-scan everything. or we would need to build some sort of cache that we then use to check which classes/metadata changed since the cache was built.

if we ask the user to annotate documents he thinks have a unique type and have a validation command for that information that we use when cache warming and can be triggered manually. that way, the information is available fast and without too much overhead. this would have the added benefit of being explicit that something is supposed to be unique. this will prevent weird issues when somebody copy-pastes a document to do a new document and is confused why his left joins stop working...

sarcher commented 9 years ago

So to summarize, it sounds like we would:

If that is a reasonable solution, I can begin to take a stab at it.

dbu commented 9 years ago

sounds good to me. great that you want to work on this! and thanks for including the doc in this list, very important !

the validator should be a class that can be used independently, and then we provide a command that triggers this validator. the cache warming (not sure if that is here or in the bundle) then uses a service, not the other command.

apart from this your list seems pretty complete to me. adding tests would be nice - ideally also one where validation fails, but that will be interesting to do without breaking all other tests :-)

can you create a new issue with this? or directly a pull request. use * [ ] ... to get checkboxes in the description that you can click to mark steps as done.

sarcher commented 9 years ago

Thanks for the info -- I will create the PR once I have some code laid down, I have one other project that got in front of this but hope to get to it soon.

dbu commented 9 years ago

great!

sarcher commented 9 years ago

I squashed everything, rebased, and opened #667 with the changes to this repository. Will move forward with the other work in phpcr-bundle and phpcr-odm-documentation.

sarcher commented 9 years ago

Work is now complete in #667 and ready for review. Thanks for the guidance!

dbu commented 9 years ago

with #667, it is now possible to outer join based on the node type, when using unique node types.

with the fixes on jackalope, i think the OR NULL should now work as well: https://github.com/doctrine/phpcr-odm/issues/658#issuecomment-138124740 - the exception you describe is exactly what we get rid of in https://github.com/jackalope/jackalope-jackrabbit/pull/127 / https://github.com/jackalope/jackalope-doctrine-dbal/pull/308 (see the test https://github.com/phpcr/phpcr-api-tests/pull/174 that is now working)

if you could have a look at that, would be awesome! then outer join would work even without the unique node types.