Improve handling of joins

carlomanf commented 1 year ago

The query transformation process currently does nothing with joins, but it would be worth adding some handling. Consider the following query:

SELECT * FROM `wp_term_relationships` NATURAL JOIN `wp_term_taxonomy` WHERE `object_id` = 1

The plugin will replace the relationships table with the blog that corresponds to post ID number 1, but it will replace the other table with a union because it has no post ID column.

Similarly:

SELECT * FROM `wp_term_relationships` NATURAL JOIN `wp_term_taxonomy`

In the above case, both tables will be replaced by unions and the unions then joined.

The first example would be better handled by carrying through the blog replacement to all joined tables, eliminating any unions, while the second example would be better handled by unioning the joins rather than joining the unions. In both cases, the term relationship "mismatch" problem described at #11 would be solved, but only in the case that the original query actually uses joining.

Whether this will need to be done as part of #11 remains to be seen, but it will be a good enhancement in any case to eliminate unnecessary unions and spurious join results.

carlomanf commented 1 year ago

While working on #11, it became clear that adding a join would be an appropriate way to solve the "mismatch" issue, in the case that a join is not already present. Therefore, the pre-existing presence of a join needs to be checked beforehand, which suggests that it would make sense to handle these issues together.

carlomanf commented 1 year ago

In both cases, the term relationship "mismatch" problem described at #11 would be solved, but only in the case that the original query actually uses joining.

To be clear, an example of a query that does not use joining would be:

SELECT * FROM `wp_posts` WHERE `ID` IN (SELECT `object_id` FROM `wp_term_relationships` WHERE `term_taxonomy_id` = 1)

Currently, the relationships table will be correctly replaced, but it may fetch mismatched post ID's, which then filter through to the outer query to return the wrong posts from the wrong posts table.

The solution is to join the relationships table with its corresponding posts table, which will cause any mismatched post ID's to be dropped and thus not filter through to the outer query. Care should be taken not to change or invalidate the column set in the subquery, particularly if it is * or includes term_taxonomy_id without specifying the table.

carlomanf commented 1 year ago

The first example would be better handled by carrying through the blog replacement to all joined tables, eliminating any unions, while the second example would be better handled by unioning the joins rather than joining the unions.

The task for example query 2 has turned out to be a conundrum. The union needs to be inside a subquery in order for the limit and count clauses to work, but a join inside a subquery will throw an error if it contains duplicate column names, (e.g. term_taxonomy_id) so I am currently unsure how to proceed with this task.

As mentioned earlier, the task for example query 3 also has some hazards of its own, so it might be best to focus on the task for example query 1, which I think represents the majority of affected queries anyway.

carlomanf commented 6 months ago

0b07b0a appears to have fixed the queries it was designed to fix, but it also seems that I was wrong about most of the queries falling under case 1. It actually seems that most are under case 2, which should not be too difficult to fix for queries that don't select any duplicate column names from joined tables, but I can see that some core queries do select duplicate column names.

Also, the natural join example in the issue description actually does not get transformed for some reason, although other types of join do. I will investigate what's happening with the natural join when I get time.

carlomanf commented 3 months ago

Care should be taken not to change or invalidate the column set in the subquery, particularly if it is * or includes term_taxonomy_id without specifying the table.

The union needs to be inside a subquery in order for the limit and count clauses to work, but a join inside a subquery will throw an error if it contains duplicate column names, (e.g. term_taxonomy_id) so I am currently unsure how to proceed with this task.

As mentioned earlier, the task for example query 3 also has some hazards of its own

These actually seem to be the same hazards so they could probably be solved together.

carlomanf commented 2 months ago

It seems that I am finally seeing a way to solve this issue totally. The solution is to semi-join the relationships table with either the posts table, the taxonomy terms table, or both, depending on whether there is an ID being queried or not. If an ID is being queried, its table does not need to be semi-joined. If there is no ID, the query can proceed to a union as it already does, but the relationships table needs to be semi-joined with both other tables within its union.

As mentioned before, the addition of the semi-joins must not change or invalidate the query if * or term_taxonomy_id are being selected. Specifying the relationships table name or alias should fix this.

If my testing is correct, this should give correct output for both example query 2 and 3.

carlomanf / wp-super-network

Improve handling of joins #19