This is more a comment than an issue. When joining tables using subqueries (in order to help PaQu do the right thing), one needs to do a lot of manual aliasing. Here's an example triple join I wanted to do:
SELECT p.fofTreeId, p.treeRootId, p.treeSnapnum, p.mass, z.zred
FROM MDR1.FOFMtree AS p,
(SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree WHERE fofTreeId = 100000000) AS mycl,
(SELECT DISTINCT zred, treeSnapnum FROM MDR1.TreeSnapnums) AS z
WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
AND z.treeSnapnum = p.treeSnapnum
ORDER BY p.treeRootId, p.treeSnapnum
It executes correctly, but takes long (~ 6 min in this case), because PaQu first joins the TreeSnapnum-redshifts with the whole FOFMtree-table; only then it joins with mycl.
Thus I put the desired first step into a subquery and wrap the redshift-join around:
SELECT mb.fofTreeId, mb.treeRootId, mb.treeSnapnum, mb.mass, z.zred
FROM
( SELECT p.fofTreeId, p.treeRootId, p.treeSnapnum, p.mass
FROM MDR1.FOFMtree AS p,
(SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree WHERE fofTreeId = 100000000)
AS mycl
WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
) AS mb,
(SELECT DISTINCT zred, treeSnapnum FROM MDR1.TreeSnapnums) AS z
WHERE z.treeSnapnum = mb.treeSnapnum
ORDER BY mb.treeRootId, mb.treeSnapnum
This produces the error: Unknown column 'mb.treeSnapnum' in 'field list'.
Here is the full query plan:
-- CALL paquExec('SELECT `fofTreeId` AS `fofTreeId`,`mainLeafId` AS `mainLeafId` FROM MDR1.FOFMtree WHERE ( `fofTreeId` = 100000000 ) ', 'aggregation_tmp_95336591')
-- CALL paquExec('SELECT `p`.`fofTreeId` AS `p.fofTreeId`,`p`.`treeRootId` AS `p.treeRootId`,`p`.`treeSnapnum` AS `p.treeSnapnum`,`p`.`mass` AS `p.mass` FROM MDR1.FOFMtree AS `p` JOIN ( SELECT `fofTreeId`,`mainLeafId` FROM `aggregation_tmp_95336591` ) AS `mycl` WHERE ( `p`.`fofTreeId` BETWEEN `mycl`.`fofTreeId` AND `mycl`.`mainLeafId` ) ', 'aggregation_tmp_82862933')
-- CALL paquExec('SELECT DISTINCT `zred` AS `zred`,`treeSnapnum` AS `treeSnapnum` FROM MDR1.TreeSnapnums ', 'aggregation_tmp_28929384')
-- CREATE DATABASE IF NOT EXISTS spider_tmp_shard
-- USE spider_tmp_shard
-- CREATE TABLE spider_tmp_shard.aggregation_tmp_4510995 ENGINE=MyISAM SELECT `mb`.`p.fofTreeId` AS `mb.fofTreeId`,`mb`.`p.treeRootId` AS `mb.treeRootId`,`mb`.`treeSnapnum` AS `mb.treeSnapnum`,`mb`.`p.mass` AS `mb.mass`,`z`.`z.zred` AS `z.zred` FROM ( SELECT `p.fofTreeId`,`p.treeRootId`,`p.treeSnapnum`,`p.mass` FROM `aggregation_tmp_82862933` ) AS `mb` JOIN ( SELECT DISTINCT `zred`,`treeSnapnum` FROM `aggregation_tmp_28929384` ) AS `z` WHERE ( `z`.`treeSnapnum` = `mb`.`treeSnapnum` ) LIMIT 0
-- USE spider_tmp_shard
-- INSERT INTO spider_tmp_shard.aggregation_tmp_4510995 SELECT `mb`.`p.fofTreeId` AS `mb.fofTreeId`,`mb`.`p.treeRootId` AS `mb.treeRootId`,`mb`.`treeSnapnum` AS `mb.treeSnapnum`,`mb`.`p.mass` AS `mb.mass`,`z`.`z.zred` AS `z.zred` FROM ( SELECT `p.fofTreeId`,`p.treeRootId`,`p.treeSnapnum`,`p.mass` FROM `aggregation_tmp_82862933` ) AS `mb` JOIN ( SELECT DISTINCT `zred`,`treeSnapnum` FROM `aggregation_tmp_28929384` ) AS `z` WHERE ( `z`.`treeSnapnum` = `mb`.`treeSnapnum` )
-- CALL paquLinkTmp('aggregation_tmp_4510995')
-- CALL paquDropTmp('aggregation_tmp_95336591')
-- CALL paquDropTmp('aggregation_tmp_82862933')
-- CALL paquDropTmp('aggregation_tmp_28929384')
-- USE spider_tmp_shard
-- SET @i=0
-- CREATE TABLE cosmosim_user_kristin.`2015-01-09-11-10-23-9346` ENGINE=MyISAM SELECT @i:=@i+1 AS `row_id`, `mb.fofTreeId`,`mb.treeRootId`,`mb.treeSnapnum`,`mb.mass`,`z.zred` FROM `aggregation_tmp_4510995` ORDER BY `mb.treeRootId` ASC,`mb.treeSnapnum` ASC
-- CALL paquDropTmp('aggregation_tmp_4510995')
I don't really understand the error message, since there is no mb.treeSnapnum in the reformulated field lists, it's always mb`.`treeSnapnum` AS `mb.treeSnapnum. Maybe I'm missing something here. If I try something like using mb.treeSnapnum in the final WHERE instead, then I'll get an SQL syntax error:
ERROR 1054: Unknown column 'mb.treeSnapnum' in 'where clause'
Current workaround is to alias the columns myself explicitely nearly everywhere, but it's quite some overhead for many columns:
SELECT mb.fofTreeId as fofTreeId, mb.treeRootId as treeRootId, mb.treeSnapnum as treeSnapnum, mb.mass as mass, z.zred as zred
FROM
( SELECT p.fofTreeId as fofTreeId, p.treeRootId as treeRootId, p.treeSnapnum as treeSnapnum, p.mass as mass
FROM MDR1.FOFMtree AS p,
(SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree WHERE fofTreeId = 100000000)
AS mycl
WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
) AS mb,
(SELECT DISTINCT zred, treeSnapnum FROM MDR1.TreeSnapnums) AS z
WHERE z.treeSnapnum = mb.treeSnapnum
ORDER BY treeRootId, treeSnapnum
This query is fast and executes correctly. I just wished all the aliasing could be done automatically.
Any suggestions to handle this in a better way?
This is more a comment than an issue. When joining tables using subqueries (in order to help PaQu do the right thing), one needs to do a lot of manual aliasing. Here's an example triple join I wanted to do:
It executes correctly, but takes long (~ 6 min in this case), because PaQu first joins the TreeSnapnum-redshifts with the whole FOFMtree-table; only then it joins with mycl. Thus I put the desired first step into a subquery and wrap the redshift-join around:
This produces the error:
Unknown column 'mb.treeSnapnum' in 'field list'
. Here is the full query plan:I don't really understand the error message, since there is no
mb.treeSnapnum
in the reformulated field lists, it's alwaysmb`.`treeSnapnum` AS `mb.treeSnapnum
. Maybe I'm missing something here. If I try something like usingmb.treeSnapnum
in the final WHERE instead, then I'll get an SQL syntax error:ERROR 1054: Unknown column 'mb.treeSnapnum' in 'where clause'
Current workaround is to alias the columns myself explicitely nearly everywhere, but it's quite some overhead for many columns:
This query is fast and executes correctly. I just wished all the aliasing could be done automatically. Any suggestions to handle this in a better way?