adrpar / paqu

A parallel query engine for MySQL + Spider Engine built on a fork of shard-query
GNU General Public License v2.0
10 stars 0 forks source link

Comment: field not found in select list with nested subqueries #36

Open kristinriebe opened 9 years ago

kristinriebe commented 9 years ago

This is more a comment than an issue. When joining tables using subqueries (in order to help PaQu do the right thing), one needs to do a lot of manual aliasing. Here's an example triple join I wanted to do:

SELECT p.fofTreeId, p.treeRootId, p.treeSnapnum, p.mass, z.zred
FROM MDR1.FOFMtree AS p,
  (SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree WHERE fofTreeId = 100000000) AS mycl,
  (SELECT DISTINCT zred, treeSnapnum FROM MDR1.TreeSnapnums) AS z
WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
AND z.treeSnapnum = p.treeSnapnum
ORDER BY p.treeRootId, p.treeSnapnum

It executes correctly, but takes long (~ 6 min in this case), because PaQu first joins the TreeSnapnum-redshifts with the whole FOFMtree-table; only then it joins with mycl. Thus I put the desired first step into a subquery and wrap the redshift-join around:

SELECT mb.fofTreeId, mb.treeRootId, mb.treeSnapnum, mb.mass, z.zred
FROM 
 ( SELECT p.fofTreeId, p.treeRootId, p.treeSnapnum, p.mass 
    FROM MDR1.FOFMtree AS p,
       (SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree WHERE fofTreeId = 100000000) 
        AS mycl
    WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
 ) AS mb, 
(SELECT DISTINCT zred, treeSnapnum FROM MDR1.TreeSnapnums) AS z
WHERE z.treeSnapnum = mb.treeSnapnum
ORDER BY mb.treeRootId, mb.treeSnapnum

This produces the error: Unknown column 'mb.treeSnapnum' in 'field list'. Here is the full query plan:

-- CALL paquExec('SELECT `fofTreeId` AS `fofTreeId`,`mainLeafId` AS `mainLeafId` FROM MDR1.FOFMtree WHERE ( `fofTreeId` = 100000000 ) ', 'aggregation_tmp_95336591')
-- CALL paquExec('SELECT `p`.`fofTreeId` AS `p.fofTreeId`,`p`.`treeRootId` AS `p.treeRootId`,`p`.`treeSnapnum` AS `p.treeSnapnum`,`p`.`mass` AS `p.mass` FROM MDR1.FOFMtree AS `p` JOIN ( SELECT `fofTreeId`,`mainLeafId` FROM `aggregation_tmp_95336591` ) AS `mycl` WHERE ( `p`.`fofTreeId` BETWEEN `mycl`.`fofTreeId` AND `mycl`.`mainLeafId` ) ', 'aggregation_tmp_82862933')
-- CALL paquExec('SELECT DISTINCT `zred` AS `zred`,`treeSnapnum` AS `treeSnapnum` FROM MDR1.TreeSnapnums ', 'aggregation_tmp_28929384')
-- CREATE DATABASE IF NOT EXISTS spider_tmp_shard
-- USE spider_tmp_shard
-- CREATE TABLE spider_tmp_shard.aggregation_tmp_4510995 ENGINE=MyISAM SELECT `mb`.`p.fofTreeId` AS `mb.fofTreeId`,`mb`.`p.treeRootId` AS `mb.treeRootId`,`mb`.`treeSnapnum` AS `mb.treeSnapnum`,`mb`.`p.mass` AS `mb.mass`,`z`.`z.zred` AS `z.zred` FROM ( SELECT `p.fofTreeId`,`p.treeRootId`,`p.treeSnapnum`,`p.mass` FROM `aggregation_tmp_82862933` ) AS `mb` JOIN ( SELECT DISTINCT `zred`,`treeSnapnum` FROM `aggregation_tmp_28929384` ) AS `z` WHERE ( `z`.`treeSnapnum` = `mb`.`treeSnapnum` ) LIMIT 0
-- USE spider_tmp_shard
-- INSERT INTO spider_tmp_shard.aggregation_tmp_4510995 SELECT `mb`.`p.fofTreeId` AS `mb.fofTreeId`,`mb`.`p.treeRootId` AS `mb.treeRootId`,`mb`.`treeSnapnum` AS `mb.treeSnapnum`,`mb`.`p.mass` AS `mb.mass`,`z`.`z.zred` AS `z.zred` FROM ( SELECT `p.fofTreeId`,`p.treeRootId`,`p.treeSnapnum`,`p.mass` FROM `aggregation_tmp_82862933` ) AS `mb` JOIN ( SELECT DISTINCT `zred`,`treeSnapnum` FROM `aggregation_tmp_28929384` ) AS `z` WHERE ( `z`.`treeSnapnum` = `mb`.`treeSnapnum` ) 
-- CALL paquLinkTmp('aggregation_tmp_4510995')
-- CALL paquDropTmp('aggregation_tmp_95336591')
-- CALL paquDropTmp('aggregation_tmp_82862933')
-- CALL paquDropTmp('aggregation_tmp_28929384')
-- USE spider_tmp_shard
-- SET @i=0
-- CREATE TABLE cosmosim_user_kristin.`2015-01-09-11-10-23-9346` ENGINE=MyISAM SELECT @i:=@i+1 AS `row_id`, `mb.fofTreeId`,`mb.treeRootId`,`mb.treeSnapnum`,`mb.mass`,`z.zred` FROM `aggregation_tmp_4510995` ORDER BY `mb.treeRootId` ASC,`mb.treeSnapnum` ASC 
-- CALL paquDropTmp('aggregation_tmp_4510995')

I don't really understand the error message, since there is no mb.treeSnapnum in the reformulated field lists, it's always mb`.`treeSnapnum` AS `mb.treeSnapnum. Maybe I'm missing something here. If I try something like using mb.treeSnapnum in the final WHERE instead, then I'll get an SQL syntax error: ERROR 1054: Unknown column 'mb.treeSnapnum' in 'where clause'

Current workaround is to alias the columns myself explicitely nearly everywhere, but it's quite some overhead for many columns:

SELECT mb.fofTreeId as fofTreeId, mb.treeRootId as treeRootId, mb.treeSnapnum as treeSnapnum, mb.mass as mass, z.zred as zred
FROM 
 ( SELECT p.fofTreeId as fofTreeId, p.treeRootId as treeRootId, p.treeSnapnum as treeSnapnum, p.mass as mass
    FROM MDR1.FOFMtree AS p,
       (SELECT fofTreeId, mainLeafId FROM MDR1.FOFMtree WHERE fofTreeId = 100000000) 
        AS mycl
    WHERE p.fofTreeId BETWEEN mycl.fofTreeId AND mycl.mainLeafId
 ) AS mb, 
(SELECT DISTINCT zred, treeSnapnum FROM MDR1.TreeSnapnums) AS z
WHERE z.treeSnapnum = mb.treeSnapnum
ORDER BY treeRootId, treeSnapnum

This query is fast and executes correctly. I just wished all the aliasing could be done automatically. Any suggestions to handle this in a better way?