atk4 / data

Data Access PHP Framework for SQL & high-latency databases
https://atk4-data.readthedocs.io
MIT License
273 stars 46 forks source link

Duplicating non-stable field expr. in query - is completely wrong and can produce wrong results #682

Open mvorisek opened 4 years ago

mvorisek commented 4 years ago

Of course, I am thinking about optimal data modelling, not about implementation. Implementation is always possible.

Because of separation of Model and Persistence. Each persistence handles the JOIN in a child class

that is the same as with Union - we need universal nesting support from Model which could result in join, union, wrap.

The current design is actually completely wrong as presented a few days ago: Imagine field with expr RAND() > 0. This field is currently rendered multiple like:

SELECT RAND() > 0 FROM DUAL WHERE RAND() > 0

it is not only rand, any expression like SELECT name FROM t LIMIT 1 is unstable as it has no explicit ORDER. Even if we (in theory, we normally do no even know them) add all table columns as a fallback order, queries like SELECT name, COUNT(*) FROM t ORDER BY name LIMIT 1 will be still unstable because of grouping first.

This is major bug which must be addressed.

Any non-trivial expression must SELECT under an alias ONCE and then this alias must be used anywhere else - https://stackoverflow.com/questions/35695171/remove-duplicate-sub-query

DarkSide666 commented 4 years ago

This is important to think about and also this is not that simple to change and fix. That's because there can be anything in expression field. It can be subquery, some fieldname, function call etc.

What resulting select you would like to see instead of this ?

-- foo = field expression: RAND()
SELECT RAND() as foo FROM DUAL WHERE RAND() > 0
mvorisek commented 4 years ago
-- foo = field expression: RAND()
SELECT RAND() as foo FROM DUAL WHERE RAND() > 0

should ALWAYS be queried like

SELECT RAND() as foo FROM DUAL HAVING foo > 0

non trivial fields (or to be precise fields that are non-constant and not using constant functions) that are used multiple times must be firstly selected and then reused/conditioned using HAVING

sql render using nesting is wrong, we must render to some intermediate form and then build sql once

DarkSide666 commented 4 years ago

Nested queries should be much more faster that using having which executes on full result-set. On other hand, in some cases (like mentioned above) using having is a solution.

But again, it will perform much slower. For example, this simple case:

SELECT
    (select name from person where id=e.person_id) AS emp_name
FROM employee e
WHERE 
    (select name from person where id=e.person_id) = 'John'

This is probably faster than using having in case we have millions of employees.

Even more complexity shows up if we use both - where and having in same query. And what if conditions are nested?

mvorisek commented 4 years ago

@DarkSide666 this should use join if 1:1, if 1:N, your query will fail (bacause subquery will return more than 1 row)

MySQL and other DBs like Oracle or MSSQL engines can probably optimize some HAVING clauses the same like subqueries

nested conditions - you must query data and then you can use constant functions/junctions/operators like you want

DarkSide666 commented 4 years ago

Yes, this example is for simple hasOne case and standard title_field.

mkrecek234 commented 1 year ago

Can confirm that - if you have more complex hasOne/hasMany table structures, Atk4/Data gets very slow in case you add fields to child model from parent, grand-parent and grand-grand-parent tables via hasOne->addFields(...). This requires optimization, otherwise for performance reason it is much faster to load parent/grand-parent tables only where needed for single records, rather than adding those (grand-)parent fields straight into the model. If we render SQL correctly, this should not slow down SQL performance though.

mvorisek commented 1 year ago

@mkrecek234 can you please post here a repro code like in https://github.com/atk4/data/issues/1078#issuecomment-1357370201 for your nested usecase?

mkrecek234 commented 1 year ago

@mvorisek Here you go - sorry not designed as a test but normal code:

<?php

use Atk4\Data\Persistence;
use \Atk4\Data\Model;
use \Atk4\Data\Model\AggregateModel;

include '../vendor/autoload.php';

$db = new \Atk4\Data\Persistence\Sql('mysql:host=localhost;dbname=atk4;port=3306;user=root;password=root');

/* Create database 'atk4' in localhost MySql
 *

CREATE TABLE `account_group` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(32) DEFAULT NULL,
  `sign` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `account_group` (`id`, `name`, `sign`)
VALUES
    (1, 'Acount Group', -1);

CREATE TABLE `account` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(32) DEFAULT NULL,
  `account_group_id` int unsigned DEFAULT NULL,
  `fx_rate` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `account` (`id`, `name`, `account_group_id`, `fx_rate`)
VALUES
    (1, 'Test Account', 1, 1.2);

CREATE TABLE `booking` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `account_id` int unsigned DEFAULT NULL,
  `booking_group` varchar(32) DEFAULT NULL,
  `value` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `booking` (`id`, `account_id`, `booking_group`, `value`)
VALUES
    (1, 1, 'Group1', 123),
    (2, 1, 'Group1', 456),
    (3, 1, 'Group2', 789);

 */
class AccountGroup extends \Atk4\Data\Model {
    public $table = 'account_group';

    protected function init(): void
    {
        parent::init();
        $this->addFields(['name', 'sign' => ['type' => 'float']]);;
    }
}
class Account extends \Atk4\Data\Model {
    public $table = 'account';

    protected function init(): void
    {
        parent::init();
        $this->addFields(['name', 'fx_rate' => ['type' => 'float']]);
        $this->hasOne('account_group_id', ['model' => [AccountGroup::class]])->addField('sign');
    }
}

class Booking extends \Atk4\Data\Model {
    public $table = 'booking';

    protected function init(): void
    {
        parent::init();
        $this->addFields(['booking_group', 'value' => ['type' => 'float']]);
        $this->hasOne('account_id', ['model' => [Account::class]])->addFields(['fx_rate', 'sign']);

    }
}

    $accountModel = new Account($db);
    $bookingModel = new Booking($db);

    $bookingAggregate = new AggregateModel($bookingModel);
    $bookingAggregate->setGroupBy(['booking_group'], [
        'total' => ['expr' => 'sum([value])'], ]);

    foreach ($bookingAggregate as $booking) {
        echo 'Result: ' . $booking->get('booking_group') . ' | ' . $booking->get('total');
    }

    echo 'That worked..';

    $bookingAggregate = new AggregateModel($bookingModel);
    $bookingAggregate->setGroupBy(['booking_group'], [
        'total' => ['expr' => 'sum([sign] * [value])'], ]);

    foreach ($bookingAggregate as $booking) {
        echo 'Result: ' . $booking->get('total');
    }
    echo 'That did not work..';
mvorisek commented 1 year ago

it seems you somewhat copied the AggregateModel testcase from aggregate issue

subquery like:

      (
        select
          `fx_rate`
        from
          `account` `_a_8a089c2a7e6c`
        where
          `id` = `booking`.`account_id`
      ) `fx_rate`

should be quite fast as long as the selection is done on unique index, any reasonable SQL optimizer should detect it is a 1:1 join query

In chat you have written:

Just sonething I noticed: When you have multi-layered hasOne relations (so parents tables, grand-parent tables and grand-grandparents tables) and if you always addFields of the respective higher levels, so in your child table you have a lot of fields from grand/grand/grand-tables, Atk4/data gets terribly slow. I stopped doing this and even load fields later, instead of adding them theough hasOne relations. In SQL this would not be so slow IMHO, because you do: SELECT child.field, parent.field, grand-parent.field from child join parent on (…) join grand-parent on (…) join grand-grand-parent on (…) Atk4/data seems to create separate sub-query selects for each added field.

It would be good if you can provide the actual slow query and table metrics

mkrecek234 commented 1 year ago

@mvorisek The example above shows a nesting, as we cascade the field down from AccountGroup via Account to Booking. If this was not what you meant, please specify what you need. On the performance I do not have performance metrics, but I can tell that complex models with down-cascaded fields slow down simple load and save model commands to a good extent.

mvorisek commented 1 year ago

so no AggregateModel above is needed and (new Booking($db))->executeCountQuery() has the same performance problem?

mvorisek commented 11 months ago

⚠ even more conterintuitive: https://sqlite.org/forum/forumpost/8202db3916