My concerns about _hasDataChanges and the huge performance impact

woutersamaey commented 3 years ago

In Magento, any time you use a setter, the property _hasDataChanges is set to true, even when you're not actually changing anything.

I.e. this code will set _hasDataChanges to true, even though we're not changing anything.

$stockItem->setQty($stockItem->getQty());

This is disturbing me, because it slows down Magento, causing unneeded save and MySQL stuff to run.

I wanted to open this issue as a way to discuss this.

On occasion, I have in fact overridden the setData() method for specific models so _hasDataChanges remains false. In my experience it can greatly improve stock qty import scripts. It's mind blowing how much speed can be gained.

My implementation looks like this:

public function setData($key, $value = null) {
        // In our version, we first check if we are actually changing something. Not setting the _hasDataChanges flag regardless.
        if (!is_array($key) && !is_object($value)) {
            $oldValue = $this->_data[$key] ?? null;
            if ($oldValue === $value) {
                // No need to change anything
                return $this;

            } elseif ($oldValue === (string) $value) {
                // Still no need to change anything
                return $this;

            } elseif (is_numeric($value) && is_numeric($oldValue) && '' . round($oldValue, 8) === '' . round($value, 8)) {
                // The number is the same, with up to 8 decimals accuracy
                return $this;
            }
        }

        return parent::setData($key, $value);
}

and

public function setOrigData($key = null, $data = null) {
        if (is_null($key)) {
            $this->_origData = $this->_data;

            // Added this line because we need to reset the previous "true" value.
            $this->_hasDataChanges = false;

        } else {
            $this->_origData[$key] = $data;
        }
        return $this;
}

Why not make this standard for all models? Yes, it is a bit tricky, but isn't it worth it given the huge speed gain that is possible?

One part of the "tricky" side is that events are also triggered. IMHO I'd say "fuck it". The speed gain is very important and you're not actually saving anything, so why should an event be triggered?

And yes, I am breaking backwards compatibility here.

Preconditions (*)

Any installation since the dawn of Magento

Steps to reproduce (*)

Use any setter on any DB-backed model and set to the same value already there.
Run save() on the model. It will save, even though there is no need for it.

Expected result (*)

Don't save at all, because there is no need, greatly improving speed!

Actual result (*)

It saves the existing data back to the DB like we're used to, also triggering events...

woutersamaey commented 3 years ago

The label "bug" is wrong, but I don't know how to change it.

tmotyl commented 3 years ago

Hi @woutersamaey Ive also wondered the same some time ago. Do you also observe performance improvement in some build in functionality ? Can you share some numbers?

I think that the first condition (values being equal) is safe. Im not sure how safe it would be testing the string cast.

woutersamaey commented 3 years ago

@tmotyl the string case is safe because MySQL treats everything like strings (and we're not supporting other DBs). Objects are also skipped, so there's not threat of __toString() here. I don't have any numbers on hand, and a lot depends on the actual number of saves you're doing.

Example: I've used this extensively when importing stock qties. Huge feeds with 10.000s of products are now practical again and don't force you into pure SQL logic, which has it's own downsides like not triggering events etc.

There's no way all 10.000 products change overnight, but you still need to process the entire feed. If only 100 items change, processing can take like 100 sec instead of 10.000 sec.

You can program around this, by checking the items in _data 1 by 1 before using the setter and avoiding the setter as much as possible, but this is a pain to program every time.

Besides this, there are more cases where Magento sets _hasDataChanges to true right off the bat, even when you're not done anything. An example to this is when you load a product + stock item in 1 go by using $products->setFlag('require_stock_items', true). The product is loaded and an item stock_item is in _data. IN the stock item, the _hasDataChanges is true right afetr the load. So yeah, Magento is also screwing up it's own performance. I'm working on this too.

woutersamaey commented 3 years ago

Updated my implementation by adding my version of setOrigData().

joshua-bn commented 3 years ago

Love the discussion about this.

Took me a minute to see that your code is safe. I missed the !object && !array part.

This would need very extensive testing IMO. Especially due to the issue of event propagation.

jfksk commented 3 years ago

I was just working on a similar problem.

Background: A shop with quite a bit of API traffic and the requirement of update-on-save for some indexes. Diving into the code, I came to a similar conclusion: change tracking is the problem: its either missing or broken. I implemented some change tracking for the most important API endpoints (yes, for now on top of it and not into the models) and for all indexers. The performance gains are pretty good. With all measures taken the overall thru-put increased >~4x (that depends on the number of store-views and the changed data. Some product attributes and stock-data is changed via the API)

Some observations:

Tracking w/ Varien_Object is next to impossible because Magento sets loads of flags that give no indication if the object is actually changed. On the other hand, there are things like Mage_CatalogInventory_Model_Stock_Item::getQtyCorrection which is evaluated on-save.
Mage_Core_Model_Abstract and Mage_Eav_Model_Entity_Abstract do behave differently. Mage_Core_Model_Abstract resets the _hasDataChanges flag on save, while Mage_Eav_Model_Entity_Abstract does not do it.
Mage_Eav_Model_Entity_Abstract does completely ignore the scope of the data changed.
Indexers are another beast. They do index everything that only remotely could be a concern - some no matter if and what needs to be indexed. That makes update-on-save so slow (especially for product-flat). Thats also a HUGE problem w/ the URL indexer when full re-index is performed.

IDK if there is a general solution to fix all of that (I dont think so...) but the bad change tracking is IMHO one of the reasons, why some model saves can be so slow. Its def. doable case by case but that's likely a big BC.
I didnt touch Varien_Object because it just wasnt cutting it for the shop but I am a bit afraid of type juggling (and rounding...) here as it impacts every model in the system

woutersamaey commented 3 years ago

Okay you’re touching on a few more topics. Let’s try to move step by step, like what I’m suggesting.

Rounding errors should not be an issue unless you’re using more than 8 decimal places. Magento only uses pricing up to 4 decimals so should be more than okay.

In my PR I’m fixing a case where all stock items are immediately seen as “changed” right after loading them. This is a baby step.

What’s currently bothering me is that Magento also likes to use setters and setData for all kinds of rubbish that is not part of the model! For example: Magento puts a key ‘product_name’ on the stock model even though this field is inappropriate.

Given some time, I believe we can clean all the cases and build a solid backend.

Some bravery would be required, but that’s exactly why I am building on M1 in the first place. Now that Adobe is out of the way, we can make a truely great piece of software.

woutersamaey commented 3 years ago

Perfect change tracking will indeed be impossible because of many before and afterSave shit that deals with pseudo-fields, but I’m fine with a 4x or more speed gain.

jfksk commented 3 years ago

@woutersamaey yes that was a bit much - I wanted to clarify where I am coming from ;) The 4x speed improvement is largely down to the now not running indexers. I did not impl. it for any import case but it prob. in the same ballpark compared to what you observed. I do not have any isolated numbers but noticed a big improvement when adding change tracking to the URL index because it does not change most of the time.

I feel a bit uneasy about impl. complex checks into Varien_Object because its used everywhere and its hard to tell what people are using it for. I couldn't tell if for example rounding wouldn't cause problems to somebody now, or down the line. But I have no concrete case... Maybe not...

Agreed that its not realistically possible to get rid of all flags and field that are not actually saved. As you said, only some are just "chatter". What I did, for example, with the stock-item is to compare origData and Data while incorporating all db-fields and qty_correction and ignoring all the rest. That works in my specific cases no matter how the stock-item was created but I'm not sure if that would work in all cases. Prob. not... I think a general (whats reflected in the db) and a specialized (what else triggers a needed save) case is needed model by model to truly get around needless saves (and indexations in a second step). That could be done step by step. For example first only in general for models that are bound to a table and specifically for the stock item. With a proper interface it could also be used for indexation.

Thats why I'd leave the Varien_Object alone (or maybe ad a strict check for good measure). But its A LOT harder to properly do it model by model...

kiatng commented 3 years ago

Reference Mage_Core_Model_Abstract::save():

    /**
     * Check whether model has changed data.
     * Can be overloaded in child classes to perform advanced check whether model needs to be saved
     * e.g. usign resouceModel->hasDataChanged() or any other technique
     *
     * @return boolean
     */
    protected function _hasModelChanged()
    {
        return $this->hasDataChanges(); // My note: which simply returns $this->_hasDataChanges.
    }

    /**
     * Save object data
     *
     * @return $this
     */
    public function save()
    {
        /**
         * Direct deleted items to delete method
         */
        if ($this->isDeleted()) {
            return $this->delete();
        }
        if (!$this->_hasModelChanged()) {
            return $this;
        }
//...

IMHO, it's more robust to skip saving by having protected function _hasModelChanged() in the model class, as original intended per comment:

Can be overloaded in child classes to perform advanced check whether model needs to be saved

However, the search shows only 4 models that implemented it.

colinmollenhour commented 3 years ago

There are many changes that could be made here.. I think looking for the least invasive first we might consider just changing specific areas that are strongly impacted by meaningless saves. For example:

$model->setSomeId($some->getId()); // bad

if ($model->getSomeId() != $some->getId()) { // good
    $model->setSomeId($some->getId());
}

What is the particular stack trace for the case you were referring to?

Also, the setData method is called a lot so I think adding a lot of logic to it could have a significant negative impact. It would probably be better to do the comparison at the time of the save rather than in the setter, for example in _prepareDataForSave, but this would be pretty invasive. This could also be used to have only the changed columns passed to the UPDATE statement rather than all columns as is currently the case. Since Magento almost never uses exclusive row locks this would also reduce race conditions (e.g. one process is updating column A and another is updating column B, but they both conflict because all data is rewritten instead of just the changed data).

joshua-bn commented 3 years ago

@colinmollenhour by doing the check at save, you remove the ability to force a save. That's a good feature to keep.

In the grand scheme, doing a comparison in PHP is extremely fast.

colinmollenhour commented 3 years ago

Here is a great example: in Mage_Sales_Model_Order when loading the items collection (which is very often) it will loop over the items calling setOrder:

    public function setOrder(Mage_Sales_Model_Order $order)
    {
        $this->_order = $order;
        $this->setOrderId($order->getId());
        return $this;
    }

So, just the act of loading an order and the items collection and then calling save will result in every order item being overwritten again. This can easily be fixed like so:

    public function setOrder(Mage_Sales_Model_Order $order)
    {
        $this->_order = $order;
        if ($this->getOrderId() != $order->getId()) {
            $this->setOrderId($order->getId());
        }
        return $this;
    }

This avoids the _hasDataChanges flag being set so the order items will not be updated.

colinmollenhour commented 3 years ago

@colinmollenhour by doing the check at save, you remove the ability to force a save. That's a good feature to keep.

I've used this method in a very large project and never needed to force a save on a model that has no changes but if you need to it can be done easily:

$model->setDataChanges(true); // force model to be considered dirty
$model->setOrigData([]); // force all data fields to be considered "dirty"

In the grand scheme, doing a comparison in PHP is extremely fast.

Doing one, yes, but doing 5 N M can add up fast (e.g. 1000 product models each with 50 attributes is 250k comparisons).

joshua-bn commented 3 years ago

250k comparisons is still very little. If you're loading 1000 products with 50 attributes each, the tiny fraction of a second it will take to check if the attributes are the same is negligible.

But, I don't think it's worth debating. I am in favor of pretty much anything that does less database IO here.

midlan commented 3 years ago

I spoted the same thing few weeks ago. I wanted to open the same issue here :)

I was also checking how they deal with that in Magento2 and it looks they fixed it.

fballiano commented 2 years ago

Hi everybody, what could we decide about this issue? Is PR https://github.com/OpenMage/magento-lts/pull/1307 still relevant or we'll implement something different?

colinmollenhour commented 2 years ago

@fballiano I pushed PR #2066 which doesn't necessarily preclude the proposed changes in this task but I think does mostly address them with what I think should be the least invasive set of changes and the least likely to have a negative impact on performance (very positive in some cases where full saves are avoided).

OpenMage / magento-lts