getavalon / core

The safe post-production pipeline - https://getavalon.github.io/2.0
MIT License

Hierarchical data #410

Open BigRoy opened 5 years ago

BigRoy commented 5 years ago

Issue

Provide an easy way to set or get "hierarchical data". There are two approaches to this problem:

  1. Set downstream: whenever setting the data, make sure it's also set and updated downstream, so that children assets inherit the changes made on parents. Writes become slightly heavier on those occasions, but queries are far more optimal, as the asset itself fully encapsulates the data. In MongoDB design this is called "embedding" the data, as opposed to linking by reference.

  2. Query upstream: whenever the asset lacks a specific piece of data, query upstream for it. This introduces quite a bit more querying, which could turn out slow when querying the data for a lot of assets (it also makes optimized queries slightly more involved). However, that could be optimized.

Example for point 1. Setting values downstream

Setting data downstream is trivial.

from avalon import io

def set_data_hierarchical(id, data, traversal="data.visualParent"):
    """Update the asset's .data, including downstream children.

    Args:
        id (io.ObjectId): Database document id.
        data (dict): The key/values to update.
        traversal (str): The database data key used to traverse
            downstream. Defaults to "data.visualParent".

    Returns:
        None

    """

    assert isinstance(data, dict), "data must be dict"

    document = io.find_one({"_id": id, "type": "asset"})
    assert document is not None, "No asset found with id: %s" % id
    document["data"].update(data)
    # Save back into the database
    io.save(document)

    # Traverse downstream: children reference this asset's id
    # through the traversal key (e.g. data.visualParent)
    for child in io.find({traversal: id, "type": "asset"}):
        set_data_hierarchical(child["_id"],
                              data=data,
                              traversal=traversal)

However, it's important to know when to set it and when not to: a child could already have an override of its own that you'd want to keep. I believe this is best addressed with good UI design where it's easy for users to decide and tweak which children get updated too, as also noted here.

Example for point 2. Querying data upstream and optimizing it.

A simple example of a hierarchical upstream query can be found here, where the FPS is queried on the asset and, when not found, queried from the project.
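The upstream fallback described above can be sketched roughly as follows. This is a hypothetical helper (the name `get_data_upstream` and the in-memory `documents` dict are made up for illustration); in Avalon the dict lookups would be `io.find_one()` calls instead.

```python
# A minimal sketch of option 2: if an asset lacks a key, walk up the
# hierarchy until a parent (or the project) provides it. `documents`
# stands in for the database here.

def get_data_upstream(documents, doc_id, key, traversal="visualParent"):
    """Return data[key] for the document, falling back to its parents."""
    while doc_id is not None:
        data = documents[doc_id].get("data", {})
        if key in data:
            return data[key]
        # Key not set on this document; continue with its parent (if any)
        doc_id = data.get(traversal)
    return None  # Not found anywhere upstream


# Example hierarchy: the project defines fps, the shot does not.
documents = {
    "project": {"data": {"fps": 25}},
    "seq010": {"data": {"visualParent": "project"}},
    "sh010": {"data": {"visualParent": "seq010"}},
}
print(get_data_upstream(documents, "sh010", "fps"))  # 25 (inherited)
```

Note that a value set directly on the shot would simply win, since the loop returns before ever consulting the parent.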

A potential optimization when querying for hundreds of assets in one go (including upstream traversal) is to keep a temporary cache of the assets' and parent assets' data.

# pseudocode that is not implemented currently
from avalon import io

with io.cached_queries():
    # Here whenever a query is done through `io.find_one` or 
    # `io.find` the result is cached based on *args+**kwargs
    # passed to the functions. If there's a cache hit the
    # previously retrieved value would get returned as opposed
    # to the database getting queried.

    # First query in this context, so no cache yet.
    a = io.find_one({"_id": custom_id_a})

    # A second query with the same arguments hits the stored
    # cache value and does not result in a database query.
    a = io.find_one({"_id": custom_id_a})   # no database query

    b = io.find_one({"_id": custom_id_b})   # database query

    # As such, when you query upstream for tons of assets, any
    # previously retrieved results do not trigger database queries.

An example of caching functions with decorators can be found here, and it is also described here.
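A minimal sketch of what such an `io.cached_queries()` helper could do internally (hypothetical, not the actual implementation): wrap the query function so repeated identical queries within a context are served from a dict instead of the database. `find_one` below is a stand-in for the real database call.

```python
import json

def make_cached(find_one):
    """Memoize `find_one` by its serialized filter argument."""
    cache = {}

    def cached_find_one(filter):
        # Serialize the filter into a hashable cache key, preserving
        # key order; equal-but-reordered filters then simply miss the
        # cache (an extra query, never a wrong result).
        key = json.dumps(filter, default=str)
        if key not in cache:
            cache[key] = find_one(filter)
        return cache[key]

    return cached_find_one


calls = []

def find_one(filter):
    calls.append(filter)  # pretend this is a database round-trip
    return {"_id": filter["_id"], "data": {}}

cached = make_cached(find_one)
cached({"_id": "a"})
cached({"_id": "a"})  # served from cache, no second round-trip
cached({"_id": "b"})
print(len(calls))     # 2 database hits for 3 lookups
```

A real implementation would also need to clear the cache when the context exits, so later code never sees stale documents.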



mkolar commented 5 years ago

Both options are very viable in our case I think; however, we need to think about which situations this would be used in most. From the discussion in #406 it seems this applies mostly, if not only, to asset data. That means the only time retrieval of many assets at once would be applicable is when building a UI that tries to show all the information at once. Please correct me if you can think of other bulk uses, as I can't seem to think of any.

The only ones I can think of always operate on a single asset, like setting these attributes in work scenes and during publish, which operate on a single asset in the vast majority of cases. Hence speed (even though it should certainly be optimised by caching) doesn't sound like the core consideration here. Querying the data from parents would, however, be a lot more flexible, without the need for extra tools for continuous distribution of the attributes from top to bottom.

Setting data downstream is trivial.

Technically yes; architecturally, to make it behave as expected, not really. Figuring out overrides based only on differing values between parent and child is not robust enough (we learnt this the hard way with a client). We had the same idea and it was our first implementation of hierarchical attribute sync from ftrack to Avalon. However, it turns out that a quite common situation is setting an attribute on a project (it gets inherited by all), but later wanting to change the project apart from some assets that you want to freeze at their current values. If you explicitly set all the attributes on all entities, you no longer possess the information about which should be changed and which should be frozen when updating from the top.

Taking the other approach (querying from parents), you always know that if a value doesn't exist it should be inherited from the parent, and if it does, it's the one to use, whatever it might be.

There is of course also the option to make these attributes a bit more complex and introduce some kind of metadata that denotes whether the attribute has an override or not. I'm not a big fan of that though.

mkolar commented 5 years ago

Another question to tackle is how to differentiate between standard and hierarchical attributes. We cannot do this for all data, as that would potentially introduce tons of false negatives and false positives in the system.

The easy option of course is making a list of all of them somewhere in the system and, every time an attribute is queried, checking against this list to decide what to do with it. I'm not sure where the best place to put that information would be.
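That registry idea could look roughly like this. Everything here is hypothetical (the key names, the `HIERARCHICAL_KEYS` set, and the `resolve` helper are all made up): a flat list of keys that inherit, consulted only when the key is missing on the entity itself.

```python
# Keys registered as hierarchical (assumed names for illustration).
HIERARCHICAL_KEYS = {"fps", "resolutionWidth", "resolutionHeight"}

def resolve(asset_data, parent_data, key):
    """Return the value for `key`, inheriting only registered keys."""
    if key in asset_data:
        return asset_data[key]       # explicit value, possibly an override
    if key in HIERARCHICAL_KEYS:
        return parent_data.get(key)  # hierarchical: fall back to parent
    return None                      # standard key: no inheritance


print(resolve({}, {"fps": 24}, "fps"))           # 24 (inherited)
print(resolve({"fps": 25}, {"fps": 24}, "fps"))  # 25 (override kept)
print(resolve({}, {"label": "x"}, "label"))      # None (not hierarchical)
```

Because the registry only gates the fallback, standard attributes can never produce a false "inherited" value, which addresses the false-positive concern above.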

davidlatwe commented 5 years ago

I would vote for option 2, since it's much more straightforward and simple, but without the optimization.

I'm a bit worried about that io.cached_queries. If we only allow _id to differentiate caches, then I think we are safe. If not, then we might need to be careful when caching nested document queries, because different field orders may lead to different results.

For example:

>>> collection.insert_one({"x": {"a": 5, "b": 8}})
>>> collection.insert_one({"x": {"b": 8, "a": 5}})
>>> # Two different field order documents inserted.
>>> # Query by native dict
>>> for d in collection.find({"x": {"b": 8, "a": 5}}):
...     print(d)  # Returned only one document
... 
{'_id': ObjectId('5d4937a0bf1c25dbe8bdd20e'), 'x': {'b': 8, 'a': 5}}
>>> # Query by field path
>>> for d in collection.find({"x.b": 8, "x.a": 5}):
...     print(d)  # Returned both documents
... 
{'_id': ObjectId('5d493796bf1c25dbe8bdd20d'), 'x': {'a': 5, 'b': 8}}
{'_id': ObjectId('5d4937a0bf1c25dbe8bdd20e'), 'x': {'b': 8, 'a': 5}}
>>> # However in Python
>>> {"a": 5, "b": 8} == {'b': 8, 'a': 5}
True
BigRoy commented 5 years ago

I'm a bit worried about that io.cached_queries. If we only allow _id to differentiate caches, then I think we are safe. If not, then we might need to be careful when caching nested document queries, because different field orders may lead to different results.

That would only be a valid workflow if you'd never filter to specific data and always query the full asset.

>>> collection.insert_one({"x": {"a": 5, "b": 8}})
>>> collection.insert_one({"x": {"b": 8, "a": 5}})

Are you sure this is doing what you're expecting it is doing? :)

print({"a": 5, "b": 8})
print({"b": 8, "a": 5})

They print exactly the same: you think you're creating differently ordered dictionaries, but you actually aren't. Or aren't you running this in Python? A dictionary is unordered in Python.

{'_id': ObjectId('5d493796bf1c25dbe8bdd20d'), 'x': {'a': 5, 'b': 8}}
{'_id': ObjectId('5d4937a0bf1c25dbe8bdd20e'), 'x': {'b': 8, 'a': 5}}

This, I think, you would never get as a result, right? Or at least not in Python with pymongo? Isn't it a regular dictionary, and thus without a specific order that you can rely on?

I think you might not be referring to Python code.


Ah, it is possible to get it sorted if you really needed it to: https://emptysqua.re/blog/pymongo-key-order/

However, note that as far as I know you never actually query based on key order with pymongo. As such, the arguments can be cached as you'd like. Additionally, even if the cache produces a false negative, it's just less optimal; I don't see how you could get false positives. If you pass the function the same arguments, and ignore database changes in the meantime, then the result must be the same as before, so you can take it from cache. Correct?

davidlatwe commented 5 years ago

Ah, sorry. I forgot to mention it's the behavior in Python 3.6+.

You are right if using Python 2.7, but starting from Python 3.6, dicts preserve insertion order.

So if using PyMongo with Python 3.6+, those two documents actually preserved their key order when inserted into Mongo, and Mongo will see them as different documents if you query with a full nested dict.

This should be an edge case, but it might happen.

BigRoy commented 5 years ago

but starting from Python 3.6, dicts preserve insertion order.

Thanks, good to know!

However, looking at your code, isn't the final result still as expected? The query against the pymongo database doesn't care about the order of the keys; it just queries whether they are there. So if we can somehow allow the cache lookup to understand it was the same dict (even if ordered differently, similar to how dict_a == dict_b), then the resulting queries would still make sense when cached. Even more, it would just return the original result of the previous query, so the sorting from the database would still come through as expected. Correct?

I'm not seeing yet where it would actually break the functionality.

To be entirely clear, the caching as described is solely for querying data.

iLLiCiTiT commented 5 years ago

Why not use a stupidly simple dictionary to store them? A dictionary with _id as key and the entity document as value, so you can reuse them when necessary.

Result: entities won't be queried multiple times, and the dictionary won't pollute later queries with old data.

It can also be modified so that the entity's data stores resolved values downwards, making hierarchical value lookups for each next entity even faster.
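The "_id keyed dictionary" idea above can be sketched roughly like this (hypothetical: the `EntityCache` name is made up, and `find_one` stands in for `io.find_one`):

```python
class EntityCache:
    """Per-operation cache keyed by document id."""

    def __init__(self, find_one):
        self._find_one = find_one  # stand-in for io.find_one
        self._by_id = {}

    def get(self, doc_id):
        if doc_id not in self._by_id:  # first lookup: hit the "database"
            self._by_id[doc_id] = self._find_one({"_id": doc_id})
        return self._by_id[doc_id]


calls = []

def find_one(filter):
    calls.append(filter)  # pretend this is a database query
    return {"_id": filter["_id"], "data": {}}

cache = EntityCache(find_one)
cache.get("asset_a")
cache.get("asset_a")  # reused, no extra query
print(len(calls))     # 1
```

Because the cache only lives for one operation, it never leaks stale documents into later queries, which is the point made above.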

davidlatwe commented 5 years ago

However, looking at your code, isn't the final result still as expected?

No no, not if you query with find({"data": {"foo": "bar", "subDiv": 1}}) and the previously inserted document was {"data": {"subDiv": 1, "foo": "bar"}}; you will get nothing in Python 3.6+. One must use {"data.foo": "bar", "data.subDiv": 1} to query to ensure the result.

So, if we can somehow just allow the cache lookup to understand it was the same dict (even if ordered differently, similar to how dict_a == dict_b)

Differently insertion-ordered dicts will still be equal in Python 3.6+ :D

Even more, it would just return the original dict as per the previous query, so the sorting from the database would still come through as expected. Correct?

You know what, you were right! Since it's cached, no matter how the nested fields' order changes during that io.cached_queries() context, if the inputs were equal then it will re-use the previous result from cache. As simple as that! I was probably stuck in a tree hole or... Anyway, thanks, and sorry about that :P


Why not use a stupidly simple dictionary to store them? A dictionary with _id as key and the entity document as value, so you can reuse them when necessary.

I think by caching on function args we won't need to first know what the _id was before we can cache the document, which means we could directly cache queries like find_one({"type": "project"}).

But I think it needs to be limited to caching only query functions like find or find_one. (Maybe this is a bit off topic now?)

BigRoy commented 5 years ago

No no, not if you query with find({"data": {"foo": "bar", "subDiv": 1}}) and the previously inserted document was {"data": {"subDiv": 1, "foo": "bar"}}; you will get nothing in Python 3.6+. One must use {"data.foo": "bar", "data.subDiv": 1} to query to ensure the result.

WHAT!? That's absurd! :) But then, it only happens when you query it explicitly as a nested dictionary (admittedly I had never even done that, lucky me!)

Anyway, good to know. But then it's actually correct that in that case it results in a "cache miss", so the cache itself will keep behaving as expected (given how pymongo behaves in that 3.6+ scenario).
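That "cache miss" conclusion can be demonstrated with a small sketch. The `cache_key` helper is hypothetical: it serializes the filter while preserving nested key order (which Python 3.7+ dicts and `json.dumps` both do by default), so equal-but-differently-ordered query dicts get different cache keys, i.e. a harmless false negative rather than a potentially wrong cache hit.

```python
import json

def cache_key(query):
    """Order-preserving cache key; deliberately no sort_keys."""
    return json.dumps(query, default=str)

a = {"data": {"foo": "bar", "subDiv": 1}}
b = {"data": {"subDiv": 1, "foo": "bar"}}
print(a == b)                        # True: equal dicts in Python
print(cache_key(a) == cache_key(b))  # False: different cache entries
```

So the second query simply falls through to the database, matching MongoDB's order-sensitive exact match on nested documents.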