aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

clean_value: test whether sufficient for serialization/deserialization #2124

Closed: dev-zero closed this issue 2 years ago

dev-zero commented 5 years ago

The idea of the clean_value method is to prepare data such that it is invariant under DB/JSON serialization/deserialization. This is important to ensure that the generated hash is the same when loading the data from the DB again.

dev-zero commented 5 years ago

see also #2038

sphuber commented 5 years ago

I worked a bit on this with @giovannipizzi tonight and pushed it to a branch. The first step we took was to check the behavior of the float truncation. We added some more tests that were failing with the default mask size of 4 bits. We increased it to 12 for now, but there is still one example that is failing. This most likely has to do with the representation of the floats in base 2 being peculiar; this needs further investigation. Tomorrow we will continue, and once truncation is well understood we can move on to finish the tests for clean_value to make sure that the (de)serialization round-trip yields identical hashes. A placeholder test with some initial test data has already been added and will have to be extended.
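
For reference, a minimal pure-Python sketch of what masking the lowest mantissa bits of a float64 looks like (illustrative only; the actual implementation is in aiida/common/hashing.py on that branch):

import struct

def truncate_float64(value, num_bits=12):
    # Zero the lowest num_bits of the IEEE-754 binary64 representation,
    # so values differing only in those bits hash identically.
    mask = ~((1 << num_bits) - 1)
    (int_repr,) = struct.unpack('<q', struct.pack('<d', value))
    (truncated,) = struct.unpack('<d', struct.pack('<q', int_repr & mask))
    return truncated

t = truncate_float64(1.6065105000000006)
(bits,) = struct.unpack('<q', struct.pack('<d', t))
assert bits & 0xFFF == 0  # the low 12 bits are now zero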

szoupanos commented 5 years ago

Thanks for this. I also looked at this a bit yesterday and I have some questions:

Where can I find all the commits related to this issue? Let's sync as soon as everyone is here to divide the work and close this issue fast ;)

dev-zero commented 5 years ago

@szoupanos for a numpy array the hash is calculated from the little-endian, row-major byte representation of its truncated (by bit mask) values. See https://github.com/aiidateam/aiida_core/blob/dfc6037fc0615219939547e602a98b48305ec73b/aiida/common/hashing.py#L328-L347

Yes, the hash should be the same; that is the point of the whole clean_value thing: this function should yield something that hashes to the same value as the result of deserializing it from the database. In mathematical terms: hash(clean_value(attribute)) == hash(from_db(to_db(clean_value(attribute)))), i.e. clean_value should map everything to a fixed point of from_db ∘ to_db.
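
As a small illustration of this fixed-point property (a sketch; to_db and from_db here are stand-ins for the real backend serialization):

import json

def to_db(value):
    # Stand-in for the backend serialization step.
    return json.dumps(value)

def from_db(payload):
    # Stand-in for the backend deserialization step.
    return json.loads(payload)

# A cleaned value must survive the round trip unchanged:
cleaned = {'coords': [1.0, 2.0], 'label': 'atom', 'n': 3}
assert cleaned == from_db(to_db(cleaned))

# A tuple does not survive: JSON silently turns it into a list,
# which is exactly the kind of conversion clean_value must normalize.
assert from_db(to_db({'coords': (1.0, 2.0)})) == {'coords': [1.0, 2.0]}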

Yes, there are likely more cases, depending on which libraries people use and whether those types end up in an attribute. They should be caught by the generic hashing for Mappings and Sequences, though, which should again be consistent (but possibly not as performant). The behavior of the serialization and deserialization might be a bit surprising, though (hence this issue), in that the types are silently converted to native Python types.

giovannipizzi commented 5 years ago

@szoupanos the commits (only one now) are in the branch pointed out by @sphuber

szoupanos commented 5 years ago

Just a small comment to say that a numpy array cannot be stored in the extras because it is not JSON-serializable.

E.g.

def test_types_numpy(self):
    from aiida.orm import Node, load_node
    import numpy

    nd = numpy.array([1, 2, 3])
    test_data = {
        'numpy': nd,
    }

    print("test_data ===> ", test_data)

    for key, value in test_data.items():
        with self.subTest(key=key):
            # Hash the node before storing it ...
            node = Node()
            node._set_attr(key, value)
            node_hash = node.get_hash()
            print("node_hash ===> ", node_hash)
            # ... then store, reload and hash again: the two hashes must match.
            node.store()
            node2 = load_node(node.uuid)
            node2_hash = node2.get_hash()
            print("node2_hash ===> ", node2_hash)
            self.assertEqual(node_hash, node2_hash)
(aiidapy3) aiida@ubuntu-aiida-vm1:~/aiida-code/aiida_core$ verdi -p test_dj_p3 devel tests db.clean_value
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>> Tests for django db application
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
test_data ===>  {'numpy': array([1, 2, 3])}
node_hash ===>  309c448af267f31f7b7857adc6b978fa09a705e6491717a4da15e94785a9aec3

======================================================================
ERROR: test_types_numpy (aiida.backends.tests.clean_value.TestCleanValueConsistency) (key='numpy')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/aiida/aiida-code/aiida_core/aiida/backends/djsite/db/models.py", line 852, in create_value
    jsondata = json.dumps(value)
  File "/usr/lib/python3.5/json/__init__.py", line 230, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.5/json/encoder.py", line 198, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.5/json/encoder.py", line 256, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.5/json/encoder.py", line 179, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 1 is not JSON serializable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aiida/aiida-code/aiida_core/aiida/backends/tests/clean_value.py", line 95, in test_types_numpy
    node.store()
  File "/home/aiida/aiida-code/aiida_core/aiida/orm/implementation/general/node.py", line 1693, in store
    self._db_store(with_transaction)
  File "/home/aiida/aiida-code/aiida_core/aiida/orm/implementation/django/node.py", line 650, in _db_store
    with_transaction=False)
  File "/home/aiida/aiida-code/aiida_core/aiida/backends/djsite/db/models.py", line 1141, in reset_values_for_node
    subspecifier_value=dbnode_node,
  File "/home/aiida/aiida-code/aiida_core/aiida/backends/djsite/db/models.py", line 828, in create_value
    subspecifier_value=subspecifier_value))
  File "/home/aiida/aiida-code/aiida_core/aiida/backends/djsite/db/models.py", line 854, in create_value
    raise ValueError("Unable to store the value: it must be "
ValueError: Unable to store the value: it must be either a basic datatype, or json-serializable

Of course, it can be stored after a .tolist() conversion. I just mention it because we had a discussion about values being 'silently' converted to lists by AiiDA when stored.
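
For example (a quick sketch, assuming numpy is available):

import json
import numpy

nd = numpy.array([1, 2, 3])
# json.dumps(nd) raises TypeError: it is not JSON-serializable.
# After .tolist() the values are native Python ints and round-trip fine:
assert json.loads(json.dumps(nd.tolist())) == [1, 2, 3]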

dev-zero commented 5 years ago

So the extras are not passed through clean_value then?

szoupanos commented 5 years ago

I also add here some notes that I kept after a discussion with @lekah regarding accuracy, truncation and hashing (it was about a different issue, the accuracy of floats in Django and SQLA and the correct hashing of values, but it is very related to the hashing of floats that we discuss here).

In a few words, it demonstrates why truncation is not enough to solve the accuracy problems: we need some kind of "rounding".

Example of problematic hashing (rounding problem): the same number as returned by the two backends.

SQLA
In [17]: a
Out[17]: 1.6065105000000006

In [24]: '{:.{precision}f}'.format(a, precision=5)
Out[24]: '1.60651'

In [18]: '{:.{precision}f}'.format(a, precision=6)
Out[18]: '1.606511'

In [19]: '{:.{precision}f}'.format(a, precision=7)
Out[19]: '1.6065105'

In [20]: '{:.{precision}f}'.format(a, precision=8)
Out[20]: '1.60651050'

In [21]: '{:.{precision}f}'.format(a, precision=14)
Out[21]: '1.60651050000000'

In [22]: '{:.{precision}f}'.format(a, precision=17)
Out[22]: '1.60651050000000062'

In [23]: '{:.{precision}f}'.format(a, precision=20)
Out[23]: '1.60651050000000061857'

Django
In [18]: a
Out[18]: 1.6065105

In [25]: '{:.{precision}f}'.format(a, precision=5)
Out[25]: '1.60651'

In [19]: '{:.{precision}f}'.format(a, precision=6)
Out[19]: '1.606510'

In [20]: '{:.{precision}f}'.format(a, precision=7)
Out[20]: '1.6065105'

In [21]: '{:.{precision}f}'.format(a, precision=8)
Out[21]: '1.60651050'

In [22]: '{:.{precision}f}'.format(a, precision=14)
Out[22]: '1.60651050000000'

In [23]: '{:.{precision}f}'.format(a, precision=17)
Out[23]: '1.60651049999999995'

In [24]: '{:.{precision}f}'.format(a, precision=20)
Out[24]: '1.60651049999999995244'

szoupanos commented 5 years ago

I don't know if the Decimal module could help to round the number before hashing: https://blog.udemy.com/python-round/
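
A sketch of what that could look like (round_for_hashing is a hypothetical helper; as the discussion below shows, any fixed rounding still has boundary cases):

from decimal import ROUND_HALF_EVEN, Decimal

def round_for_hashing(value, digits=12):
    # Fix the number of decimal digits before hashing.
    quantum = Decimal(1).scaleb(-digits)
    return str(Decimal(repr(value)).quantize(quantum, rounding=ROUND_HALF_EVEN))

# The SQLA and Django values from above collapse to the same string:
assert round_for_hashing(1.6065105000000006) == round_for_hashing(1.6065105)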

giovannipizzi commented 5 years ago

@dev-zero @szoupanos

so, the extras are not passed through clean_value then

Well, Spyros said extras, but I think he meant attributes (judging from the test, and from the fact that extras are not hashed). The questions (maybe @szoupanos you can investigate?) are:

  1. why is there an exception inside another exception?
  2. was clean_value called?
  3. why do we get an error from the ORM and not from clean_value?

giovannipizzi commented 5 years ago

I don't know if the Decimal module could help to round the number before hashing https://blog.udemy.com/python-round/

Interesting module! Another possible idea I had is to represent the number as a fraction of integers (the standard-library fractions module does this) with some specified rules, e.g. the precision to which you want to approximate it, and hash the two integers rather than the float.

It might work, but I'm not 100% sure; what do you think?

The reason I'm a bit skeptical is mainly that these modules might be useful if we were to stay inside Python, but we then have a serialization/deserialization step that is almost completely out of our control (and that might change over time; we might have more of them in the future, ...), so some careful thought is needed to see whether these approaches solve our problem.
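
For concreteness, a minimal sketch of the fraction idea (fraction_digest is a hypothetical helper; the bound on the denominator plays the role of the precision rule):

import hashlib
from fractions import Fraction

def fraction_digest(value, max_denominator=10**6):
    # Approximate the float by the closest fraction with a bounded
    # denominator, then hash the two integers instead of the float.
    frac = Fraction(value).limit_denominator(max_denominator)
    payload = '{}/{}'.format(frac.numerator, frac.denominator).encode()
    return hashlib.sha256(payload).hexdigest()

# 1.6065105 == 3213021/2000000 exceeds the denominator bound, so both
# floats fall back to the same coarser approximation and the
# serialization noise is absorbed:
assert fraction_digest(1.6065105000000006) == fraction_digest(1.6065105)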

Also, as a side note: of course we want correctness first, but let's keep an eye on the cost of computing these hashes at some point.

szoupanos commented 5 years ago

I was thinking a bit today about this and I think that rounding will always (?) have an issue, which I don't think is such a big deal: we will just have a few hash misses for some float keys (unless a key consists of a thousand floats).

But let’s continue. What we do is the following:

a1 (Python level) -> a2 (DB level) -> a3 (Python level)

And we want a1 to be equal to a3 by doing some rounding. However, whatever we do, it will be wrong in corner cases. I tend to believe that what we saw in the truncation example shown above we will also see with other rounding methods, because they share the same flaw: they try to give you an exact number, where the small noise introduced at a2 may be enough to change the result.

What we would like is abs(a1 - a3) < e. However, since we have a hash table and a hash function, we cannot do range comparisons. What we can do is hash the limits of this range (I think Giovanni mentioned something like this, or exactly this, in the previous days).

E.g. imagine a number n1 which becomes n3.

We could hash n1 under both ceil(n1) and floor(n1), and then look up with ceil(n3) and floor(n3). We hope that the error e introduced at the DB level will not be large enough to make the ranges [floor(n1), ceil(n1)] and [floor(n3), ceil(n3)] fail to overlap.

With the right precision for the ceil and floor methods we should be OK. (Of course, when I mention ceil and floor above, I mean methods that round a real number up or down to a predefined number of decimal digits.)

The problem with this approach is that we double the key entries and the lookups; see the sketch below.
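
A minimal sketch of this bracketing idea (bracket_keys is a hypothetical helper; the number of digits is arbitrary):

import math

def bracket_keys(value, digits=8):
    # Floor and ceiling of the value at a fixed number of decimal digits;
    # a value is stored under both keys and looked up with both keys.
    scale = 10 ** digits
    return (math.floor(value * scale), math.ceil(value * scale))

# The brackets of n1 and n3 overlap as long as the round-trip error is
# smaller than one unit in the last kept digit:
n1, n3 = 1.6065105, 1.6065105000000006
assert set(bracket_keys(n1)) & set(bracket_keys(n3))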

Ideas? Comments?

sphuber commented 5 years ago

You would double the number of hashes for each node with a float. However, if I understand correctly, if you have a ParameterData with multiple floats, wouldn't you get a combinatorial number of hashes? In any case, I think the point by @giovannipizzi that we should not forget why we are trying to do this is an important one. The goal of caching is to make life easier; if it is not perfect when it comes to floats, and users are aware of this, maybe that is an acceptable compromise, given that a full solution would be too expensive in other terms.

szoupanos commented 5 years ago

I agree. Maybe it is better to sit down for 10 minutes and discuss how the whole hashing and caching mechanism works and what we want to achieve (the last part is clear enough).

To be more precise: hashing, and a hash table in general, allows you to "jump fast" to a bucket of probable results. This means that if a match exists, a correct hash function should point you to the bucket where you will then search for it in any way you want (sequentially or by any other means). So h(x) = 1 is a correct, but totally inefficient, hash function.

Having said the above, my understanding is that we don't have to hash every parameter of a searched object. We just have to hash those parameters that will allow us to find the same result if it exists (and we should avoid parameters that will lead us to no results when such results exist, e.g. the float example, at least in its simple form).

Apart from hashing, there should be an equality-check mechanism, which is the one that should do the "dirty" and difficult job.
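
A toy sketch of this split between bucket lookup and equality checking (purely illustrative):

table = {}

def put(key, value, h=hash):
    # The hash function only decides which bucket the entry lands in.
    table.setdefault(h(key), []).append((key, value))

def lookup(key, h=hash):
    # The equality check inside the bucket does the "dirty", difficult job.
    return [v for k, v in table.get(h(key), []) if k == key]

# h(x) = 1 would still be correct here, just degenerate: one big bucket.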

To sum up: the hashing mechanism should be correct and "efficient enough" (so not every parameter has to be hashed), and the final equality check should be complete and correct. Since I am not an expert on how the caching is implemented, let's chat a bit today so we understand better what we need to solve.

sphuber commented 5 years ago

After discussion during the group meeting: this issue was not introduced by the Python 3 compatibility work and it is currently blocking the finalization of that project. I am removing this issue from that project.

giovannipizzi commented 5 years ago

Please also check the comments in issue #586 about precision in the different backends (I'm closing that issue, but the relevant info is in there).

giovannipizzi commented 5 years ago

@sphuber have we lost the branch?

sphuber commented 2 years ago

Since we haven't heard of any problems related to this, I am going to close it.