MarshalX / python-libipld

🏎️ Fast Python library to work with IPLD: DAG-CBOR, CID, CAR, multibase
https://pypi.org/project/libipld/
MIT License
17 stars 2 forks source link

Duplucate and non-canonically ordered map keys are accepted #15

Closed DavidBuchanan314 closed 8 months ago

DavidBuchanan314 commented 8 months ago

Here are 3 test cases:

import libipld

# equivalent to {"abc": 1, "abc": 2} (duplicate key)
print(libipld.decode_dag_cbor(bytes.fromhex("a263616263016361626302")))

# equivalent to {"def": 1, "abc": 2} (same key lengths, wrong sort order)
print(libipld.decode_dag_cbor(bytes.fromhex("a263646566016361626302")))

# equivalent to {"aaa": 1, "x": 2} (different key lengths, wrong sort order)
print(libipld.decode_dag_cbor(bytes.fromhex("a26361616101617802")))

They all decode successfully, but they should fail, per the spec. (hashberg's dag_cbor, and my own cbrrr, fail as expected)

The first is equivalent to the negative test case you can find here https://github.com/ipld/codec-fixtures/blob/d93d0de6eff0cf2c2efff010c6b2a79aaeb7d100/negative-fixtures/dag-cbor/decode/duplicate-keys.json (I should probably PR the other two test cases there!)

I believe this is an upstream issue with https://github.com/ipld/serde_ipld_dagcbor but I don't know enough rust to make a neat test case for it, so I'm hoping you can help me move this up the chain :)

MarshalX commented 8 months ago

yeeeah. i recently read specification of dag-cbor. the order of keys it not strict anymore in maps. but you are right the error must be throw on key duplication. i have todo in the code for it xd

https://github.com/MarshalX/python-libipld/blob/59cf00003b3002c2c9f098f5223e406470f1f638/src/lib.rs#L82

moreover my current impl allows map keys not only for strings types. i'll change it soon too. and i hope it will give some small perf boost too

MarshalX commented 8 months ago

btw i have in my todo list adding of https://github.com/ipld/codec-fixtures/tree/master/fixtures for the tests. i already have some from popular json benchmarks https://github.com/MarshalX/python-libipld/tree/main/data

DavidBuchanan314 commented 8 months ago

I'm pretty sure key ordering is still strict? https://ipld.io/specs/codecs/dag-cbor/spec/#strictness

"The keys in every map must be sorted length-first by the byte representation of the string keys"

MarshalX commented 8 months ago

I'm pretty sure key ordering is still strict? https://ipld.io/specs/codecs/dag-cbor/spec/#strictness

"The keys in every map must be sorted length-first by the byte representation of the string keys"

ah. i was confused because of ~different stricness for encoding and decoding. for decoding it's fine to not keep order.~ thank you!

i was confused because of

Map key ordering: map entries may be accepted in any order

DavidBuchanan314 commented 8 months ago

Ah good point, seems like a "non-strict" decoder is allowed to allow them - I think we should be strict by default in the atproto-verse :)

MarshalX commented 8 months ago

mb i will add "strict" bool for decoding

DavidBuchanan314 commented 8 months ago

Hm it looks like libipld.encode_dag_cbor() is also non-strict with the order of map keys it emits

import libipld
import dag_cbor

encoded = libipld.encode_dag_cbor({'aaa': 1, 'x': 2}) # b'\xa2caaa\x01ax\x02'
dag_cbor.decode(encoded) # raises dag_cbor.decoding.err.DAGCBORDecodingError
MarshalX commented 8 months ago

you found all my TODOs xd

MarshalX commented 8 months ago

do you check "duplicate key" on encode? fixed in decode:

DavidBuchanan314 commented 8 months ago

I don't check on encode because (as far as I know) it's impossible for a python dict to have duplicate keys in the first place.

It's plausible that edge-cases exist where you have a str and a subclass of str that compares non-equal but has equivalent str result (I don't think I allow str subclasses tho)

DavidBuchanan314 commented 8 months ago
import json
import dag_cbor
import cbrrr
import libipld

class not_str(str):
    def __eq__(self, __value: object) -> bool:
        return False

    def __hash__(self) -> int:
        return super().__hash__()

hello = "hello"
not_hello = not_str("hello")

foo = {hello: "a", not_hello: "b"}

# !
print(foo) # {'hello': 'a', 'hello': 'b'}

# !!
print(json.dumps(foo)) # {"hello": "a", "hello": "b"}

# !!!
print(dag_cbor.encode(foo)) # b'\xa2ehelloaaehelloab'

# !!!
print(libipld.encode_dag_cbor(foo)) # b'\xa2ehelloaaehelloab'

# :sunglasses emoji:
try:
    print(cbrrr.encode_dag_cbor(foo))
    assert(not "reachable")
except TypeError:
    pass

To be honest this is such a contrived test case that it's probably not worth caring about, but for the record, it is possible to create a dict with pseudo-duplicate keys

MarshalX commented 8 months ago

fix for keys order:

MarshalX commented 8 months ago

fixed in v1.2.2. published on pypi already. thank you!! btw it worth to rerun benchmarks cuz perf was affected too