fastavro / fastavro

Fast Avro for Python
MIT License

Fastavro fails to parse schema that is a top-level named schema in a dict #804

Open hhoeflin opened 6 days ago

hhoeflin commented 6 days ago

I discovered that fastavro fails to parse schemas of the form

{'type': 'test.point'}

or

{'type': 'test.point', 'logicalType': 'LogicalPoint'}

In both cases an error is thrown saying the schema can't be loaded. Top-level logical types with primitive types, however, do work. The error seems to be in _read_schema: in the dict branch for a schema, the type is never checked against the known named schemas. According to the spec, however, I thought the above should be allowed?

Below is code that tries several other variants, some of which work:

import fastavro

point = dict(
    type="record",
    name="point",
    namespace="test",
    doc="A point",
    fields=[dict(name="x", type="float"), dict(name="y", type="float")],
)

logical_with_primitive = dict(type="float", logicalType="LogicalFloat")
logical_with_point = dict(type="test.point", logicalType="LogicalPoint")
nested_point = dict(
    type="record",
    name="nested_point",
    fields=[dict(name="nested", type="test.point")],
)
wrapped_logical_point = dict(type=logical_with_point)
point_as_type = dict(type="test.point")
just_point = "test.point"

if __name__ == "__main__":
    named_schemas = {}
    for schema in [
        point,
        logical_with_primitive,
        logical_with_point,
        nested_point,
        wrapped_logical_point,
        point_as_type,
        just_point,
    ]:
        print(f"Schema: {schema}")
        try:
            fastavro.parse_schema(schema, named_schemas)
        except Exception:
            print("Failed to load schema")
        else:
            print("Schema loaded successfully")

with results:

Schema: {'type': 'record', 'name': 'point', 'namespace': 'test', 'doc': 'A point', 'fields': [{'name': 'x', 'type': 'float'}, {'name': 'y', 'type': 'float'}]}
Schema loaded successfully
Schema: {'type': 'float', 'logicalType': 'LogicalFloat'}
Schema loaded successfully
Schema: {'type': 'test.point', 'logicalType': 'LogicalPoint'}
Failed to load schema
Schema: {'type': 'record', 'name': 'nested_point', 'fields': [{'name': 'nested', 'type': 'test.point'}]}
Schema loaded successfully
Schema: {'type': {'type': 'test.point', 'logicalType': 'LogicalPoint'}}
Failed to load schema
Schema: {'type': 'test.point'}
Failed to load schema
Schema: test.point
Schema loaded successfully
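Until this is fixed in fastavro, one possible workaround is to normalize such wrappers before calling parse_schema. Below is a minimal sketch (unwrap_named_refs is a hypothetical helper, not part of fastavro's API, and the keyword list is my own assumption); note that it simply drops a logicalType attached to a named-type wrapper, so that information is lost:

```python
# Sketch of a pre-processing workaround (hypothetical helper, not fastavro API):
# rewrite {'type': '<named-ref>'} wrappers into the bare reference string
# before handing the schema to fastavro.parse_schema.

AVRO_KEYWORDS = {
    # primitive types
    "null", "boolean", "int", "long", "float", "double", "bytes", "string",
    # complex types
    "record", "enum", "array", "map", "fixed", "error",
}

def unwrap_named_refs(schema):
    """Recursively replace {'type': 'ns.Name'} with the bare 'ns.Name' string."""
    if isinstance(schema, list):  # union
        return [unwrap_named_refs(s) for s in schema]
    if isinstance(schema, dict):
        t = schema.get("type")
        # A dict that only wraps a reference to a named schema
        # (its type is a string but not an Avro keyword): unwrap it.
        # Caveat: a logicalType on the wrapper is dropped here.
        if (isinstance(t, str) and t not in AVRO_KEYWORDS
                and set(schema) <= {"type", "logicalType"}):
            return t
        out = dict(schema)
        if "type" in out:
            out["type"] = unwrap_named_refs(out["type"])
        for key in ("items", "values"):
            if key in out:
                out[key] = unwrap_named_refs(out[key])
        if isinstance(out.get("fields"), list):
            out["fields"] = [unwrap_named_refs(f) for f in out["fields"]]
        return out
    return schema
```

With this, {'type': 'test.point'} becomes the plain string 'test.point', which fastavro already accepts (the just_point case above).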
hhoeflin commented 6 days ago

I think this is also related to issue #774: the array type parsed there also presents as a dict with a named type, so '_parse_schema' would be called on it again and fail to recognize it.

hhoeflin commented 5 days ago

Looking into the code, I have the feeling that solving this and other tickets would require a much deeper refactoring of the schema parser. Is that something that is on your roadmap? Or are you open to a PR, even if that PR changes quite a lot of things? Also, much of the schema parsing is currently in Cython; is that performance-critical in your experience?

A design goal of such a redesign, for me, would be that the changes should in the end allow, for example, loading supporting schemas in any order.
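As an illustration of what order-independent loading could look like (a rough sketch with hypothetical names, not fastavro code): a first pass registers every named definition, and a second pass checks that all references resolve, so a schema referencing test.point can be loaded before point itself is defined:

```python
# Illustrative two-pass loading sketch (hypothetical, not fastavro code):
# pass 1 registers every named definition, pass 2 checks that all
# references resolve, so input order no longer matters.

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

def fullname(schema):
    """Qualified name of a named schema (simplified namespace handling)."""
    name = schema["name"]
    ns = schema.get("namespace")
    return f"{ns}.{name}" if ns and "." not in name else name

def collect_names(schema, registry):
    """Pass 1: walk a schema and record every named type it defines."""
    if isinstance(schema, dict):
        if schema.get("type") in ("record", "enum", "fixed"):
            registry[fullname(schema)] = schema
        for field in schema.get("fields", []):
            collect_names(field["type"], registry)
        for key in ("items", "values"):
            if key in schema:
                collect_names(schema[key], registry)
    elif isinstance(schema, list):  # union
        for s in schema:
            collect_names(s, registry)

def unresolved_refs(schema, registry, acc=None):
    """Pass 2: collect references that name no known definition."""
    if acc is None:
        acc = set()
    if isinstance(schema, str):
        if schema not in PRIMITIVES and schema not in registry:
            acc.add(schema)
    elif isinstance(schema, list):
        for s in schema:
            unresolved_refs(s, registry, acc)
    elif isinstance(schema, dict):
        for field in schema.get("fields", []):
            unresolved_refs(field["type"], registry, acc)
        for key in ("items", "values"):
            if key in schema:
                unresolved_refs(schema[key], registry, acc)
    return acc
```

After pass 1 has seen all input schemas, pass 2 either reports no unresolved references (regardless of the order the schemas arrived in) or lists exactly which named types are still missing.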

hhoeflin commented 2 days ago

Hi, I wrote some improvements for the parser:

https://github.com/hhoeflin/fastavro/blob/feature/parser/fastavro/parse_new.py

It allows for parsing, cleaning, decomposing, and reassembling a schema. Because it does so without throwing errors on missing schemas, it is also much easier to include Repository logic here.

If you have time to take a look, any feedback on whether something like this is interesting for the fastavro project would be great.

Thanks

hhoeflin commented 2 days ago

Also, here is a speed comparison of reading from a file pointer (BytesIO) versus reading directly from a byte array while keeping the read location as an integer. The version reading from the byte array is about 50 times faster.

https://github.com/hhoeflin/fastavro/blob/feature/speed_compare/fastavro/bytes_speed_compare.pyx

python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_cdef(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytereader(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytereader_single(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytesio(bsc.lorem_ipsum)"

On my machine, the variants reading from a bytes array directly are much faster. In your code, you are using cpdef and reading from fo everywhere; was that mostly for convenience? When reading a whole file block, the data would be available as a bytes object already.
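The effect can also be reproduced in plain Python, without Cython (a minimal sketch of the same kind of comparison; in pure Python the gap is smaller than the roughly 50x measured in the Cython version, but the direction is the same):

```python
# Minimal illustration: count occurrences of b"l" by reading one byte at a
# time through a BytesIO file object versus indexing directly into the bytes
# object while tracking the position as an integer.
import io
import timeit

DATA = b"lorem ipsum dolor sit amet " * 1000

def count_ls_bytesio(data):
    """Read one byte at a time through a file-like object."""
    fo = io.BytesIO(data)
    count = 0
    while (b := fo.read(1)):
        if b == b"l":
            count += 1
    return count

def count_ls_bytes(data):
    """Index directly into the bytes object, keeping the position manually."""
    count = 0
    pos = 0
    n = len(data)
    while pos < n:
        if data[pos] == ord("l"):
            count += 1
        pos += 1
    return count

if __name__ == "__main__":
    assert count_ls_bytesio(DATA) == count_ls_bytes(DATA)
    for fn in (count_ls_bytesio, count_ls_bytes):
        t = timeit.timeit(lambda: fn(DATA), number=10)
        print(f"{fn.__name__}: {t:.4f}s")
```

Each BytesIO.read(1) call pays Python-level call overhead and allocates a one-byte object, while direct indexing into an in-memory bytes object does neither, which is what makes the byte-array approach attractive once a whole file block has been read.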

scottbelden commented 16 hours ago

Thanks for all the detail here. I haven't had time to look through everything, but in general:

If you already have some changes, feel free to make the PR.

hhoeflin commented 16 hours ago

Thanks for your reply. One thing I am really not sure about, after going back over the specs and looking at the avro Python package, is the following:

Is

{ "type": { "type": "array", "items": "string"}, "logicalType": "mytype"}

a valid schema? Or is it invalid, so that it has to be written as

 { "type": "array", "items": "string", "logicalType": "mytype"}

If the first is invalid, this would suggest that the schemas at the top of this issue,

{'type': 'test.point'}

or

{'type': 'test.point', 'logicalType': 'LogicalPoint'}

are also syntactically invalid. The spec says that {'type': 'int'} is the same as 'int', but mentions this explicitly only for primitives.

Anyway, until I understand what is actually correct, I am not sure it makes sense to move forward with an implementation that may in fact accept invalid schemas.

Thanks for having a look though.