fastavro / fastavro

Fast Avro for Python
MIT License

Fastavro fails to parse schema that is a top-level named schema in a dict #804

Open hhoeflin opened 6 days ago

hhoeflin commented 6 days ago

I discovered that fastavro fails to parse schemas of the form

{'type': 'test.point'}

or

{'type': 'test.point', 'logicalType': 'LogicalPoint'}

In both cases an error is thrown saying the schema can't be loaded. Top-level logical types with primitive types, however, do work. The error seems to be in _read_schema: in the dict branch for a schema, the type is never checked against the known named schemas. According to the spec, however, I thought the above should be allowed?

Below is code that tries several other variants, some of which work:

import fastavro

point = dict(
    type="record",
    name="point",
    namespace="test",
    doc="A point",
    fields=[dict(name="x", type="float"), dict(name="y", type="float")],
)

logical_with_primitive = dict(type="float", logicalType="LogicalFloat")
logical_with_point = dict(type="test.point", logicalType="LogicalPoint")
nested_point = dict(
    type="record",
    name="nested_point",
    fields=[dict(name="nested", type="test.point")],
)
wrapped_logical_point = dict(type=logical_with_point)
point_as_type = dict(type="test.point")
just_point = "test.point"

if __name__ == "__main__":
    named_schemas = {}
    for schema in [
        point,
        logical_with_primitive,
        logical_with_point,
        nested_point,
        wrapped_logical_point,
        point_as_type,
        just_point,
    ]:
        print(f"Schema: {schema}")
        try:
            fastavro.parse_schema(schema, named_schemas)
        except Exception:
            print("Failed to load schema")
        else:
            print("Schema loaded successfully")

with results:

Schema: {'type': 'record', 'name': 'point', 'namespace': 'test', 'doc': 'A point', 'fields': [{'name': 'x', 'type': 'float'}, {'name': 'y', 'type': 'float'}]}
Schema loaded successfully
Schema: {'type': 'float', 'logicalType': 'LogicalFloat'}
Schema loaded successfully
Schema: {'type': 'test.point', 'logicalType': 'LogicalPoint'}
Failed to load schema
Schema: {'type': 'record', 'name': 'nested_point', 'fields': [{'name': 'nested', 'type': 'test.point'}]}
Schema loaded successfully
Schema: {'type': {'type': 'test.point', 'logicalType': 'LogicalPoint'}}
Failed to load schema
Schema: {'type': 'test.point'}
Failed to load schema
Schema: test.point
Schema loaded successfully
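Until this is fixed in fastavro, one possible workaround is to normalize such wrappers before calling parse_schema. Below is a minimal sketch (unwrap_named_refs is a hypothetical helper, not part of fastavro's API, and the keyword list is my own assumption); note that it simply drops a logicalType attached to a named-type wrapper, so that information is lost:

```python
# Sketch of a pre-processing workaround (hypothetical helper, not fastavro API):
# rewrite {'type': '<named-ref>'} wrappers into the bare reference string
# before handing the schema to fastavro.parse_schema.

AVRO_KEYWORDS = {
    # primitive types
    "null", "boolean", "int", "long", "float", "double", "bytes", "string",
    # complex types
    "record", "enum", "array", "map", "fixed", "error",
}

def unwrap_named_refs(schema):
    """Recursively replace {'type': 'ns.Name'} with the bare 'ns.Name' string."""
    if isinstance(schema, list):  # union
        return [unwrap_named_refs(s) for s in schema]
    if isinstance(schema, dict):
        t = schema.get("type")
        # A dict that only wraps a reference to a named schema
        # (its type is a string but not an Avro keyword): unwrap it.
        # Caveat: a logicalType on the wrapper is dropped here.
        if (isinstance(t, str) and t not in AVRO_KEYWORDS
                and set(schema) <= {"type", "logicalType"}):
            return t
        out = dict(schema)
        if "type" in out:
            out["type"] = unwrap_named_refs(out["type"])
        for key in ("items", "values"):
            if key in out:
                out[key] = unwrap_named_refs(out[key])
        if isinstance(out.get("fields"), list):
            out["fields"] = [unwrap_named_refs(f) for f in out["fields"]]
        return out
    return schema
```

With this, {'type': 'test.point'} becomes the plain string 'test.point', which fastavro already accepts (the just_point case above).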
hhoeflin commented 6 days ago

I think this is also related to issue #774: the array type parsed there also presents as a dict with a named type, so '_parse_schema' would be called on it again and fail to recognize it.

hhoeflin commented 5 days ago

Looking into the code, I have the feeling that solving this and other tickets would require a much deeper refactoring of the schema parser. Is that something that is on your roadmap? Or are you open to a PR, even if that PR changes quite a lot of things? Also, much of the schema parsing is currently in Cython; is that performance-critical in your experience?

A design goal of such a redesign, for me, would be that the changes should in the end allow, for example, loading supporting schemas in any order.
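As an illustration of what order-independent loading could look like (a rough sketch with hypothetical names, not fastavro code): a first pass registers every named definition, and a second pass checks that all references resolve, so a schema referencing test.point can be loaded before point itself is defined:

```python
# Illustrative two-pass loading sketch (hypothetical, not fastavro code):
# pass 1 registers every named definition, pass 2 checks that all
# references resolve, so input order no longer matters.

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

def fullname(schema):
    """Qualified name of a named schema (simplified namespace handling)."""
    name = schema["name"]
    ns = schema.get("namespace")
    return f"{ns}.{name}" if ns and "." not in name else name

def collect_names(schema, registry):
    """Pass 1: walk a schema and record every named type it defines."""
    if isinstance(schema, dict):
        if schema.get("type") in ("record", "enum", "fixed"):
            registry[fullname(schema)] = schema
        for field in schema.get("fields", []):
            collect_names(field["type"], registry)
        for key in ("items", "values"):
            if key in schema:
                collect_names(schema[key], registry)
    elif isinstance(schema, list):  # union
        for s in schema:
            collect_names(s, registry)

def unresolved_refs(schema, registry, acc=None):
    """Pass 2: collect references that name no known definition."""
    if acc is None:
        acc = set()
    if isinstance(schema, str):
        if schema not in PRIMITIVES and schema not in registry:
            acc.add(schema)
    elif isinstance(schema, list):
        for s in schema:
            unresolved_refs(s, registry, acc)
    elif isinstance(schema, dict):
        for field in schema.get("fields", []):
            unresolved_refs(field["type"], registry, acc)
        for key in ("items", "values"):
            if key in schema:
                unresolved_refs(schema[key], registry, acc)
    return acc
```

After pass 1 has seen all input schemas, pass 2 either reports no unresolved references (regardless of the order the schemas arrived in) or lists exactly which named types are still missing.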

hhoeflin commented 2 days ago

Hi, I wrote some improvements for the parser:

https://github.com/hhoeflin/fastavro/blob/feature/parser/fastavro/parse_new.py

It allows for parsing, cleaning, decomposing, and reassembling a schema. Because it does so without throwing errors on missing schemas, it is also much easier to include Repository logic here.

If you have time to take a look, any feedback on whether something like this is interesting for the fastavro project would be great.

Thanks

hhoeflin commented 2 days ago

Also, here is a speed comparison of reading from a file pointer (BytesIO) versus reading directly from a byte array while keeping the read location as an integer. The version reading from the byte array is about 50 times faster.

https://github.com/hhoeflin/fastavro/blob/feature/speed_compare/fastavro/bytes_speed_compare.pyx

python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_cdef(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytereader(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytereader_single(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytesio(bsc.lorem_ipsum)"

On my machine, the variants reading from a bytes array directly are much faster. In your code, you are using cpdef and reading from fo everywhere; was that mostly for convenience? When reading a whole file block, the data would be available as a bytes object already.
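The effect can also be reproduced in plain Python, without Cython (a minimal sketch of the same kind of comparison; in pure Python the gap is smaller than the roughly 50x measured in the Cython version, but the direction is the same):

```python
# Minimal illustration: count occurrences of b"l" by reading one byte at a
# time through a BytesIO file object versus indexing directly into the bytes
# object while tracking the position as an integer.
import io
import timeit

DATA = b"lorem ipsum dolor sit amet " * 1000

def count_ls_bytesio(data):
    """Read one byte at a time through a file-like object."""
    fo = io.BytesIO(data)
    count = 0
    while (b := fo.read(1)):
        if b == b"l":
            count += 1
    return count

def count_ls_bytes(data):
    """Index directly into the bytes object, keeping the position manually."""
    count = 0
    pos = 0
    n = len(data)
    while pos < n:
        if data[pos] == ord("l"):
            count += 1
        pos += 1
    return count

if __name__ == "__main__":
    assert count_ls_bytesio(DATA) == count_ls_bytes(DATA)
    for fn in (count_ls_bytesio, count_ls_bytes):
        t = timeit.timeit(lambda: fn(DATA), number=10)
        print(f"{fn.__name__}: {t:.4f}s")
```

Each BytesIO.read(1) call pays Python-level call overhead and allocates a one-byte object, while direct indexing into an in-memory bytes object does neither, which is what makes the byte-array approach attractive once a whole file block has been read.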

scottbelden commented 16 hours ago

Thanks for all the detail here. I haven't had time to look through everything, but in general:

If you already have some changes, feel free to make the PR.

hhoeflin commented 16 hours ago

Thanks for your reply. One thing I am really not sure about, after going back over the specs and looking at the avro Python package, is the following:

Is

{ "type": { "type": "array", "items": "string"}, "logicalType": "mytype"}

a valid schema? Or is it invalid, so that it has to be written as

 { "type": "array", "items": "string", "logicalType": "mytype"}

If the first is invalid, this would suggest that the schemas at the top of this issue,

{'type': 'test.point'}

or

{'type': 'test.point', 'logicalType': 'LogicalPoint'}

are also syntactically invalid. The spec says that {'type': 'int'} is the same as 'int', but mentions this explicitly only for primitives.

Anyway, until I understand what is actually correct, I am not sure it makes sense to move forward with an implementation that may in fact accept invalid schemas.

Thanks for having a look though.