hhoeflin opened 6 days ago
I think this is also related to issue #774, since the array type parsed there also presents as a dict with a named type; `_parse_schema` is then called again and fails to recognize it.
Looking into the code, I have the feeling that solving this and other tickets would require a much deeper refactoring of the schema parser. Is that something on your roadmap? Or are you open to a PR, even if that PR changes quite a lot of things? Also, much of the schema parsing is currently in Cython; is that performance critical in your experience?
My design goals for a redesign would be:
Such changes should, in the end, allow for example loading supporting schemas in any order.
Hi, I wrote some improvements for the parser:
https://github.com/hhoeflin/fastavro/blob/feature/parser/fastavro/parse_new.py
It allows for parsing, cleaning, decomposing and reassembling a schema. Since it does so without throwing errors on missing schemas, it is also much easier to include repository logic here.
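To illustrate the decompose step, here is a minimal sketch (a hypothetical helper of my own, not the actual `parse_new.py` API) that collects the named types defined in a schema without failing on unresolved references:

```python
def extract_named_schemas(schema, namespace="", found=None):
    """Collect all named types defined in a schema, keyed by full name.

    Unknown type strings are treated as references to schemas defined
    elsewhere and are deliberately NOT an error, so supporting schemas
    can be loaded in any order.
    """
    if found is None:
        found = {}
    if isinstance(schema, list):  # union
        for branch in schema:
            extract_named_schemas(branch, namespace, found)
    elif isinstance(schema, dict):
        schema_type = schema.get("type")
        if schema_type in ("record", "error", "enum", "fixed"):
            ns = schema.get("namespace", namespace)
            fullname = f"{ns}.{schema['name']}" if ns else schema["name"]
            found[fullname] = schema
            for field in schema.get("fields", []):
                extract_named_schemas(field["type"], ns, found)
        elif schema_type == "array":
            extract_named_schemas(schema["items"], namespace, found)
        elif schema_type == "map":
            extract_named_schemas(schema["values"], namespace, found)
        else:
            # nested dict schema, primitive, or named reference
            extract_named_schemas(schema_type, namespace, found)
    return found

schema = {
    "type": "record", "name": "point", "namespace": "test",
    "fields": [
        {"name": "x", "type": "int"},
        {"name": "next", "type": "test.point"},  # unresolved reference: fine
        {"name": "tags", "type": {"type": "array", "items": "string"}},
    ],
}
```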
If you have time to take a look, any feedback if something like this is interesting for the fastavro project would be great.
Thanks
I also ran a speed comparison of reading from a file pointer (BytesIO) versus reading directly from a bytes array while tracking the read position as an integer. The bytes-array version is about 50 times faster.
https://github.com/hhoeflin/fastavro/blob/feature/speed_compare/fastavro/bytes_speed_compare.pyx
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_cdef(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytereader(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytereader_single(bsc.lorem_ipsum)"
python -m timeit -s "import fastavro.bytes_speed_compare as bsc" "bsc.py_count_ls_bytesio(bsc.lorem_ipsum)"
On my machine I get
So reading from a bytes array directly is much faster. In your code, you are using `cpdef` and reading from `fo` everywhere. Was that mostly for convenience? When reading a whole file block, the data would already be available as a bytes object.
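The comparison can be sketched in plain Python (function names are mine, not those in `bytes_speed_compare.pyx`; the quoted 50x gap is for the Cython version, so the pure-Python ratio will differ):

```python
import io
import timeit

data = b"lorem ipsum dolor sit amet " * 1000

def count_ls_bytesio(buf):
    # byte-at-a-time reads through a BytesIO file object
    fo = io.BytesIO(buf)
    n = 0
    while True:
        c = fo.read(1)
        if not c:
            break
        if c == b"l":
            n += 1
    return n

def count_ls_bytes(buf):
    # index directly into the bytes object, tracking the read
    # position as a plain integer
    n = 0
    pos = 0
    end = len(buf)
    while pos < end:
        if buf[pos] == 0x6C:  # ord("l")
            n += 1
        pos += 1
    return n

assert count_ls_bytesio(data) == count_ls_bytes(data)
t_io = timeit.timeit(lambda: count_ls_bytesio(data), number=20)
t_bytes = timeit.timeit(lambda: count_ls_bytes(data), number=20)
print(f"BytesIO: {t_io:.3f}s  bytes: {t_bytes:.3f}s")
```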
Thanks for all the detail here. I haven't had time to look through everything yet, but in general:
If you already have some changes, feel free to make the PR.
Thanks for your reply. One thing I am really not sure about, after going back over the spec and looking at the avro Python package, is the following:
Is `{ "type": { "type": "array", "items": "string"}, "logicalType": "mytype"}` a valid schema? Or is it invalid and has to be `{ "type": "array", "items": "string", "logicalType": "mytype"}`?
If the first is incorrect, this would argue that the replacement rules at the top of the issue, `{'type': 'test.point'}` or `{'type': 'test.point', 'logicalType': 'LogicalPoint'}`, are also syntactically incorrect? The spec says that `{'type': 'int'}` is the same as `'int'`, but this is only explicitly mentioned for primitives.
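That primitive-only equivalence could be captured in a small normalization helper (my own sketch, not fastavro code); it deliberately refuses to unwrap named-type references, since whether that is legal is exactly the open question:

```python
PRIMITIVES = {"null", "boolean", "int", "long",
              "float", "double", "bytes", "string"}

def unwrap_primitive(schema):
    # The Avro spec only guarantees this equivalence for primitive names:
    # {"type": "int"} denotes the same schema as "int". Unwrapping is
    # also skipped when extra attributes such as logicalType are present,
    # since dropping the wrapper would lose them.
    if (
        isinstance(schema, dict)
        and set(schema) == {"type"}
        and schema["type"] in PRIMITIVES
    ):
        return schema["type"]
    return schema
```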
Anyway, until I understand what is actually correct, I am not sure it makes sense to move forward with an implementation that may in fact accept invalid schemas.
Thanks for having a look though.
I discovered that fastavro fails to parse schemas of the form
or
where an error is thrown saying the schema can't be loaded. Top-level logical types with primitive types, however, work. The error seems to be in `_read_schema`, where the `dict` branch for a schema does not check a named schema for its type. I thought that according to the spec, however, the above should be allowed? Below is code where several other variants are tried that work:
with results:
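For illustration, the kind of check described above might look roughly like this (a hypothetical simplification written by me, not fastavro's actual `_read_schema`): when the `type` key holds a name that is neither a primitive nor a complex-type keyword, resolve it as a reference to a known named schema instead of rejecting it.

```python
PRIMITIVES = {"null", "boolean", "int", "long",
              "float", "double", "bytes", "string"}
COMPLEX = {"record", "error", "enum", "fixed", "array", "map"}

def resolve_type(schema, named_schemas):
    """Return the underlying Avro type keyword for a schema.

    named_schemas maps full names (e.g. "test.point") to previously
    parsed named-schema definitions.
    """
    schema_type = schema["type"] if isinstance(schema, dict) else schema
    if isinstance(schema_type, dict):  # wrapped schema, e.g. for logicalType
        return resolve_type(schema_type, named_schemas)
    if schema_type in PRIMITIVES or schema_type in COMPLEX:
        return schema_type
    if schema_type in named_schemas:  # named reference, e.g. "test.point"
        return named_schemas[schema_type]["type"]
    raise ValueError(f"unknown schema type: {schema_type}")
```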