Closed: lexasoft123 closed this issue 4 years ago
Can you verify that you are passing a parsed schema into parse_msg? You should be able to verify this because the parsed schema will have a key called __fastavro_parsed.
The parse_schema function checks for this key to decide whether it should return right away or continue parsing.
@scottbelden If I print the schema after the fastavro.parse_schema call, there is no __fastavro_parsed key in it.
Why might this happen?
I'm not sure. I would add checks throughout the code and see where the assumption stops holding true. For example:
import json
from io import BytesIO

import fastavro


def get_schema():
    with open('schema.json', 'r') as file:
        schema = file.read()
    parsed_schema = fastavro.parse_schema(json.loads(schema))
    # The parsed schema should carry the marker key.
    assert "__fastavro_parsed" in parsed_schema
    return parsed_schema


def parse_msg(msg, schema):
    bytes_io = BytesIO(msg)
    assert "__fastavro_parsed" in schema
    msg = fastavro.schemaless_reader(bytes_io, schema, reader_schema=schema)
    return msg

...

schema = get_schema()
assert "__fastavro_parsed" in schema
while True:
    msg = c.poll(timeout=1.0)
    assert "__fastavro_parsed" in schema
    parse_msg(msg.value(), schema)
@scottbelden It fails immediately after returning from parse_schema, right after this line:
    parsed_schema = fastavro.parse_schema(json.loads(schema))
Interesting. Is the schema being loaded a dictionary or a list? Are you able to provide the schema (or a simple version of the schema that shows the problem)?
The schema contains a list. It is available here: https://github.com/Nasdaq/CloudDataService/blob/master/ncds-sdk/src/main/resources/schemas/TOTALVIEW.avsc
Okay. That explains it. We currently only take the fast route if the schema is a dictionary and has the key, so a list will always be parsed again. I think it should be fairly easy to fix this.
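To illustrate the explanation above, here is a minimal toy model of the short-circuit. It is not fastavro's actual code: parse_schema_model and the stats counter are hypothetical names made up for this sketch, and only the __fastavro_parsed key comes from this thread. It shows why a dictionary can take the fast route on the second call while a list cannot.

```python
PARSED_MARKER = "__fastavro_parsed"  # marker key named in this thread


def parse_schema_model(schema, stats):
    """Toy model of the short-circuit described above (hypothetical helper)."""
    # Fast route: only a dict can carry the marker key.
    if isinstance(schema, dict) and PARSED_MARKER in schema:
        stats["fast"] += 1
        return schema
    # Slow route: everything else is parsed in full.
    stats["full"] += 1
    if isinstance(schema, dict):
        parsed = dict(schema)
        parsed[PARSED_MARKER] = True
        return parsed
    # A list has no place to store the marker in this model, so the
    # result still looks unparsed on the next call.
    return list(schema)


stats = {"fast": 0, "full": 0}

record = {"type": "record", "name": "A", "fields": []}
parsed_record = parse_schema_model(record, stats)   # full parse
parse_schema_model(parsed_record, stats)            # fast route

union = [record, {"type": "record", "name": "B", "fields": []}]
parsed_union = parse_schema_model(union, stats)     # full parse
parse_schema_model(parsed_union, stats)             # full parse again

print(stats)
```

In this model, every call with a list goes through the full parse, which matches the per-message cost reported in this issue.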
Hello. It seems there is a mistake in schemaless_reader: parse_schema is called on every call, even if I pass an already parsed schema as an argument.
Here is an example of the code I use to parse small Avro messages consumed from Kafka:
I noticed that parsing consumes most of the CPU time. I ran a profiler and saw that most of the time is spent in parse_schema.
The following patch gave me a 1000x performance improvement:
Please explain: is this a bug, or is there a problem with how I use the library?
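For anyone who wants to reproduce the profiling step mentioned in this report, here is a minimal sketch using the standard library's cProfile. The functions expensive_parse and decode_message are hypothetical stand-ins, not fastavro calls or the reporter's code. They only simulate a decode loop where per-message schema work dominates.

```python
import cProfile
import io
import pstats


def expensive_parse(schema):
    # Hypothetical stand-in for schema parsing repeated on every message.
    return sum(len(str(item)) for item in schema * 200)


def decode_message(payload, schema):
    # Hypothetical stand-in for a per-message decode call.
    expensive_parse(schema)
    return payload.decode("utf-8")


def profile_decoding(n_messages=200):
    schema = [{"type": "record", "name": "A", "fields": []}]
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_messages):
        decode_message(b"hello", schema)
    profiler.disable()
    # Sort by cumulative time so the dominant call path appears first.
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()


report = profile_decoding()
print(report)
```

Sorting by cumulative time makes it easy to see when a helper such as the parsing step accounts for nearly all of the time spent under the decode call.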