fastavro / fastavro

Fast Avro for Python
MIT License
645 stars 172 forks source link

Huge performance drop in schemaless_reader due to schema parsing on every call #444

Closed lexasoft123 closed 4 years ago

lexasoft123 commented 4 years ago

Hello. Seems that there is a mistake in schemaless_reader, due to it parse_schema is called on every function call even if I use already parsed schema as an argument.

Here is an example of code I use to parse small Avro messages consumed from Kafka:

def get_schema():
    with open('schema.json', 'r') as file:
        schema = file.read()
    return fastavro.parse_schema(json.loads(schema))

def parse_msg(msg, schema):
    bytes_io = BytesIO(msg)
    msg = fastavro.schemaless_reader(bytes_io, schema, reader_schema=schema)
    return msg

...

    schema = get_schema()
        while True:
            msg = c.poll(timeout=1.0)
            parse_msg(msg.value(), schema)

I noticed that parsing consumes most of CPU time. I ran a profiler and noticed that top time spent in parse_schema.

The following patch made a 1000x times performance benefit for me:

diff --git a/fastavro/_read.pyx b/fastavro/_read.pyx
index 3226b46..d460b42 100644
--- a/fastavro/_read.pyx
+++ b/fastavro/_read.pyx
@@ -838,7 +838,7 @@ cpdef schemaless_reader(fo, writer_schema, reader_schema=None,
         # No need for the reader schema if they are the same
         reader_schema = None

-    writer_schema = parse_schema(writer_schema)
+    #writer_schema = parse_schema(writer_schema)

     if reader_schema:
         reader_schema = parse_schema(reader_schema)
diff --git a/fastavro/_read_py.py b/fastavro/_read_py.py
index 17a4fff..1d4647f 100644
--- a/fastavro/_read_py.py
+++ b/fastavro/_read_py.py
@@ -892,10 +892,10 @@ def schemaless_reader(fo, writer_schema, reader_schema=None,
         # No need for the reader schema if they are the same
         reader_schema = None

-    writer_schema = parse_schema(writer_schema)
+    #writer_schema = parse_schema(writer_schema)

-    if reader_schema:
-        reader_schema = parse_schema(reader_schema)
+    #if reader_schema:
+    #    reader_schema = parse_schema(reader_schema)

     decoder = BinaryDecoder(fo)

Please explain, is it a bug or there is a problem in how I use the library.

scottbelden commented 4 years ago

Can you verify that you are passing in a parsed schema into parse_msg? You should be able to verify this because the parsed schema will have a key called __fastavro_parsed.

The parse_schema function checks for this key to know if it should return right away or continue parsing:

https://github.com/fastavro/fastavro/blob/6cbd4f70a505959918b4bd85b7aa863dbdfccdfb/fastavro/_schema_py.py#L191-L194

lexasoft123 commented 4 years ago

@scottbelden If I print schema after fastavro.parse_schema call, there is no __fastavro_parsed key there. Why might it happen?

scottbelden commented 4 years ago

I'm not sure. I would do a check throughout the code and see where it stops holding true. For example:

def get_schema():
    with open('schema.json', 'r') as file:
        schema = file.read()
    parsed_schema = fastavro.parse_schema(json.loads(schema))
    assert "__fastavro_parsed" in parsed_schema
    return parsed_schema

def parse_msg(msg, schema):
    bytes_io = BytesIO(msg)
    assert "__fastavro_parsed" in schema
    msg = fastavro.schemaless_reader(bytes_io, schema, reader_schema=schema)
    return msg

...

    schema = get_schema()
    assert "__fastavro_parsed" in schema
        while True:
            msg = c.poll(timeout=1.0)
            assert "__fastavro_parsed" in schema
            parse_msg(msg.value(), schema)
lexasoft123 commented 4 years ago

@scottbelden it fails immediately after returning from parse_schema parsed_schema = fastavro.parse_schema(json.loads(schema))

scottbelden commented 4 years ago

Interesting. Is the schema being loaded a dictionary or a list? Are you able to provide the schema (or a simple version of the schema that shows the problem?

lexasoft123 commented 4 years ago

Schema contains a list, it is available here: https://github.com/Nasdaq/CloudDataService/blob/master/ncds-sdk/src/main/resources/schemas/TOTALVIEW.avsc

scottbelden commented 4 years ago

Okay. That explains it. We currently only do the fast route if it is a dictionary and has the key; so a list will always be parsed. I think it should be fairly easy to fix this.