deepgram / deepgram-python-sdk

Official Python SDK for Deepgram's automated speech recognition APIs.
https://developers.deepgram.com
MIT License

Very slow JSON serialization and deserialization and blocking event loop #489

Open Luksalos opened 4 days ago

Luksalos commented 4 days ago

What is the current behavior?

PrerecordedResponse.from_json(result) is very slow, especially for larger inputs. The cause is the dataclasses-json library: its maintainers are already aware of the performance issue but haven't addressed it since 2020. Besides .from_json(), the .to_dict() operation is also very slow, and that is the call one would use to parse the output of the Deepgram SDK into their own Pydantic model.

In our case, for recordings lasting around 1 hour:

source = {"url": signed_url}
options = PrerecordedOptions(
    model="nova-2-general",
    diarize=True,
    utterances=True,
    paragraphs=True,
)
deepgram.listen.rest.v("1").transcribe_url(source, options=options)

For the 1-hour recording, .from_json() takes over 10 seconds, while Pydantic parsing takes ~30 ms. For a 7-minute recording, .from_json() took ~1.7 seconds, while Pydantic parsing took ~5 ms.

This issue also affects the asynchronous version, where the problem is even more significant as it blocks the event loop for a long time.
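
As a possible stopgap on the caller's side, a slow parse can be moved off the event loop. The sketch below is only an illustration of that idea: `parse_blocking` and `parse_off_loop` are hypothetical helpers, and `json.loads` stands in for the slow SDK parsing step. Running the work in a thread does not make it faster (the GIL still applies), it only lets other tasks run in the meantime.

```python
import asyncio
import json


def parse_blocking(raw: str) -> dict:
    """Hypothetical stand-in for a slow, CPU-bound parse of the response body."""
    return json.loads(raw)


async def parse_off_loop(raw: str) -> dict:
    # asyncio.to_thread (Python 3.9+) runs the callable in a worker thread,
    # so the event loop can keep servicing other tasks while the parse runs.
    return await asyncio.to_thread(parse_blocking, raw)


async def main() -> dict:
    # Made-up payload shaped loosely like a transcription response.
    raw = json.dumps({"results": {"utterances": [{"id": i} for i in range(1000)]}})
    return await parse_off_loop(raw)


if __name__ == "__main__":
    parsed = asyncio.run(main())
    print(len(parsed["results"]["utterances"]))
```

For real parallelism a process pool would be needed instead of a thread, at the cost of pickling the result back to the main process.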

Expected behavior

JSON serialization and deserialization shouldn't take that long, and CPU-heavy operations should definitely not block the event loop. Please consider using Pydantic or raw dataclasses.
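
To illustrate what the "raw dataclasses" option could look like, here is a minimal sketch that decodes with the standard-library `json` module and builds plain dataclasses by hand. The `Word` and `Utterance` classes and the `parse_utterances` helper are simplified, hypothetical stand-ins, not the SDK's actual response classes, and the sample payload is made up.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    word: str
    start: float
    end: float


@dataclass
class Utterance:
    speaker: int
    transcript: str
    words: List[Word]

    @classmethod
    def from_dict(cls, data: dict) -> "Utterance":
        # Pick keys explicitly so that extra keys in the payload are ignored.
        return cls(
            speaker=data["speaker"],
            transcript=data["transcript"],
            words=[
                Word(word=w["word"], start=w["start"], end=w["end"])
                for w in data["words"]
            ],
        )


def parse_utterances(raw: str) -> List[Utterance]:
    """Decode once with json.loads, then construct dataclasses directly."""
    payload = json.loads(raw)
    return [Utterance.from_dict(u) for u in payload["results"]["utterances"]]


SAMPLE = json.dumps({
    "results": {
        "utterances": [
            {
                "speaker": 0,
                "transcript": "hello world",
                "confidence": 0.99,  # extra key, ignored by from_dict
                "words": [
                    {"word": "hello", "start": 0.0, "end": 0.4},
                    {"word": "world", "start": 0.5, "end": 0.9},
                ],
            }
        ]
    }
})

utterances = parse_utterances(SAMPLE)
```

The trade-off is that each class needs a hand-written `from_dict`, but there is no reflection-based decoding step on the hot path.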

jjmaldonis commented 4 days ago

Adding __slots__ to the dataclasses may help -- this is worth a quick try. I have not tested, and I don't know if dataclasses actually support __slots__, but adding the class variable can result in dramatic speed improvements.

Overall, my opinion is that dataclasses begin to break down once their usage extends past their immediate value proposition, at which point a different implementation tends to work better. Pydantic tends to be used for input validation, which isn't a critically important feature within this SDK because responses do not need to be validated. I'm a big fan of Pydantic in general, though, and choosing a different class implementation may give us the speed and flexibility wins we're looking for. The caveat is that moving away from dataclasses would be a major breaking change.