@extrange
all data classes should be serializable recursively using this:
```python
from dataclasses import asdict
import json

json.dumps(asdict(your_dataclass))
```
Let me know if this satisfies your use case, and whether your PR has any other use cases that are not mentioned here.
Thanks for looking at this!
I'm not too familiar with dataclasses, but it looks like there's an issue with parsing nested dataclasses, for example:
```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Bar:
    y: str


@dataclass
class Foo:
    x: str
    bar: Bar


baz = Foo(x="x", bar=Bar(y="y"))
out = json.dumps(asdict(baz))
# '{"x": "x", "bar": {"y": "y"}}'
```
The parameter `bar` is not parsed, nor type-checked:
```python
Foo(**json.loads(out))
# Foo(x='x', bar={'y': 'y'})
```
We can do it manually, but this will break if new keys are added:
```python
obj = json.loads(out)
Foo(x=obj["x"], bar=Bar(**obj["bar"]))
# Foo(x='x', bar=Bar(y='y'))
```
I didn't consider the parsing case, to be honest. Is there a use case in your mind that needs converting the JSON back to a dataclass in faster-whisper? As far as I'm concerned, these classes are pretty stable and no keys have been added or removed for a long time, so the manual solution might not be so bad after all.
This PR solves the serialization problem with minimal code changes and no additional dependencies. Pydantic might be the best solution, but I think it's overkill for this repo and would require a lot of changes that I might not be comfortable with.
The main use case for us is when working with AWS SageMaker endpoints, for which we have to serialize parameters and deserialize the output of the transcription.
For now we're using a helper library which we've pinned to specific commits of faster-whisper.
You can still use Pydantic validation, at least for type checking:
```python
import pydantic

from faster_whisper.transcribe import Segment


@pydantic.dataclasses.dataclass
class Segment(Segment):
    ...


Segment(
    **{
        "id": "invalid_id",
        "seek": 0,
        "start": 0.0,
        "end": 1.0,
        "text": "transcription",
        "tokens": [1, 2, 3],
        "avg_logprob": 0.1,
        "compression_ratio": 0.1,
        "no_speech_prob": 0.1,
        "words": None,
        "temperature": 1.0,
    }
)
```
```
ValidationError: 1 validation error for Segment
id
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='invalid_id', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/int_parsing
```
When this PR is merged, you can use the wrapper to add Pydantic type validation to the existing dataclass instances; these instances can be passed to any function that expects an instance of the original classes.
Since the parameters don't contain nested dataclass instances, deserialization should work perfectly, and you can deserialize the transcription results without going through them recursively.
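For example, something along these lines should rebuild a full segment, nested words included. This is just a sketch: the `ValidatedSegment` name and the sample payload are made up for illustration, and it assumes `Segment` and `Word` are the plain dataclasses from `faster_whisper.transcribe` with their usual fields.

```python
import json

import pydantic

from faster_whisper.transcribe import Segment


# Hypothetical wrapper name; it only adds Pydantic validation on top of the
# existing stdlib dataclass, so instances are still valid Segment objects.
@pydantic.dataclasses.dataclass
class ValidatedSegment(Segment):
    ...


# Illustrative payload, e.g. what a serving endpoint might return after
# json.dumps(asdict(segment)) on the other side.
payload = json.dumps(
    {
        "id": 1,
        "seek": 0,
        "start": 0.0,
        "end": 1.0,
        "text": "transcription",
        "tokens": [1, 2, 3],
        "avg_logprob": 0.1,
        "compression_ratio": 0.1,
        "no_speech_prob": 0.1,
        "words": [
            {"start": 0.0, "end": 1.0, "word": "transcription", "probability": 0.9}
        ],
        "temperature": 1.0,
    }
)

segment = ValidatedSegment(**json.loads(payload))
# Pydantic validates the nested dicts and rebuilds them as Word dataclass
# instances, so segment.words[0] is a Word rather than a plain dict.
```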
That's a nice solution. Pydantic supports nested stdlib dataclasses, so I think this will also work for us (the deserialization use case alone probably isn't enough to justify adding Pydantic as a dependency).
We'll probably still use Pydantic models for parameter passing into `.transcribe()`, since we need to serialize/deserialize the arguments to/from JSON.
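Roughly like this (just a sketch; `TranscribeParams` is our own model, and the fields shown are a small illustrative subset of the `transcribe()` parameters):

```python
from typing import Optional

import pydantic

from faster_whisper import WhisperModel


# Illustrative subset of transcribe() parameters; the real call accepts many more.
class TranscribeParams(pydantic.BaseModel):
    language: Optional[str] = None
    beam_size: int = 5
    word_timestamps: bool = False


params = TranscribeParams(language="en", beam_size=5)
payload = params.model_dump_json()  # serialize the request parameters

# ... on the serving side, rebuild the arguments and call transcribe() ...
model = WhisperModel("tiny")
segments, info = model.transcribe(
    "audio.wav", **TranscribeParams.model_validate_json(payload).model_dump()
)
```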
This is to allow conversion of `Segment` and `Word` to `dict` or JSON without recursively going through them. Similar to #667 and #1104. Unlike NamedTuples, dataclasses are not iterable, so they need to be converted to a dict first.