SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

replace `NamedTuple` with `dataclass` #1105

Closed: MahmoudAshraf97 closed this 2 weeks ago

MahmoudAshraf97 commented 3 weeks ago

This is to allow conversion of `Segment` and `Word` objects to dict or JSON without recursively going through them. Similar to #667 and #1104.

Unlike `NamedTuple`s, dataclasses are not iterable, so they need to be converted to a dict first.
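
As a quick illustration of the difference, here is a minimal sketch using hypothetical Point types rather than the faster-whisper classes:

from dataclasses import asdict, dataclass
from typing import NamedTuple

# Hypothetical example types, only to show the behavioural difference
class PointTuple(NamedTuple):
    x: int
    y: int

@dataclass
class PointData:
    x: int
    y: int

list(PointTuple(1, 2))   # [1, 2] -- NamedTuples are iterable
asdict(PointData(1, 2))  # {'x': 1, 'y': 2} -- dataclasses go through asdict()
# list(PointData(1, 2))  # would raise TypeError: 'PointData' object is not iterable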

MahmoudAshraf97 commented 3 weeks ago

@extrange

All dataclasses should be serializable recursively using this:

from dataclasses import asdict
import json
json.dumps(asdict(your_dataclass))

Let me know if this satisfies your use case, and whether your PR covers any other use cases that are not mentioned here.
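
For example, serializing a whole transcription this way might look like the sketch below (assuming the dataclass-based `Segment` from this PR; the model size and audio path are placeholders):

import json
from dataclasses import asdict

from faster_whisper import WhisperModel

# Serialize every segment of a transcription to JSON, assuming
# Segment/Word are dataclasses as proposed in this PR
model = WhisperModel("tiny")
segments, info = model.transcribe("audio.wav")
payload = json.dumps([asdict(segment) for segment in segments])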

extrange commented 3 weeks ago

Thanks for looking at this!

I'm not too familiar with dataclasses, but it looks like there's an issue with parsing nested dataclasses. For example:

import json
from dataclasses import dataclass, asdict

@dataclass
class Bar:
    y: str

@dataclass
class Foo:
    x: str
    bar: Bar

baz = Foo(x="x", bar=Bar(y="y"))

out = json.dumps(asdict(baz))
# '{"x": "x", "bar": {"y": "y"}}'

The `bar` field is neither parsed back into a `Bar` instance nor type-checked:

Foo(**json.loads(out))
# Foo(x='x', bar={'y': 'y'})

We can do it manually, but this will break if new keys are added:

obj = json.loads(out)
Foo(x=obj["x"], bar=Bar(**obj["bar"]))
# Foo(x='x', bar=Bar(y='y'))
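
One way to make the manual reconstruction slightly more future-proof is a small generic helper that walks the dataclass fields. This is only a sketch: it assumes annotations are real classes (no postponed evaluation) and does not handle lists or Optional fields wrapping dataclasses:

from dataclasses import fields, is_dataclass

def from_dict(cls, data):
    # Rebuild nested dataclasses from a plain dict, field by field
    kwargs = {}
    for field in fields(cls):
        value = data[field.name]
        if is_dataclass(field.type) and isinstance(value, dict):
            value = from_dict(field.type, value)
        kwargs[field.name] = value
    return cls(**kwargs)

from_dict(Foo, json.loads(out))
# Foo(x='x', bar=Bar(y='y'))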

MahmoudAshraf97 commented 3 weeks ago

I didn't consider the parsing case, to be honest. Is there a use case in your mind that needs converting the JSON back to a dataclass in faster-whisper? As far as I'm concerned, these classes are pretty stable and no keys have been added or removed in a long time, so the manual solution might not be so bad after all.

This PR solves the serialization problem with minimal code changes and no additional dependencies. Pydantic might be the best solution, but I think it's overkill for this repo and would require a lot of changes that I might not be comfortable with.

extrange commented 3 weeks ago

The main use case for us is when working with AWS SageMaker endpoints, for which we have to serialize parameters and deserialize the output of the transcription.

For now we're using a helper library which we've pinned to specific commits of faster-whisper.

MahmoudAshraf97 commented 3 weeks ago

You can still use Pydantic validation, at least for type checking:

import pydantic
from faster_whisper.transcribe import Segment

@pydantic.dataclasses.dataclass
class Segment(Segment):
    ...

Segment(
    **{
        "id": "invalid_id",
        "seek": 0,
        "start": 0.0,
        "end": 1.0,
        "text": "transcription",
        "tokens": [1, 2, 3],
        "avg_logprob": 0.1,
        "compression_ratio": 0.1,
        "no_speech_prob": 0.1,
        "words": None,
        "temperature": 1.0,
    }
)
ValidationError: 1 validation error for Segment
id
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='invalid_id', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/int_parsing

When this PR is merged, you can use this wrapper to add pydantic type validation to the existing dataclasses, and the wrapped instances can be passed to any function that expects an instance of the original classes.

Since the parameters don't contain nested dataclass instances, deserialization should work perfectly, and you can deserialize the transcription results without going through them recursively.
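
A sketch of the result-deserialization side, reusing the field values from the example above (and assuming Pydantic v2 plus the dataclass-based `Segment` from this PR):

import json

import pydantic
from faster_whisper.transcribe import Segment

# Round-trip a serialized transcription back into validated Segment objects;
# TypeAdapter validates stdlib dataclasses, including nested fields
payload = json.dumps([{
    "id": 1, "seek": 0, "start": 0.0, "end": 1.0, "text": "transcription",
    "tokens": [1, 2, 3], "avg_logprob": 0.1, "compression_ratio": 0.1,
    "no_speech_prob": 0.1, "words": None, "temperature": 1.0,
}])
segments = pydantic.TypeAdapter(list[Segment]).validate_json(payload)
# segments[0] is a Segment instance with validated field types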

extrange commented 3 weeks ago

That's a nice solution. Pydantic supports nested stdlib dataclasses, so I think this will also work for us (and the deserialization use case alone probably isn't enough to justify pydantic as an additional dependency).

We'll probably still use Pydantic models for parameter passing into .transcribe(), since we need to serialize/deserialize the arguments to/from JSON.
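
For what it's worth, a sketch of what that parameter model might look like; the fields below are just an illustrative subset of `.transcribe()` keyword arguments:

from typing import Optional

import pydantic
from faster_whisper import WhisperModel

class TranscribeParams(pydantic.BaseModel):
    # Illustrative subset of .transcribe() keyword arguments
    language: Optional[str] = None
    beam_size: int = 5
    word_timestamps: bool = False
    vad_filter: bool = False

# Deserialize request parameters from JSON, then pass them through
params = TranscribeParams.model_validate_json('{"language": "en", "beam_size": 1}')
model = WhisperModel("tiny")
segments, info = model.transcribe("audio.wav", **params.model_dump())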