clamsproject / app-whisper-wrapper

Apache License 2.0

whitespace in token `word` prop #20

Closed keighrim closed 4 months ago

keighrim commented 5 months ago

Bug Description

In some output MMIF files, each Token annotation has a `word` property value that starts with a whitespace character. We first need to investigate whether this is a whisper (upstream) bug, and then make adjustments accordingly.

Reproduction steps

For example, in aapb-evaluations/asr_eval/preds@whisper-wrapper-tiny@aapb-collaboration-21/cpb-aacip-507-zw18k75z4h.whisper-tiny.mmif:

        {
          "@type": "http://vocab.lappsgrid.org/Token",
          "properties": {
            "word": " Funding",   # <- leading whitespace
            "start": 0,
            "end": 8,
            "document": "v_0:td_1",
            "id": "to_1"
          }
        },
        {
          "@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v1",
          "properties": {
            "frameType": "speech",
            "start": 39.58,
            "end": 40.26,
            "id": "tf_1"
          }
        },
        {
          "@type": "http://mmif.clams.ai/vocabulary/Alignment/v1",
          "properties": {
            "source": "tf_1",
            "target": "to_1",
            "id": "al_2"
          }
        },
        {
          "@type": "http://vocab.lappsgrid.org/Token",
          "properties": {
            "word": " for",  # <- leading whitespace
            "start": 9,
            "end": 13,
            "document": "v_0:td_1",
            "id": "to_2"
          }
        },
      ...
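
To check how widespread this is in a given file, something like the following could be used. This is only a minimal sketch: it assumes the MMIF JSON has a top-level `views` list whose entries each hold an `annotations` list, and that Token annotations use exactly the `@type` string shown in the excerpt above. The function name `find_padded_tokens` and the in-memory `sample` are made up for illustration; they are not part of the wrapper or of any MMIF library.

```python
import json

# Assumption: Token annotations carry exactly this @type, as in the excerpt above.
TOKEN_TYPE = "http://vocab.lappsgrid.org/Token"


def find_padded_tokens(mmif):
    """Return (id, word) pairs for Token annotations whose `word`
    has leading or trailing whitespace."""
    padded = []
    for view in mmif.get("views", []):
        for ann in view.get("annotations", []):
            if ann.get("@type") != TOKEN_TYPE:
                continue
            props = ann.get("properties", {})
            word = props.get("word", "")
            if word != word.strip():
                padded.append((props.get("id"), word))
    return padded


# Hypothetical, trimmed-down stand-in for a real MMIF file.
sample = {
    "views": [
        {
            "annotations": [
                {"@type": TOKEN_TYPE,
                 "properties": {"word": " Funding", "start": 0, "end": 8, "id": "to_1"}},
                {"@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v1",
                 "properties": {"frameType": "speech", "id": "tf_1"}},
                {"@type": TOKEN_TYPE,
                 "properties": {"word": " for", "start": 9, "end": 13, "id": "to_2"}},
            ]
        }
    ]
}

for token_id, word in find_padded_tokens(sample):
    # repr() makes the leading space visible in the output.
    print(token_id, repr(word))
```

For a file on disk, the same function can be applied to the result of `json.load(open(path))`.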

Expected behavior

No response

Log output

No response

Screenshots

No response

Additional context

No response

selenasong commented 4 months ago

This problem is already solved in v6.

keighrim commented 4 months ago

Fixed in 80d808d991255d20f7c1c2b9aab6f3a506c869e0 (v4); closing the issue.
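
For readers who land here later: the commit itself is not reproduced in this thread, so the following is not the actual fix, only a minimal sketch of the kind of adjustment discussed above. It assumes the raw word strings arrive with leading spaces (as in the excerpt), strips them, and recomputes character offsets against a single-space-joined text. The name `normalize_words` and the output dictionary layout are hypothetical.

```python
def normalize_words(raw_words):
    """Strip whitespace from each raw word and compute character offsets
    into the text formed by joining the cleaned words with single spaces."""
    tokens = []
    cursor = 0
    for raw in raw_words:
        word = raw.strip()
        if not word:
            # Skip entries that are only whitespace.
            continue
        start = cursor
        end = start + len(word)
        tokens.append({"word": word, "start": start, "end": end})
        cursor = end + 1  # account for the single separating space
    text = " ".join(t["word"] for t in tokens)
    return text, tokens


text, tokens = normalize_words([" Funding", " for"])
print(repr(text))
for t in tokens:
    # Each offset pair slices the cleaned word back out of the joined text.
    print(t, repr(text[t["start"]:t["end"]]))
```

With this approach the offsets always slice the whitespace-free word out of the joined text, so the `word` value and the `start`/`end` span stay consistent with each other.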