Open tylike opened 1 month ago
~Apparently escape_double_quotes_and_backslashes is not valid for multibyte strings.~
~Maybe we should use a replace or replace_all function for escaping.~
~There may be other problems too.~
EDIT: It seems to be a tokenizer problem. The two characters 瑞典 became four tokens, each of which shows up as an invalid character (�) in the JSON output:
"transcription": [
{
...
"text": "945年6月,瑞典著名犹太人建筑师马克思·甘佩尔接到了一份邀请函。",
"tokens": [
...
{
"text": "9",
...
},
{
"text": "45",
...
},
{
"text": "年",
...
},
{
"text": "6",
...
},
{
"text": "月",
...
},
{
"text": ",",
...
},
{
"text": "�",
...
},
{
"text": "�",
...
},
{
"text": "�",
...
},
{
"text": "�",
...
},
{
"text": "著",
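A minimal sketch of what probably happens in the excerpt above, assuming the tokens are cut at byte boundaries that fall inside a UTF-8 character. The split points and the helper `merge_tokens` are hypothetical, chosen only for illustration, and are not taken from the actual tokenizer or from the project's code. Decoding each piece on its own gives replacement characters, while buffering bytes until they form valid UTF-8 recovers the text:

```python
# Illustration only: the byte boundaries below are made up for this demo,
# they are not the real tokenizer output.
text = "瑞典"
raw = text.encode("utf-8")  # 6 bytes, 3 per character

# Hypothetical split into four token pieces that cut through both characters.
pieces = [raw[0:1], raw[1:3], raw[3:5], raw[5:6]]

# Decoding each piece independently yields only U+FFFD replacement characters.
per_token = [p.decode("utf-8", errors="replace") for p in pieces]


def merge_tokens(byte_pieces):
    """Buffer bytes across tokens and emit text once the buffer decodes cleanly.

    Returns the decoded chunks and any bytes still waiting for completion.
    """
    out, buf = [], b""
    for piece in byte_pieces:
        buf += piece
        try:
            out.append(buf.decode("utf-8"))
            buf = b""
        except UnicodeDecodeError:
            # Incomplete (or invalid) sequence: keep buffering.
            continue
    return out, buf


merged, leftover = merge_tokens(pieces)
print(per_token)
print(merged, leftover)
```

Note that this simple buffer never gives up on genuinely invalid bytes; Python's `codecs.getincrementaldecoder("utf-8")` is a sturdier alternative for the same idea.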
Already reported here
Environment:
Command Used:
Issue:
.txt and .srt files are generated correctly.
.json file contains garbled/incorrect characters.
Additional Details:
.json, .srt, and .txt files are all generated correctly.
Steps to Reproduce:
.json file.
en_subtitle.json en_subtitle.srt.txt 80.zip