[Feature request]: Multilingual yt - formatting issues

julliannailluj commented 1 month ago

What do you need?

I'm trying to use the yt --transcript function with a language other than English (French). The transcript has formatting problems:

writes "Ã©" instead of "é", "Ãª" instead of "ê", "Ã" instead of "à"...
no punctuation
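For context, substitutions like these are the usual signature of UTF-8 bytes being decoded with a single-byte codec. The sketch below reproduces the symptom under the assumption that the wrong codec is cp1252 (Windows-1252). That is a guess, not something confirmed for yt, and `simulate_mojibake` is a hypothetical helper name.

```python
def simulate_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec.

    Assumption: cp1252 is the codec applied by mistake. Each accented
    character becomes two characters, the first of which is "Ã".
    """
    return text.encode("utf-8").decode(wrong_encoding)


if __name__ == "__main__":
    # "é" is the bytes C3 A9 in UTF-8; cp1252 reads them as "Ã" and "©".
    for char in ("é", "ê", "ç"):
        print(char, "->", simulate_mojibake(char))
    # "à" is C3 A0; A0 is a no-break space in cp1252, so the result
    # looks like a bare "Ã" followed by a space.
    print(repr(simulate_mojibake("à")))
```

If this assumption holds, the garbling is introduced when the text is decoded or printed rather than by YouTube's subtitle file itself. The missing punctuation would be a separate problem.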

I tried to fix it with 2 different approaches:

using a custom pattern to fix those typos.
editing the yt.py by including a function to replace the defective set of typos with the correct ones.

From my attempts, the best fix for now is using a pattern. I tried it in 2 ways: in natural language, and by asking it to mimic a given Python function. The second solution worked best, but it's not perfect and behaves unpredictably. It often fixes the formatting problem, but sometimes replaces a few words with other words. Also, the punctuation is usually missing, or not as good as it is in English. These problems happen even when YouTube has a correct subtitle file in French. It also sometimes doesn't work at all and just gives me comments about the Python function.

The results are encouraging, but very random. I'm willing to improve it but maybe it's not the right approach. Any suggestions are welcome.

Here is an example of a command I used:

yt --transcript --lang 'fr' https://www.youtube.com/watch?v=oiKj0Z_Xnjc | fabric --model llama3:latest -sp convert_fr

And the output:

`It seems you're asking me to run the fix_encoding_issues function on this text. I'll do that for you.

Please note that the original encoding of the text is not specified, but based on the presence of non-ASCII characters (e.g., Ã¢, Ã¨, Ãª, etc.), I assume it's encoded in UTF-8.

Here's the cleaned text:

Enfin je saurai où je vais Maman dit que lorsqu'on cherche bien On finit toujours par trouver Elle dit qu'il n'est jamais très loin Qu'il part très souvent travailler Maman dit "travailler c'est bien" Bien mieux qu'être mal accompagné Pas vrai ? Où est ton papa ? Dis-moi où est ton papa ? Sans même devoir lui parler Il sait ce qui ne va pas Ah sacré papa Dis-moi où es-tu caché ? ?a doit, faire au moins mille fois que j'ai Compté mes doigts Hey ! Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ? Quoi, qu'on y croit ou pas Y aura bien un jour où on n'y croira plus Un jour ou l'autre on sera tous papa Et d'un jour ? l'autre on aura disparu Serons-nous détestables ? Serons-nous admirables ? Des géniateurs ou des génies ? Dites-nous qui donne naissance aux irresponsables ? Ah dites-nous qui, tiens Tout le monde sait comment on fait des bébés Mais personne sait comment on fait des papas Monsieur Je-sais-tout en aurait hérité, c'est ça Faut l'sucer d'son pouce ou quoi ? Dites-nous où c'est caché, ça doit Faire au moins mille fois qu'on a Bouffé nos doigts Hey ! Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ?`

Finally, here is the content of my custom pattern "convert_fr":

cat system.md

You will only execute the following python functions in a given text. You do not delete or add words and lines at all. Keep all the original text, even if there are any grammatical mistakes.

import re

def fix_encoding_issues(text):
    replacements = {
        "Ã©": "é", "Ã¨": "è", "Ãª": "ê", "Ã ": "à", "Ã¢": "â", "Ã§": "ç", "Ã«": "ë", "Ã®": "î",
        "Ã´": "ô", "Ã¹": "ù", "Ã»": "û", "Ã¼": "ü", "Ã¿": "ÿ", "Ã€": "À", "Ã‚": "Â", "Ãƒ": "Ã",
        "Ã„": "Ä", "Ã…": "Å", "Ã†": "Æ", "Ã‡": "Ç", "Ãˆ": "È", "Ã‰": "É", "ÃŠ": "Ê", "Ã‹": "Ë",
        "ÃŒ": "Ì", "Ã": "Í", "ÃŽ": "Î", "Ã‘": "Ñ", "Ã’": "Ò", "Ã“": "Ó", "Ã”": "Ô", "Ã•": "Õ",
        "Ã–": "Ö", "Ã˜": "Ø", "Ã™": "Ù", "Ãš": "Ú", "Ã›": "Û", "Ãœ": "Ü", "Ã": "Ý", "Ãž": "Þ",
        "ÃŸ": "ß", "Ã¡": "á", "Ã¢": "â", "Ã£": "ã", "Ã¤": "ä", "Ã¥": "å", "Ã¦": "æ", "Ã§": "ç",
        "Ã¨": "è", "Ã©": "é", "Ãª": "ê", "Ã«": "ë", "Ã¬": "ì", "Ã®": "î", "Ã¯": "ï", "Ã°": "ð",
        "Ã±": "ñ", "Ã²": "ò", "Ã³": "ó", "Ã´": "ô", "Ãµ": "õ", "Ã¶": "ö", "Ã·": "÷", "Ã¸": "ø",
        "Ã¹": "ù", "Ãº": "ú", "Ã»": "û", "Ã¼": "ü", "Ã½": "ý", "Ã¾": "þ", "Ã¿": "ÿ",
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

def main():
    # Load the text from a file
    with open('input.txt', 'r', encoding='utf-8') as file:
        text = file.read()

    # Fix encoding issues
    text = fix_encoding_issues(text)

    # Save the cleaned text to a new file
    with open('output.txt', 'w', encoding='utf-8') as file:
        file.write(text)

if __name__ == "__main__":
    main()

julliannailluj commented 1 month ago

UPDATE: The best solution I've found for now is to pipe the unaltered transcript into a python script in this way:

yt --transcript --lang 'fr' https://www.youtube.com/watch?v=HLD3BFdE0fU | python3 fix_french_typos.py

There must be a better way, for example specifying the correct encoding (UTF-8 or something) somewhere, but I don't know how to do it, and this works.

Here is the python script:

import re
import sys

def fix_encoding_issues(text):
    replacements = {
        "Ã©": "é", "Ã¨": "è", "Ãª": "ê", "Ã ": "à", "Ã¢": "â", "Ã§": "ç", "Ã«": "ë", "Ã®": "î",
        "Ã´": "ô", "Ã¹": "ù", "Ã»": "û", "Ã¼": "ü", "Ã¿": "ÿ", "Ã€": "À", "Ã‚": "Â", "Ãƒ": "Ã",
        "Ã„": "Ä", "Ã…": "Å", "Ã†": "Æ", "Ã‡": "Ç", "Ãˆ": "È", "Ã‰": "É", "ÃŠ": "Ê", "Ã‹": "Ë",
        "ÃŒ": "Ì", "Ã": "Í", "ÃŽ": "Î", "Ã‘": "Ñ", "Ã’": "Ò", "Ã“": "Ó", "Ã”": "Ô", "Ã•": "Õ",
        "Ã–": "Ö", "Ã˜": "Ø", "Ã™": "Ù", "Ãš": "Ú", "Ã›": "Û", "Ãœ": "Ü", "Ã": "Ý", "Ãž": "Þ",
        "ÃŸ": "ß", "Ã¡": "á", "Ã¢": "â", "Ã£": "ã", "Ã¤": "ä", "Ã¥": "å", "Ã¦": "æ", "Ã§": "ç",
        "Ã¨": "è", "Ã©": "é", "Ãª": "ê", "Ã«": "ë", "Ã¬": "ì", "Ã®": "î", "Ã¯": "ï", "Ã°": "ð",
        "Ã±": "ñ", "Ã²": "ò", "Ã³": "ó", "Ã´": "ô", "Ãµ": "õ", "Ã¶": "ö", "Ã·": "÷", "Ã¸": "ø",
        "Ã¹": "ù", "Ãº": "ú", "Ã»": "û", "Ã¼": "ü", "Ã½": "ý", "Ã¾": "þ", "Ã¿": "ÿ",
        "Ý": "à",
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

def main():
    # Read text from standard input
    input_text = sys.stdin.read()

    # Fix encoding issues
    corrected_text = fix_encoding_issues(input_text)

    # Print the cleaned text to standard output
    print(corrected_text)

if __name__ == "__main__":
    main()
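As a possible alternative to the replacement table above, the wrong decoding step can be reversed in one go. This is a minimal sketch, assuming the transcript is UTF-8 that was decoded as cp1252 somewhere along the way. The codec choice and the name `repair_mojibake` are assumptions, not part of yt or fabric.

```python
import sys


def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo a "UTF-8 read as wrong_encoding" mistake when the text allows it.

    Assumption: the garbling came from decoding UTF-8 bytes as cp1252.
    If the round trip fails, the text is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not (entirely) produced by this mistake, so leave it untouched.
        return text


if __name__ == "__main__":
    # Process line by line so one unrepairable line does not block the rest.
    for line in sys.stdin:
        sys.stdout.write(repair_mojibake(line))
```

It can be piped the same way as the script above. One known limitation: a few bytes (for example 0x8D, which appears in the UTF-8 form of "Í") have no cp1252 mapping in Python, so lines containing them come back unchanged. Fixing the encoding where the transcript is produced remains the cleaner solution.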

CaeChao commented 1 month ago

Just changing this line above and rebuilding locally solved the yt multilingual encoding issue for me. I also mentioned this bug before, but there seems to be no fix for it yet.

julliannailluj commented 1 month ago

Just changing this line above and rebuilding locally solved the yt multilingual encoding issue for me. I also mentioned this bug before, but there seems to be no fix for it yet.

You're right! I tried exactly this before, but didn't think about rebuilding. I did 'pipx install . --force' and it was done! Thanks!

danielmiessler / fabric

[Feature request]: Multilingual yt - formatting issues #468
