Open julliannailluj opened 1 month ago
UPDATE: The best solution I've found so far is to pipe the unaltered transcript into a Python script, like this:
yt --transcript --lang 'fr' https://www.youtube.com/watch?v=HLD3BFdE0fU | python3 fix_french_typos.py
There must be a better way, for example specifying the correct encoding (UTF-8 or similar) somewhere, but I don't know how to do that, and this works.
Here is the Python script:
import sys

def fix_encoding_issues(text):
    # Keys are UTF-8 characters mis-decoded as Windows-1252; values are the intended characters.
    replacements = {
        "Ã©": "é", "Ã¨": "è", "Ãª": "ê", "Ã\u00a0": "à", "Ã¢": "â", "Ã§": "ç", "Ã«": "ë",
        "Ã®": "î", "Ã´": "ô", "Ã¹": "ù", "Ã»": "û", "Ã¼": "ü", "Ã¿": "ÿ",
        "Ã€": "À", "Ã‚": "Â", "Ãƒ": "Ã", "Ã„": "Ä", "Ã…": "Å", "Ã†": "Æ", "Ã‡": "Ç",
        "Ãˆ": "È", "Ã‰": "É", "ÃŠ": "Ê", "Ã‹": "Ë", "ÃŒ": "Ì", "Ã\x8d": "Í", "ÃŽ": "Î",
        "Ã‘": "Ñ", "Ã’": "Ò", "Ã“": "Ó", "Ã”": "Ô", "Ã•": "Õ", "Ã–": "Ö", "Ã˜": "Ø",
        "Ã™": "Ù", "Ãš": "Ú", "Ã›": "Û", "Ãœ": "Ü", "Ã\x9d": "Ý", "Ãž": "Þ", "ÃŸ": "ß",
        "Ã¡": "á", "Ã£": "ã", "Ã¤": "ä", "Ã¥": "å", "Ã¦": "æ", "Ã¬": "ì", "Ã¯": "ï",
        "Ã°": "ð", "Ã±": "ñ", "Ã²": "ò", "Ã³": "ó", "Ãµ": "õ", "Ã¶": "ö", "Ã·": "÷",
        "Ã¸": "ø", "Ãº": "ú", "Ã½": "ý", "Ã¾": "þ",
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

def main():
    # Read text from standard input
    input_text = sys.stdin.read()
    # Fix encoding issues
    corrected_text = fix_encoding_issues(input_text)
    # Print the cleaned text to standard output
    print(corrected_text)

if __name__ == "__main__":
    main()
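A possible alternative to the lookup table, sketched here under the assumption that the damage comes from UTF-8 bytes being decoded as Windows-1252 (cp1252): re-encode each line with that codec and decode it as UTF-8. The function name fix_mojibake is hypothetical, and lines that do not round-trip cleanly are left untouched.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8-read-as-cp1252 damage line by line (assumed cause)."""
    fixed_lines = []
    for line in text.split("\n"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            fixed_lines.append(line.encode("cp1252").decode("utf-8"))
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Line is already correct or damaged differently: keep it as-is.
            fixed_lines.append(line)
    return "\n".join(fixed_lines)


print(fix_mojibake("Ã©tÃ©"))  # prints: été
```

Wrapped in a script that reads sys.stdin and prints the result, this could take the place of fix_french_typos.py in the pipe above. Characters whose bytes have no cp1252 equivalent would still need separate handling.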
Just changing this line above and rebuilding locally solved the yt multilingual encoding issue for me.
I also mentioned this bug before, but it seems there is no fix for it yet.
You're right! I tried exactly this before, but didn't think about rebuilding. I did 'pipx install . --force' and it was done! Thanks!
What do you need?
I'm trying to use the yt --transcript function with a language other than English (French). The transcript contains formatting problems:
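The symptoms match what happens when UTF-8 text is decoded with a single-byte Windows codec. The sketch below reproduces the effect; the choice of cp1252 is an assumption inferred from the replacement table in the update at the top, and make_mojibake is a hypothetical helper, not part of yt.

```python
def make_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


print(make_mojibake("été"))  # prints: Ã©tÃ©
```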
I tried to fix it with 2 different approaches:
From my attempts, the best fix for now is using a pattern. I tried it two ways: describing the fix in natural language, and asking the model to mimic a given Python function. The second worked best, but it's not perfect and the results are inconsistent. It often fixes the formatting problem, but sometimes it replaces a few words with other words. The punctuation is also usually missing, or not as good as it is in English. These problems happen even when YouTube has a correct subtitle file in French. Sometimes it doesn't work at all and returns comments about the Python function instead.
The results are encouraging, but unpredictable. I'm willing to improve the pattern, but maybe it's not the right approach. Any suggestions are welcome.
Here is an example of a command I used:
yt --transcript --lang 'fr' https://www.youtube.com/watch?v=oiKj0Z_Xnjc | fabric --model llama3:latest -sp convert_fr
And the output:
Finally, here is the content of my custom pattern "convert_fr":
cat system.md