argosopentech / translate-html

Translate HTML using Argos Translate
MIT License

doctype is broken after translating #12

Open dingedi opened 1 year ago

dingedi commented 1 year ago
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">

is turned into

html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"

This doctype is just one example. In general, any doctype more complex than a bare `<!DOCTYPE html>` loses its `<!DOCTYPE ...>` wrapper, which breaks the resulting page.

PJ-Finlay commented 1 year ago

Is this for passing the DOCTYPE tag through translate-html or is it going through the seq2seq model? I think the soup library should maintain the DOCTYPE but the seq2seq probably doesn't.
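
To check the Beautiful Soup half of that question, here is a minimal sketch (assuming `bs4` is installed and using the same `html.parser` backend as translate-html). The variable names are illustrative only. It shows that the parser keeps the DOCTYPE as a `Doctype` node, but that the node's plain string value is only the inner text, without the `<!DOCTYPE ...>` wrapper.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype, NavigableString

html = ('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">'
        '<p>hello</p>')

soup = BeautifulSoup(html, "html.parser")

# The DOCTYPE survives parsing as a dedicated Doctype node.
doctype = next(node for node in soup.contents if isinstance(node, Doctype))

# Doctype is a NavigableString (a str subclass) that stores only the inner text.
inner_text = str(doctype)

# output_ready() is used when serializing and restores the <!DOCTYPE ...> wrapper.
serialized = doctype.output_ready()

print(isinstance(doctype, NavigableString))
print(inner_text)
print(serialized)
```

So `str(soup)` still contains the full DOCTYPE, but any code that walks the tree and treats every `NavigableString` as ordinary text would only see the inner text.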

dingedi commented 1 year ago

I first tried via LibreTranslate. I have now tried with translate-html directly and the problem is the same.

dingedi commented 1 year ago

I think the problem comes from itag_of_soup. I added debug prints to translate_html:

from bs4 import BeautifulSoup

def translate_html(underlying_translation, html):
    soup = BeautifulSoup(html, "html.parser")
    print('SOUP: ', soup)  # debug: parsed document, DOCTYPE still intact
    itag = itag_of_soup(soup)
    print('ITAG: ', itag)  # debug: intermediate tag tree passed to the translator
    translated_tag = translate_tags(underlying_translation, itag)
    translated_soup = soup_of_itag(translated_tag)
    return translated_soup

Result (the last line is the returned translation):

SOUP:  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<p>hello</p>
ITAG:  <class 'argostranslate.tags.Tag'> "['html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"', <argostranslate.tags.Tag object at 0x7f4ebe57f750>]"
html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"<p>Hola.</p>
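
The ITAG line above shows the DOCTYPE already reduced to a plain string before translation. Until that is fixed inside the library, one possible workaround is to strip the DOCTYPE before translating and put it back afterwards. The sketch below is only an illustration, not the project's fix. `split_doctype` and `translate_preserving_doctype` are hypothetical helper names, and `translate_body` stands in for whatever callable performs the translation (for example a wrapper around `translate_html`). The regular expression assumes the DOCTYPE contains no `>` character, which holds for typical HTML doctypes but not for declarations with an internal subset.

```python
import re

# Matches an optional run of whitespace followed by a DOCTYPE declaration
# at the very start of the document. Assumes no '>' inside the declaration.
DOCTYPE_RE = re.compile(r"\s*<!DOCTYPE[^>]*>", re.IGNORECASE)


def split_doctype(html):
    """Return (doctype, rest); doctype is '' when the document has none."""
    match = DOCTYPE_RE.match(html)
    if match is None:
        return "", html
    return match.group(0), html[match.end():]


def translate_preserving_doctype(translate_body, html):
    """Translate everything after the DOCTYPE and re-attach it unchanged.

    translate_body is a hypothetical callable taking an HTML string and
    returning the translated document (anything convertible with str()).
    """
    doctype, body = split_doctype(html)
    return doctype + str(translate_body(body))
```

With translate-html this could be called as `translate_preserving_doctype(lambda body: translate_html(underlying_translation, body), html)`, so the DOCTYPE never passes through `itag_of_soup` at all.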