Open dingedi opened 1 year ago
Is the DOCTYPE being passed through translate-html, or is it going through the seq2seq model? I think BeautifulSoup should preserve the DOCTYPE, but the seq2seq model probably doesn't.
I first tried via LibreTranslate; I just tried with translate-html directly and the problem is the same.
I think the problem comes from itag_of_soup:
def translate_html(underlying_translation, html):
    soup = BeautifulSoup(html, "html.parser")
    print('SOUP: ', soup)
    itag = itag_of_soup(soup)
    print('ITAG: ', itag)
    translated_tag = translate_tags(underlying_translation, itag)
    translated_soup = soup_of_itag(translated_tag)
    return translated_soup
Result:
SOUP: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<p>hello</p>
ITAG: <class 'argostranslate.tags.Tag'> "['html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"', <argostranslate.tags.Tag object at 0x7f4ebe57f750>]"
html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd"<p>Hola.</p>
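The DOCTYPE is broken in the translated output: the leading `<!DOCTYPE ` and the closing `>` are lost, so only the text inside the declaration remains.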
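The loss of the `<!DOCTYPE ...>` wrapper can be reproduced with BeautifulSoup alone. This is a minimal sketch, not the actual `itag_of_soup` code (which is not shown in this thread): BeautifulSoup parses the declaration into a `Doctype` node, a string subclass that holds only the inner text. The assumption here is that `itag_of_soup` treats that node like ordinary text, which would match the ITAG output above.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype

HTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">'
    "<p>hello</p>"
)

soup = BeautifulSoup(HTML, "html.parser")

# The first top-level node is the DOCTYPE declaration.
doctype = soup.contents[0]
is_doctype = isinstance(doctype, Doctype)

# Converted to a plain string, the node carries only the inner text,
# without the "<!DOCTYPE " prefix or the closing ">".
inner_text = str(doctype)

# output_ready() puts the prefix and suffix back.
full_declaration = doctype.output_ready()

print(is_doctype)
print(inner_text)
print(full_declaration)
```

If this is indeed the cause, checking `isinstance(node, Doctype)` while walking the soup and keeping such nodes out of the translated text would be one way to avoid it.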
This is just one example with this particular DOCTYPE, but in general, as soon as the document has a complex DOCTYPE, the output breaks.
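Until the parsing is fixed, one possible workaround is to strip the DOCTYPE before translating and put it back afterwards. The sketch below uses only the standard library; `split_doctype`, `translate_preserving_doctype`, and `fake_translate` are hypothetical names made up for illustration and are not part of argostranslate or translate-html. The regex assumes the declaration contains no `>` before its end, so a DOCTYPE with an internal subset would not be handled.

```python
import re
from typing import Callable, Tuple

# Matches a leading DOCTYPE declaration plus any whitespace after it.
DOCTYPE_RE = re.compile(r"^\s*<!DOCTYPE[^>]*>\s*", re.IGNORECASE)


def split_doctype(html: str) -> Tuple[str, str]:
    """Return (doctype, rest); doctype is "" when none is present."""
    match = DOCTYPE_RE.match(html)
    if match is None:
        return "", html
    return match.group(0), html[match.end():]


def translate_preserving_doctype(translate: Callable[[str], str], html: str) -> str:
    """Translate everything after the DOCTYPE and re-attach it unchanged."""
    doctype, body = split_doctype(html)
    return doctype + translate(body)


def fake_translate(html: str) -> str:
    # Stand-in for the real translation call, for demonstration only.
    return html.replace("hello", "Hola.")


sample = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">\n'
    "<p>hello</p>"
)
result = translate_preserving_doctype(fake_translate, sample)
print(result)
```

In real use, the `translate` argument would wrap whatever function performs the HTML translation, so the DOCTYPE never reaches the tag conversion step.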