RihanWu / vocabtool

A tool for learning vocabulary.
MIT License

Support for Web Page Encodings #6

Open StephDC opened 8 years ago

StephDC commented 8 years ago

This program needs to fetch web pages and extract translations from them.

Given that a web page has to be decoded correctly before it is handed to Beautiful Soup for parsing, detecting or knowing which encoding the website uses is important.

Generally there are a few places where an HTML page can declare its character encoding (a combined sketch of checking them follows the two examples below): the HTTP header

Content-Type: text/html; charset=<charset>

and the HTML head

<html>
    <head>
        <meta charset=<charset> />
    </head>
</html>
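
For reference, a minimal sketch of checking both locations with Requests and Beautiful Soup could look like the following (the detect_encoding helper is only an illustration, not part of the current code):

import requests
from bs4 import BeautifulSoup

def detect_encoding(url):
    response = requests.get(url)
    # 1. HTTP header: Requests parses the charset out of Content-Type for us.
    header_encoding = response.encoding
    # 2. HTML head: look for a <meta charset=...> tag in the document itself.
    soup = BeautifulSoup(response.content, "html.parser")
    meta = soup.find("meta", charset=True)
    meta_encoding = meta["charset"] if meta else None
    return header_encoding, meta_encoding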

Also, if the website does not declare the encoding anywhere, we might still be able to guess it based on the following context.

Another aspect of the encoding problem is that this tool is currently designed to help translate Chinese (zh_CN + zh_TW) <=> English (en_US + en_UK), so the encodings we need to handle are limited to those languages. That leaves only a small set of candidate encodings, which we could try one by one to figure out which is in use.

Right now this program uses Requests and only looks at the encoding information in the HTTP header. It then uses Requests.Response.apparent_encoding to guess the encoding (backed by chardet, according to http://requests.readthedocs.org/en/master/api/#requests.Response.apparent_encoding).
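
Roughly, the current flow amounts to something like this (a sketch with a placeholder URL, not the actual module code):

import requests
from bs4 import BeautifulSoup

response = requests.get("http://example.com/dictionary")
# response.encoding is taken from the HTTP Content-Type header;
# apparent_encoding is chardet's guess based on the raw bytes.
if response.encoding is None:
    response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "html.parser")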

If we are not using Requests / chardet, or if chardet does not work (it seemed not to work with SHIFT-JIS from http://hanfucw.com:9676/cgi-bin/sjis.tail, and maybe other encodings), do we need to guess the encoding ourselves? If we do, how about making a list of encodings to guess with, e.g. based on the Chinese <=> English context?
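
One possible shape for such a trial-and-error fallback, assuming the Chinese <=> English context, would be (the candidate list is only illustrative):

# Order matters: broad encodings such as gb18030 rarely raise on arbitrary
# bytes, so try the stricter ones first.
CANDIDATES = ["ascii", "utf-8", "big5", "gb18030"]

def guess_decode(raw_bytes):
    for enc in CANDIDATES:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters so parsing can continue.
    return raw_bytes.decode("utf-8", errors="replace"), "utf-8"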

RihanWu commented 8 years ago

Since I plan to make this program extensible so it can help learn vocabulary in other languages, simply trying the encodings used by Chinese websites may not be a permanent solution. How about using the existing mechanism to decode the data the program can figure out on its own, and manually adding encoding information to the config file for the sources it cannot guess? Since the sources themselves are added manually, this should work.
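
For example, a per-source override could look roughly like this (the field name and helper below are hypothetical, just to show the idea):

# In the source description file, an optional "encoding" field per source, e.g.
#   { "example_dict": { "url": "http://example.com/lookup", "encoding": "shift_jis" } }

def decode_source(raw_bytes, source_config, guessed_encoding):
    # Prefer a manually specified encoding; otherwise fall back to the guessed one.
    encoding = source_config.get("encoding") or guessed_encoding
    return raw_bytes.decode(encoding)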

StephDC commented 8 years ago

That is a good idea. We can leave an empty field in our source description file, so that if it is necessary, we can manually specify the encoding of the website.

Some modifications to the new urllibRequest will be needed in order to allow such a manual specification.

RihanWu commented 8 years ago

How about passing an optional encoding argument to get, and then letting get pass it on to the result? Since I actually removed the use of .text, we do not need the Response class any more.
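
A possible shape for that change (just a sketch; the real get in urllib_requests may differ):

import urllib.request

def get(url, encoding=None):
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()
        # Prefer the caller-supplied encoding, then the HTTP header, then UTF-8.
        charset = encoding or resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)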

RihanWu commented 8 years ago

What do you think about the changes I made to the urllib_requests module on the add_unittest branch? @StephDC