ali1234 / vhs-teletext

Software to recover teletext data from VHS recordings.
GNU General Public License v3.0

Improve support for international character sets #74

Open ali1234 opened 1 year ago

ali1234 commented 1 year ago

I have a plan to do this by using the codecs module to register all the different teletext charset options as codecs, so that packet bytes can be converted directly to Unicode with e.g. bytes.decode('teletext-latin-1') to indicate the Latin G0/G1 set with national option 1.
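A minimal sketch of how that registration could look, not the project's implementation: the table below covers only ASCII plus one illustrative national substitution (0x23 -> '£'), where a real codec would carry the full G0/G1 tables for each option.

```python
import codecs

# Illustrative mapping: ASCII passthrough plus one national option
# substitution. A real table covers the whole G0/G1 repertoire.
TELETEXT_LATIN_1 = {i: chr(i) for i in range(0x00, 0x80)}
TELETEXT_LATIN_1[0x23] = '\u00a3'  # pound sign, national option example
REVERSE = {v: k for k, v in TELETEXT_LATIN_1.items()}

def _decode(data, errors='strict'):
    # Mask the parity bit; spacing attributes 0x00-0x1f pass through.
    return ''.join(TELETEXT_LATIN_1[b & 0x7f] for b in bytes(data)), len(data)

def _encode(text, errors='strict'):
    return bytes(REVERSE[c] for c in text), len(text)

def _search(name):
    # codecs.lookup() normalizes names, so accept both spellings.
    if name in ('teletext-latin-1', 'teletext_latin_1'):
        return codecs.CodecInfo(_encode, _decode, name='teletext-latin-1')
    return None

codecs.register(_search)
```

After registration the usual str/bytes methods work directly, e.g. `b'\x23Hi'.decode('teletext-latin-1')` gives `'£Hi'`.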

This can also go the other way with str.encode('teletext-latin-1'), and the round trip bytes -> str -> bytes should reproduce the original bytes, with the caveat that any parity errors in the original will be silently "fixed". It may be possible to control this behaviour, allowing the user to raise an exception on parity errors instead.
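The optional parity behaviour could follow the usual codec `errors` convention. A sketch, assuming odd parity per byte as in teletext transmission: `errors='strict'` raises, anything else masks the parity bit off and so "fixes" the byte.

```python
def odd_parity_ok(b: int) -> bool:
    # Teletext data bytes carry odd parity: an odd number of set bits.
    return bin(b).count('1') % 2 == 1

def decode_with_parity(data: bytes, errors: str = 'strict') -> str:
    out = []
    for i, b in enumerate(data):
        if errors == 'strict' and not odd_parity_ok(b):
            raise UnicodeDecodeError('teletext-latin-1', bytes(data),
                                     i, i + 1, 'parity error')
        out.append(chr(b & 0x7f))  # non-strict: silently drop parity bit
    return ''.join(out)
```

For example, 0xC1 has valid odd parity and decodes to 'A' in either mode, while a corrupted 0x41 raises under 'strict' but still decodes to 'A' otherwise.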

Note that, when decoding, teletext spacing attributes 0x00 - 0x1f will be mapped to the Unicode C0 range 0x00 - 0x1f; in other words they will be left untouched. When encoding a string that mixes G0 (alphanumerics) and G1 (mosaics), the appropriate control codes could be inserted into the byte stream, but again this behaviour could be optional and it could instead raise an exception.
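A hedged sketch of that optional encode behaviour. The switching codes chosen here (0x17 mosaic white, 0x07 alpha white) and the private-use mosaic range are illustrative assumptions, not the issue's final design; a mosaic-initial string also gets a leading switch code.

```python
# Hypothetical PUA slots standing in for the G1 mosaic characters.
MOSAICS = {chr(0xEE00 + i): i for i in range(0x20, 0x40)}

def encode_mixed(text: str, on_mix: str = 'insert') -> bytes:
    out = bytearray()
    in_mosaic = False
    for c in text:
        want_mosaic = c in MOSAICS
        if want_mosaic != in_mosaic:
            if on_mix == 'strict':
                raise ValueError('string mixes G0 and G1 characters')
            out.append(0x17 if want_mosaic else 0x07)  # switch code (illustrative)
            in_mosaic = want_mosaic
        out.append(MOSAICS[c] if want_mosaic else ord(c))
    return bytes(out)
```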

Once this is done, the Printer class and its subclasses can be simplified to just use codecs; they then only have to worry about converting the C0 characters to ANSI or HTML, or removing them entirely (i.e. replacing them with spaces or the currently held mosaic).
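A sketch of what would remain for a codec-based Printer: rendering or stripping the C0 controls left in the decoded string. The colour table is an illustrative subset, relying on the teletext alpha colour codes 0x00 - 0x07 lining up with ANSI SGR colours 30 - 37.

```python
import re

# Illustrative: alpha colour controls map straight onto SGR colours.
ANSI = {c: f'\x1b[{30 + c}m' for c in range(0x00, 0x08)}

def to_plain(text: str) -> str:
    # Simplest option: replace every C0 control with a space.
    return re.sub(r'[\x00-\x1f]', ' ', text)

def to_ansi(text: str) -> str:
    # Emit an SGR sequence for colour controls, drop other controls.
    return ''.join(ANSI.get(ord(c), '') if ord(c) < 0x20 else c
                   for c in text)
```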

It turns out that multiple different Unicode mappings may be needed, because some environments can only render characters from the Basic Multilingual Plane, i.e. code points <= 0xffff. The mosaic characters in Unicode fall outside this plane. It should be possible to map all the alphanumerics, though. Note that ZVBI uses a mapping that places Arabic alphanumerics in the private use area.
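The distinction is easy to test for: Unicode's block mosaics live in Symbols for Legacy Computing at U+1FB00 onward, outside the BMP, while a private-use fallback of the kind ZVBI uses stays inside it.

```python
def fits_in_bmp(text: str) -> bool:
    # True if every code point is renderable by BMP-only environments.
    return all(ord(c) <= 0xFFFF for c in text)

sextant = '\U0001FB00'  # BLOCK SEXTANT-1, outside the BMP
pua = '\uE000'          # first private use area code point, inside the BMP
```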

Steps: