Add import strategy to handle encoding 'ISO-8859-1'

McArcady commented 4 years ago

On Linux / Python 3.7, importing the file 'gamelog.txt' requires the encoding ISO-8859-1. This patch adds an import strategy that sets this encoding explicitly.

It also seems to work on Linux / Python 2.7. NOT tested on Windows or macOS.

cryzed commented 4 years ago

Are you sure that's the encoding Dwarf Fortress actually uses to write the game log? Do you have a source for that? Does it actually change based on the configured system locale? From what I understood, the game log is encoded using CP850.

McArcady commented 4 years ago

Well, I'm not so sure anymore! DF seems to create gamelog.txt with charset 'us-ascii':

$ file -i df_47_04_linux/gamelog.txt
df_47_04_linux/gamelog.txt: text/x-diff; charset=us-ascii

My old gamelogs have been imported over and over, and for some reason they all seem to have a different charset now:

$ find /opt -name gamelog.txt | xargs file -i
/opt/lnp-0.13/df_44_04_linux/gamelog.txt:                  application/octet-stream; charset=binary
/opt/LinuxDwarfPack-0.47.03-r1/df_47_03_linux/gamelog.txt: text/x-diff; charset=utf-8
/opt/LinuxDwarfPack-0.44.12-4/df_44_12_linux/gamelog.txt:  application/octet-stream; charset=binary
/opt/LinuxDwarfPack-0.47.04-r1/df_47_04_linux/gamelog.txt: text/x-diff; charset=iso-8859-1
/opt/LinuxDwarfPack-0.47.04-r2/df_47_04_linux/gamelog.txt: text/x-diff; charset=utf-8
/opt/LinuxDwarfPack-0.47.04-r3/df_47_04_linux/gamelog.txt: text/x-diff; charset=iso-8859-1

Result: most of them can no longer be read with the Python 3 default encoding 'utf8', and that's why the import failed.

Maybe I could adapt the patch this way, in strat_text_prepend:

Try reading the file with encoding 'utf8' by default.
If read() fails, try opening the file again with encoding 'ISO-8859-1'.

Then we would not need this clumsy strategy 'text_prepend_iso_8859_1'.
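
The two steps proposed above could be sketched roughly as follows. This is only an illustration, not the code from the patch: the helper name `read_text_with_fallback` and its parameters are made up for this example, and `io.open` is used on the assumption that the same code should run on both Python 2.7 and Python 3.

```python
import io


def read_text_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes the whole file.

    Hypothetical helper for illustration; not part of python-lnp.
    """
    last_error = None
    for encoding in encodings:
        try:
            # newline="" keeps the original line endings untouched.
            with io.open(path, "r", encoding=encoding, newline="") as f:
                return f.read(), encoding
        except UnicodeDecodeError as err:
            # Remember the failure and try the next candidate encoding.
            last_error = err
    if last_error is None:
        raise ValueError("no encodings given")
    raise last_error
```

Because ISO-8859-1 accepts every byte value, the second attempt cannot fail with a decode error, so the fallback always produces some text. Whether that text shows the intended characters is a separate question.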

Pidgeot commented 4 years ago

Unless it changed recently, gamelog.txt ought to be encoded as code page 437 (i.e. US DOS code page), since that's what the default graphics set is based on and it's just a direct dump of the strings in memory.

However... we're not going to actually be manipulating the text here, so really, it doesn't matter which encoding we pick, as long as it defines a character for every byte value (0-255) and the same encoding is used for both reading the input files and writing the output files.
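
A quick way to check the "full byte range" condition is to decode all 256 byte values and encode them back. The snippet below is a small sketch of that check using Python's built-in codecs; the function name `round_trips` is invented for this example.

```python
# Every possible byte value, 0x00 through 0xFF.
ALL_BYTES = bytes(bytearray(range(256)))


def round_trips(encoding, data=ALL_BYTES):
    """Return True if decoding and then re-encoding `data` gives back the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The codec has no mapping for at least one byte (or byte sequence).
        return False


for name in ("latin-1", "cp437", "cp850", "cp1252", "utf-8"):
    print(name, round_trips(name))
```

With Python's codecs, latin-1, cp437 and cp850 pass this check, while utf-8 and cp1252 do not: arbitrary bytes are not valid UTF-8, and Python's cp1252 codec leaves a few byte values undefined.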

In theory we might have problems if a file changes code page at some point, but then, we can't handle that without knowing the correct encodings anyway, so that's a bridge we'd have to cross when we get to it.

For now, I'm just going to change this function to always use latin1, since that's good enough for what we need.
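
For illustration, a prepend step that always uses latin1 might look like the sketch below. It is not the actual python-lnp function: the name `prepend_text`, its parameters, and the assumption that "prepend" means putting the imported file's contents in front of the existing file are all hypothetical.

```python
import io


def prepend_text(src_path, dest_path, encoding="latin-1"):
    """Put the contents of src_path in front of whatever dest_path already holds.

    Hypothetical sketch; latin-1 maps each byte to one character and back,
    so the bytes written are exactly the bytes that were read.
    """
    # newline="" disables newline translation on both read and write.
    with io.open(src_path, "r", encoding=encoding, newline="") as f:
        imported = f.read()
    with io.open(dest_path, "r", encoding=encoding, newline="") as f:
        existing = f.read()
    with io.open(dest_path, "w", encoding=encoding, newline="") as f:
        f.write(imported + existing)
```

Since the text is never interpreted, only concatenated, the result is byte-for-byte the same as joining the two files in binary mode, whatever encoding the game originally used.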

Pidgeot / python-lnp

Add import strategy to handle encoding 'ISO-8859-1' #171