Boruch-Baum / emacs-crossword

Play/Download crossword puzzles in Emacs
GNU General Public License v3.0
67 stars 4 forks source link

Should open .puz with iso-8859-1 encoding interpretation #12

Open piyo opened 3 years ago

piyo commented 3 years ago

Thanks for this enjoyable Emacs package.

I notice that puz files are not opened with the correct encoding. At best (1) some hints and copyright symbol will be garbled. At worst (2) the puz file's width cannot be read correctly and crossword-summary will abort.

Recipe for duplicating (1):

  1. On Debian 10, Emacs 27.1 self-compiled.
  2. Start emacs with export LANG=ja_JP.UTF-8 ; export LANGUAGE=ja_JP:ja ; emacs -nw
  3. Download Matt Jones's crossword for 2021-01-21 using crossword-download.
  4. Open crossword-summary.
  5. Open the puzzle file by pressing Enter.
  6. Notice the copyright symbol is not correct. Also the hint for Across 24 does not show the é character, but \xe9.

Recipe for duplicating (2): Cannot re-verify at the moment. It was on an Ubuntu 16.04 with Emacs 27.1 self-compiled.

Possible Cause: puz files are opened with an encoding which is user-config dependent.

Possible solution: Locally set the coding-system-for-read before using insert-file-contents. Define a package default variable that holds the desired coding for known puzzle sources.

(defvar crossword-puzzle-file-coding 'iso-8859-1)

;; in crossword--start-game-puz
(let ((coding-system-for-read crossword-puzzle-file-coding))
  (insert-file-contents (setq crossword-file puz-file)))

;; in crossword--summary-data-puz
(let ((coding-system-for-read crossword-puzzle-file-coding))
  (insert-file-contents file))

Workaround 1: Define the default coding for read for "*.puz" files.

(with-eval-after-load 'crossword
  (add-to-list 'file-coding-system-alist '("\\.puz\\'" . iso-8859-1)))

Version: crossword.el was version crossword-20210126.1409 from MELPA or Package-Commit: a0bbd801c0e1d9c93561f8f610cc76b570a939ca.

Other notes: Once puz files are decoded correctly, puz-emacs files will be saved with the correct string data.

Reference: According to https://code.google.com/archive/p/puz/wikis/FileFormat.wiki the (de facto?) encoding of puzzle files is iso-8859-1.

All strings are encoded in ISO-8859-1 and end with a NUL.

Boruch-Baum commented 3 years ago

On 2021-01-28 06:04, piyo wrote:

Thanks for this enjoyable Emacs package.

Quite welcome!

I notice that puz files are not opened with the correct encoding. At best (1) some hints and copyright symbol will be garbled.

I thought I fixed that a while ago, What version of the package are you using?

At worst (2) the puz file's width cannot be read correctly and crossword-summary will abort.

That I haven't seen before. Now for sure I want to read the rest of the report...

Recipe for duplicating (1):

  1. On Debian 10, Emacs 27.1 self-compiled.

Very nice. I'm using a version of debian, and am stuck with the official 26 package (although I have a development snapshot deb from 2020-09~).

OFF-TOPIC: Are you seeing screen-flicker when using the crossword package with emacs 27?

2. Start emacs with export LANG=ja_JP.UTF-8 ; export LANGUAGE=ja_JP:ja
   ; emacs -nw

It looks like I didn't need to create the locale. I guess because of my development work for emacs-w3m...

Another off-topic question: Do you have non-English crossword puzzle files? I'd like to get a source of them in order to further test / develop the package.

3. Download Matt Jones's crossword for 2021-01-21 using
   crossword-download.
4. Open crossword-summary.
5. Open the puzzle file by pressing Enter.
6. Notice the copyright symbol is not correct. Also the hint for
   Across 24 does not show the é character, but \xe9.

I do see what may be a Japanese character in place of the copyright symbol, so that's a bug; however, I do get the correct é character in clue 24 across.

Recipe for duplicating (2): Cannot re-verify at the moment. It was on an Ubuntu 16.04 with Emacs 27.1 self-compiled.

Maybe a different variant of the JP locale?

Possible Cause: puz files are opened with an encoding which is user-config dependent.

Sounds right.

Possible solution: Locally set the coding-system-for-read before using insert-file-contents. ...

I've tried a version of this out, and it does the job. I don't see any untoward side effects, so I would like to push the commit, crediting you. How do you want to listed in the Changelog file (ie. name, email)?

My differences are to use a defcustom instead of defvar to be more user-friendly if it needs to be changed on-the-fly, and to set the locally-scoped definition in the pre-existing 'let's.

I expect this to get more complex as I get more feedback from people using the package for non-English language puzzles. It may be that I eventually need to do something like add an element to 'crossword-download-puz-alist' for encodings and then upon reads compare file names to the file-specs in that list in order to get that file's encoding.

Reference: According to [1]https://code.google.com/archive/p/puz/wikis/FileFormat.wiki the (de facto?) encoding of puzzle files is iso-8859-1.

I don't think I ever saw that document before. Thank you. It seems a much more thorough analysis than anything I've read before. The page doesn't display in emacs-w3m because the page is so javascript-dependent, but for this I went ahead and used another browser, and it looks quite worth the effort.

-- hkp://keys.gnupg.net CA45 09B5 5351 7C11 A9D1 7286 0036 9E45 1595 8BC0

piyo commented 3 years ago

I see the commit you added for this issue. You may want to git-amend (or magit-reword) it after this reply, sorry.

At worst (2) the puz file's width cannot be read correctly and crossword-summary will abort.

I have the exact steps to repro this problem (starting from emacs -Q), and it has nothing to do with LC_ALL and LANGUAGE environment variables. However your commit should essentially fix the problem. Here is the code:

;; Steps to reproduce:

;; Command line to start Emacs:
;; env LANG=en_US.UTF-8 LANGUAGE=en_US:en emacs -Q -nw

;; run each command one at a time, i.e C-x C-e (eval-last-sexp)
(progn
  (setq crossword-save-path (expand-file-name (substitute-in-file-name "$HOME/.cache/emacs/var/crossword/")))
  ;; Previously, the puzzle 2020-12-31 from Matt Jones was downloaded via (crossword-download).
  ;; crossword-save-path already contains "jz201231.puz" and nothing else
  (setq dbg-crossword-load-path (substitute-in-file-name "${HOME}/.local/share/emacs/27/site-lisp/elpa/crossword-20210126.1409"))
  (setq load-path (append load-path (list dbg-crossword-load-path)))
  (load (expand-file-name "crossword-autoloads.el" dbg-crossword-load-path))

  ;; user setting:
  (set-coding-system-priority
   'utf-8
   'cp932 ;; this encoding wins for "jz201231.puz"
   'euc-jp
   'iso-2022-jp
   'iso-8859-1 ;; however, *.puz needs this, but is not evaluated
   )
  (toggle-debug-on-error)
  ;; Expect: shows 1 entry.
  ;; Actual: shows no entries. error dialog.
  (crossword-summary) ;; bug repro'd here.

  ;; Additional debug
  ;; (dired crossword-save-path)
  ;; (dired dbg-crossword-load-path)

  ;; Expect: summary forms about jz201231.puz.
  ;; Actual: (nil)
  (find-file-other-window (expand-file-name "puz-emacs.data" crossword-save-path))

  (find-file-other-window (expand-file-name "jz201231.puz" crossword-save-path))

  ;; Expect: contains "© 2020 Matt Jones"
  ;; Actual: contains "ゥ 2020 Matt Jones"

  ;; Expect: 15
  ;; Actual: 80
  (with-current-buffer "jz201231.puz"
    (char-after 45))

  ;; Expect: iso-latin-1-unix (iso-8859-1-unix).
  ;; Actual: japanese-cp932-unix.
  (with-current-buffer "jz201231.puz"
    (require 'mule-diag)
    (if (local-variable-p 'buffer-file-coding-system)
        (print-coding-system-briefly buffer-file-coding-system)
      (princ "Not set locally, use the default.\n")))
  )

;; Steps for expected results

;; command line to start emacs:
;; env LANG=en_US.UTF-8 LANGUAGE=en_US:en emacs -Q -nw

(progn
  (setq crossword-save-path (expand-file-name (substitute-in-file-name "$HOME/.cache/emacs/var/crossword/")))
  (setq dbg-crossword-load-path (substitute-in-file-name "${HOME}/.local/share/emacs/27/site-lisp/elpa/crossword-20210126.1409"))
  (setq load-path (append load-path (list dbg-crossword-load-path)))
  (load (expand-file-name "crossword-autoloads.el" dbg-crossword-load-path))
  ;; do not (set-coding-system-priority ...)
  ;; or:
  (set-coding-system-priority
     'utf-8
     'iso-8859-1 ;; this encoding wins for "jz201231.puz"
     'cp932  ;; this encoding is never evaluated
     'euc-jp
     'iso-2022-jp
     )
  (toggle-debug-on-error)
  (crossword-summary) ;; no problem, shows 1 entry.
  )

What version of the package are you using?

crossword.el was version crossword-20210126.1409 from MELPA or Package-Commit: a0bbd80.

I do see what may be a Japanese character in place of the copyright symbol, so that's a bug; however, I do get the correct é character in clue 24 across.

This is strange, I either get both wrong, or with the workaround, both correct. Please see the repro steps above including the part where I open the puz file.

I would like to push the commit, crediting you. How do you want to listed in the Changelog file (ie. name, email)?

Thank you for the credit in that commit, that's fine.

My differences are to use a defcustom instead of defvar to be more user-friendly if it needs to be changed on-the-fly, and to set the locally-scoped definition in the pre-existing 'let's.

Looks good for now. I would rather reduce the effect of coding-system-for-read but then you probably don't need to read other files in the same function, right? Also I would just make the functionality of reading a puz file into a temporary buffer into a separate function so that people can defadvice just that part.

I expect this to get more complex as I get more feedback from people using the package for non-English language puzzles. It may be that I eventually need to do something like add an element to 'crossword-download-puz-alist' for encodings and then upon reads compare file names to the file-specs in that list in order to get that file's encoding.

I was going to speculate but nah. YAGNI, until you do. I suspect supporting other single-byte encodings will be easier than multibyte encodings, though.

Reference: According to [1]https://code.google.com/archive/p/puz/wikis/FileFormat.wiki the (de facto?) encoding of puzzle files is iso-8859-1. I don't think I ever saw that document before. Thank you. It seems a much more thorough analysis than anything I've read before.

It's a link in a link in your README.md. -> [2] http://fileformats.archiveteam.org/wiki/PUZ_%28crossword_puzzles%29 -> https://code.google.com/archive/p/puz/wikis/FileFormat.wiki

OFF-TOPIC: Are you seeing screen-flicker when using the crossword package with emacs 27?

I am using a terminal emulator, so no. I.e. Windows 10 -> Msys mintty.exe -> Debian 10 -> screen -> Emacs 27.1 -> crossword.el. Glorious wide vision,195 by 60 with HiDPI fonts. ;-P

Another off-topic question: Do you have non-English crossword puzzle files? I'd like to get a source of them in order to further test / develop the package.

No. I was curious and I searched in the language I can use, Japanese. The first couple of hits just use hiragana and katakana, but of course the hints use the full Japanese language (kanji). Like http://cross-word.info/ (I lost a couple of minutes just playing with this! heh) If I could play these with the excellent interface your package provides I would be happy. But alas it would need multibyte support in this modern era.

Boruch-Baum commented 3 years ago

On 2021-01-29 13:07, piyo wrote:

I see the commit you added for this issue. You may want to git-amend (or magit-reword) it after this reply, sorry.

I haven't yet gotten to it. Maybe now is a good time...

Thank you for the credit in that commit, that's fine.

OK. You can change your mind at any time.

Also I would just make the functionality of reading a puz file into a temporary buffer into a separate function so that people can defadvice just that part.

Another thing to look at maybe now.

I suspect supporting other single-byte encodings will be easier than multibyte encodings, though.

For this very reason, I would very much like to get sample puz files found 'in the wild' for non-English solutions.

 Reference: According to
 [1][2]https://code.google.com/archive/p/puz/wikis/FileFormat.wiki

It's a link in a link in your README.md.

Ha. I might have seen it in year 2018 and forgotten it, or maybe I just ignored because it requires JS. Anyway, I did look at it, and found that it's html was mangling what were clearly intended to be tables and code blocks, so I reformatted the content as an org-mode file, now bundled in the git repository.

I was curious and I searched in the language I can use, Japanese. ... If I could play these with the excellent interface your package provides I would be happy. But alas it would need multibyte support in this modern era.

No commitment, but that is of interest to me to do.

I've just a few minutes ago pushed several commits. One is a bug-fix unrelated to our discussion.

Regards,

-- hkp://keys.gnupg.net CA45 09B5 5351 7C11 A9D1 7286 0036 9E45 1595 8BC0