`postscript` is read in as bytes but expects to be written out as a string - Githubissues

alexdej / puzpy

Python library for reading and writing across lite crossword puzzle .puz files.

MIT License

112 stars 32 forks source link

`postscript` is read in as bytes but expects to be written out as a string #18

Closed thisisparker closed 5 years ago

thisisparker commented 5 years ago

When reading a file that ends with "extra garbage" (as described in L207-208), that data is stored internally as bytes. It appears that the tobytes function later expects it to be a string.

I'm not sure what the purpose of postscript is—whether it's just for roundtripping insignificant data, or what—so I don't know for sure whether the best approach is

to change the return value of PuzzleBuffer.read_to_end() to be a string (as is the case with PuzzleBuffer.read_until())
to convert postscript to a string when it's assigned in Puzzle.load(), or
to continue to store it as bytes and just drop the .encode(ENCODING) from the tobytes() step on L273.

If one of these is preferable, I'd be happy to submit a PR to accomplish it. Even if none are suitable, I still would encourage moving the self.tobytes() step outside of the open context manager in Puzzle.save(), because in this case a failure at that step is causing the file to be opened empty and could result in data loss. (I've got a user experiencing that on an older set of puzzles.)

thisisparker commented 5 years ago

Oh also: postscript is initialized as an empty string (not bytes) in new blank puzzles, so I'm inclined to lean towards one of the first two options.

alexdej commented 5 years ago

I appreciate the detailed report and analysis here. IMO read_to_end() is likely correct to return bytes not string. But I am tempted to convert postscript to string when the puzzle is loaded as you suggested. There are some potential compatibility issues changing the type now since it's hard to know what's in this field in the wild as I don't think I've seen it specified anywhere. What do you think about guarding the conversion to string and falling back to bytes? (then similarly in tobytes, detecting bytes vs string and handling appropriately). PR absolutely welcomed, and for the save issue you mentioned too, if you have time! Thanks for your contribution!

thisisparker commented 5 years ago

The more I think about it, the more inclined I am to have it follow preamble, which has changed between the latest release to pypi and now, given that they appear to play pretty similar roles on each end of the file. In that spirit, I'm going to submit a PR that treats it as bytes throughout, and catches it at save if it's been written as a string for some reason. (Sorry, I know that's the opposite of my suggestion from yesterday.)

I suspect that writing data to postscript—whether as a string or as bytes—is not a very widespread use in the wild, precisely because that field is not specified in any of the documentation I've seen for the format.

If you don't like my approach, let me know and I can do it the other way instead.

alexdej commented 5 years ago

the symmetry with preamble makes a lot sense to me too. let me take a look at your PR.

alexdej commented 5 years ago

Looks good! thanks for your contribution!

thisisparker commented 5 years ago

Thank you for being so responsive and quick! Do you happen to have a pypi release planned so I can pin to the new version?

alexdej commented 5 years ago

In the past @svisser handled pypi but it's been a while since I've heard from him. Let me see if I'm able to do a release.

alexdej commented 5 years ago

OK I think I figured it out lol. Good luck with your project.