brouhaha / tapeutils

GNU General Public License v2.0

Make -b work better with ascii files #17

Closed (bictorv closed this issue 1 year ago)

bictorv commented 1 year ago

Problem: when extracting e.g. the Panda tape image with -b to get all the files, many ascii files get trailing NULs (those whose length is not a multiple of 5 bytes) even when they don't actually contain trailing NULs on the tape file. The file length when read is also wrong, rounded up to a multiple of 5.

Example: with https://github.com/PDP-10/panda/blob/master/tape-image/panda.tap.bz2 uncompressed, run: read20 -c -b -x -v -f panda.tap -e exec1.mac. The resulting file (exec-sources/exec1.mac) should have length 151281, but instead ends up with length 151285, where the last four bytes are NULs.
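The length arithmetic in the example can be checked with a small sketch. It assumes the usual packing of five 7-bit bytes per 36-bit word; the function name `padded_length` is made up for illustration and is not part of read20.

```python
BYTES_PER_WORD = 5  # assumption: five 7-bit bytes are packed into each 36-bit word


def padded_length(byte_count: int) -> int:
    """Length of the extracted file when output is rounded up to whole words."""
    words = -(-byte_count // BYTES_PER_WORD)  # ceiling division
    return words * BYTES_PER_WORD


# The true length from the example, and the length actually observed with -b.
print(padded_length(151281))           # 151285
print(padded_length(151281) - 151281)  # 4 trailing NULs
```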

This leads to at least two problems:

  1. the extracted data is not what the tape actually contains
  2. the extracted files confuse tools like diff, or GitHub Desktop, which won't show diffs (because it thinks the file is binary).

The reason for the trailing NULs is that with -b, all files are extracted as binary (bytesize 36) even though the FDB on the tape says they are 7- or 8-bit.

The suggested fix is to

  1. with -b, only hack the bytesize of files if it isn't 7 or 8
  2. change the order of the cases in doDatablock to only treat files as binary if they aren't 7- or 8-bit.

The result is that when extracting files from tape images, the files get the correct length without trailing NULs.
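As a rough sketch of the decision the suggested fix describes (not the actual read20 code, which is C), the bytesize override could look like this. The function name `effective_bytesize` and its parameters are hypothetical.

```python
def effective_bytesize(fdb_bytesize: int, force_binary: bool) -> int:
    """Pick the bytesize used for extraction (hypothetical helper).

    fdb_bytesize: the bytesize recorded in the file's FDB on tape.
    force_binary: True when -b was given on the command line.
    """
    if fdb_bytesize in (7, 8):
        # Proposed change: leave 7- and 8-bit files alone, even with -b.
        return fdb_bytesize
    if force_binary:
        # -b still forces everything else to be treated as 36-bit binary.
        return 36
    return fdb_bytesize


print(effective_bytesize(7, True))   # 7
print(effective_bytesize(36, True))  # 36
print(effective_bytesize(18, True))  # 36
```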

larsbrinkhoff commented 1 year ago

I introduced the -b switch, so I should explain my thinking. I wanted a way to extract all files with a uniform format that preserves all bits, and also renders ASCII text readable whilst keeping CR LF line endings as in the original data. read20 had -T and -c which addressed some of those concerns, but not all. So I added -b which does a simple transformation for all files. Please keep -b as it is.

Regarding the specific case of the https://github.com/PDP-10/panda/ repository, I created it to put an archival copy of the Panda distribution on GitHub. I included the extracted files for ease of searching and downloading individual files, should anyone want to do so. I think it would make sense to have another repository with files in a more convenient arrangement for development. Text files could have any format, and binary files would possibly not be necessary.

> many ascii files get trailing NULs (those whose length is not a multiple of 5 bytes) even when they don't actually contain trailing NULs on the tape file. The file length when read is also wrong, rounded up to a multiple of 5.

I agree this is a problem that should be fixed, and I do see a straightforward solution for 7-bit files. Just truncate the output file to the right size. This is not a problem when converting text back to 36-bit data, because a word will be padded out with zeroes. -b will still not work nicely with 8-bit files, but I hope this would not be much of a problem. Does Panda include any of those?
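A minimal sketch of the truncation idea, assuming the byte count and bytesize are available from the FDB. The helper names are invented for illustration; this is not the code that went into read20.

```python
def truncate_7bit_output(data: bytes, fdb_byte_count: int, fdb_bytesize: int) -> bytes:
    """Drop word padding from a 7-bit file; leave every other file untouched."""
    if fdb_bytesize == 7 and fdb_byte_count <= len(data):
        return data[:fdb_byte_count]
    return data


def pad_to_words(data: bytes, bytes_per_word: int = 5) -> bytes:
    """Going back to 36-bit data: pad the last word out with zeroes."""
    remainder = len(data) % bytes_per_word
    if remainder == 0:
        return data
    return data + b"\0" * (bytes_per_word - remainder)


extracted = b"HELLO, WORLD\0\0\0"          # 12 real bytes plus 3 NULs of padding
trimmed = truncate_7bit_output(extracted, 12, 7)
print(trimmed)                             # b'HELLO, WORLD'
print(pad_to_words(trimmed) == extracted)  # True: padding is restored on the way back
```

The round trip at the end illustrates why truncation loses nothing for 7-bit text: the padding can always be regenerated.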

larsbrinkhoff commented 1 year ago

(Parenthetically, my own implementation of dumper will not output the additional NULs. I wrote this version because it handles a wider range of tape formats.)

bictorv commented 1 year ago

Could you elaborate on what problem my fix introduces? If it is related to 8-bit files, it's easy to let those keep the old behaviour, but I really don't see the problem. (There are a few 8-bit files on the Panda tape, grep for ' 8 7' in the listing, mainly in the unix part.)

Again: what you get on disk after reading with the original -b, is NOT what is on tape.

With the original -b behaviour, to get a real copy on disk of what is on the tape, you need to first read the tape with -b, and then again without it (but with -c, of course).

larsbrinkhoff commented 1 year ago

Your fix uses getstring to extract 7-bit bytes from the tape. It ignores bit 35, which can sometimes be a 1.
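To illustrate the bit 35 point: five 7-bit bytes occupy bits 0 to 34 of a 36-bit word, leaving bit 35 (the least significant bit) over. The sketch below compares a 7-bit-only extraction with one that keeps every bit. It assumes a layout where bit 35 is carried in the high bit of the fifth output byte; that layout is an assumption for illustration and is not confirmed in this thread. The function names are hypothetical and do not reproduce getstring.

```python
def pack_word(chars: bytes, bit35: int = 0) -> int:
    """Pack five 7-bit bytes plus bit 35 into one 36-bit word."""
    assert len(chars) == 5
    word = 0
    for c in chars:
        word = (word << 7) | (c & 0x7F)
    return (word << 1) | (bit35 & 1)


def unpack_7bit(word: int) -> bytes:
    """Extract only the five 7-bit bytes; bit 35 is silently dropped."""
    return bytes((word >> shift) & 0x7F for shift in (29, 22, 15, 8, 1))


def unpack_all_bits(word: int) -> bytes:
    """Assumed bit-preserving layout: bit 35 goes in the high bit of byte 5."""
    out = bytearray(unpack_7bit(word))
    out[4] |= (word & 1) << 7
    return bytes(out)


plain = pack_word(b"HELLO", bit35=0)
flagged = pack_word(b"HELLO", bit35=1)

print(unpack_7bit(plain) == unpack_7bit(flagged))          # True: the difference is lost
print(unpack_all_bits(plain) == unpack_all_bits(flagged))  # False: the difference survives
```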

I added -b to have a simple transformation applied uniformly to all files. This makes it easier to analyze data in files without having to figure out if they are text or binary. Please don't change what -b does. Except correcting the file length, which I will do.

bictorv commented 1 year ago

Ah, bit 35.