SvarDOS / bugz

SvarDOS bug tracker
http://svardos.org/
6 stars 0 forks source link

replace TREE with PDTREE #127

Open mateuszviste opened 1 week ago

mateuszviste commented 1 week ago

SvarDOS comes currently with TREE v3.7.2 from the FreeDOS project, itself being a 1995 commercial code by Dave Dunfield, GPL-ized somewhere in the 2000s.

There is also pdTree: a public domain version written by @PerditionC https://web.archive.org/web/20011212151852/http://www.darklogic.org/fdos/projects/tree/ https://github.com/FDOS/tree

Glancing at the source code it appears to be a very clean and memory-efficient implementation. Perhaps SvarDOS could use it instead of the current GPL tree after some minor work:

boeckmann commented 6 days ago

Well, that bullet list doesn't exactly look like "minor work". But if you think its worth it, go for it. Maybe we should put a counter anywhere which counts the packages still being GPL licensed, and try to converge that towards zero :)

PerditionC commented 6 days ago

it should build with ow given it builds with msvc, but I will check it this week. note I have a pd implementation of cats that I plan on switching out the lgl one with and releasing as fd tree 4.0. It's been a while, I'm guessing it builds with a batch file. Adding a make based build for ow should be trivial, so I will check on it. I guess I need to make sure all translations are updated, both versions use same files for translations. There is no c++ other than maybe some variable scoping, I believe it was a naming conflict resolution. The api is based on win32 findfile, so removing lfn support likely won't really reduce size much.

I have too look, may need to add a findfile using ow pragma aux.

(off topic, but fd format is switching back to small model, but small penalty to copy strings from far buffer to near accessible one).

I will try to get some updates on GitHub later this week.

mateuszviste commented 6 days ago

Maybe we should put a counter anywhere which counts the packages still being GPL licensed, and try to converge that towards zero :)

I did think about it for a short time, but ultimately dropped the idea because I do not want to give the impression that "fighting" with GPL is one of SvarDOS goals.

I plan on switching out the lgl one with and releasing as fd tree 4.0.

So pdTree will soon become the default FreeDOS tree, replacing the 3.7.2 version from Dave Dunfield? Cool. Is there any practical reason for it?

BTW the FreeDOS listing is somewhat confusing: it lists TREE 3.7.2 by Dave Dunfield, but provides a link to your PD version: https://www.ibiblio.org/pub/micro/pc-stuff/freedos/files/distributions/1.2/repos/pkg-html/tree.html

I have too look, may need to add a findfile using ow pragma aux.

It's exactly what I was planning to do :) Great to hear that you are still interested in this 20-years old code!

mateuszviste commented 5 days ago

I started doing some maintenance & trimming on pdTree here: http://svn.svardos.org/log.php?repname=SvarDOS&path=%2Ftree%2Ftrunk%2F&isdir=1

So far I removed most of Windows-specific stuff, utf8 output, CATGETS, and wide characters support. I have also replaced all C++-isms by ANSI C equivalents, so this fork of pdTree is 100% C now.

mateuszviste commented 3 days ago

After some more hours of hacking, slashing and reformatting, SvarDOS TREE is finally functional. It compiles with OpenWatcom with a proper makefile, loads its translations via SvarLANG and appears to work reliably as far as I can tell. I removed a LOT of code (no LFN, no Windows support, no Unicode...), simplified many parts of it and replaced some functions with OpenWatcom-specific calls.

UPXed it is a 11K COM file now. Not as small as I want it to be, but there is still hope to get a binary smaller than the GPL FreeDOS TREE (10K) once I get rid of printf. I will probably not go the WMINCRT route, though, since I rely heavily on OpenWatcom's libc now.

Two things that I have yet to do:

boeckmann commented 3 days ago

Well done :) We could spare some bytes by not writing the default language to the .lng file. We could adapt tlumacz to at least optionally skip outputting the default language. For FDISK, for example, that would save over 10k bytes. And I am sure that I will recompile the program when I have to update the default language anyway.

mateuszviste commented 3 days ago

We could spare some bytes by not writing the default language to the .lng file.

Yes but then it would not be possible to re-load EN from within the application (if the application supportes such reloads). This is used by the SvarDOS installer.

mateuszviste commented 3 days ago

a win would be to have each language in the LNG file optionally compressed. It's human language, lots of redundancy. There must be some algorithm nowadays that is capable of in-place depacking with a reasonably small depacking code.

boeckmann commented 3 days ago

I took the opportunity to make the makefile compatible with Linux. Should still work fine under DOS (at least in DosBox it does).

ecm-pushbx commented 3 days ago

a win would be to have each language in the LNG file optionally compressed. It's human language, lots of redundancy. There must be some algorithm nowadays that is capable of in-place depacking with a reasonably small depacking code.

You may want to look at my use of heatshrink in my Extensions for lDebug, extpak.eld and list.eld. You need a buffer the size of the depack window. If you access the compressed file in a linear way then you can re-use the same depacker state across several calls into the depacker.

ecm-pushbx commented 3 days ago

Main depacker is in https://hg.pushbx.org/ecm/ldebug/file/a35f88de973a/source/eld/depack.asm

mateuszviste commented 3 days ago

Main depacker is in https://hg.pushbx.org/ecm/ldebug/file/a35f88de973a/source/eld/depack.asm

Seems a bit complicated. I was thinking about something much simpler. Maybe a custom algorithm that would read bytes from the data file and when spotting a special marker (say, 0xFF O S), it would know that it has to copy now S bytes from offset -O in past stream. Probably not the most efficient approach, but should be easy to implement in any language, and might provide good enough results for text strings. I might investigate this in incoming days, once I am done with SvarTREE.

mateuszviste commented 2 days ago

I committed an experimental change to SvarLANG: TLUMACZ outputs now two versions of the LNG file: one as usual (OUT.LNG) and the other one compressed (OUTC.LNG). SvarLANG is able to read both. Older versions of SvarLANG will not crash on it, just ignore the compressed languages.

The compression scheme is very simple, I invented it during a coffee break and implemented within an evening hour. It is not meant to be state of the art - just somewhat better than uncompressed text and simplest possible to decompress with no extra memory buffer. It assumes highly redundancy of comoressed data, should work well with text, but attempting to compress binary things is likely to produce "compressed" data twice larger than the original.

It works on a simple example, I will proceed with further tests tomorrow or next week. There is also one or two tweaks I'd like to check.

mateuszviste commented 2 days ago

Some preliminary results:

TREE still works after compression. FDISK I could not check because for some reasons I am not able to compile it (unrelated to SvarLANG).

mateuszviste commented 2 days ago

SVARCOM.LNG compresses from 59K to 35K. Works properly.

This looks quite good. I think I will publish a new SvarLANG version today or tomorrow, and then migrate SVARCOM to it. I just need to test it on some real & ancient (286) hardware first to make sure it is not too slow.

boeckmann commented 2 days ago

compressing FDISK.LNG makes the string resources over twice smaller (72K uncompressed, 35K compressed).

Awesome :) Time then for a new FDISK version.

FDISK I could not check because for some reasons I am not able to compile it (unrelated to SvarLANG).

Can you give some details about the build environment? Then I can try to reproduce.

boeckmann commented 2 days ago

Btw. the current FDISK.LNG (version 1.3.16) is 104k. May it be that you are working with an older FDISK source?

boeckmann commented 2 days ago

Or is it only the strings you are counting?

mateuszviste commented 2 days ago

May it be that you are working with an older FDISK source?

Yes, I took some old source tree that was laying on my HDD, I did not have enough courage to study git again to find how to checkout the latest code. I will download the latest source as a zip file and try again.

mateuszviste commented 2 days ago

tested with latest FDISK source tree. With SvarLANG's MVCOMP the strings are: uncompressed = 102K compressed = 49.7K

FDISK works fine (help screen and main screen display correctly).

boeckmann commented 2 days ago

Perfect! Do you mind if I add a /x switch to tlumacz to exclude the default lang? Would save another 6-8k in case of FDISK...

mateuszviste commented 2 days ago

Perfect! Do you mind if I add a /x switch to tlumacz to exclude the default lang? Would save another 6-8k in case of FDISK...

Be my guest. :)

I think I am done with SvarLANG for now, MVCOMP is working unexpectedly well for such a quick hack.

mateuszviste commented 1 day ago

I have moved mvcomp compression of lang blocks to a dedicated switch: TLUMACZ /comp When compression is enabled, it will actually compress the lang block only if it is beneficial (saves at least one byte) - so using /comp will never be worse than uncompressed LNG (worst case it will be left uncompressed).

mateuszviste commented 1 day ago

Having the compression flag on a per-language basis allows to have some languages in the LNG compressed and other not. I was not sure if this was useful to do it like that, but now I see it was a good decision. In the case of TREE, some languages can be slightly compressed, while other do not, so TLUMACZ applies compression only where it makes sense:

lang EN mvcomp-ressed (949 bytes -> 942 bytes)
lang DE mvcomp-ressed (1001 bytes -> 990 bytes)
lang ES mvcomp-ressed (1091 bytes -> 1032 bytes)
lang FI left UNCOMPRESSED (uncomp=943 bytes ; mvcomp=956 bytes)
lang LV left UNCOMPRESSED (uncomp=948 bytes ; mvcomp=1034 bytes)
lang PT mvcomp-ressed (1019 bytes -> 978 bytes)
lang RU left UNCOMPRESSED (uncomp=959 bytes ; mvcomp=1082 bytes)
lang TR left UNCOMPRESSED (uncomp=985 bytes ; mvcomp=1038 bytes)
mateuszviste commented 1 day ago

Perfect! Do you mind if I add a /x switch to tlumacz to exclude the default lang? Would save another 6-8k in case of FDISK...

saves exactly 6.7K. I have added /excref to TLUMACZ.

boeckmann commented 1 day ago

Oh, ok then we did it both :D

boeckmann commented 1 day ago

Also interesting: ZIP only gets the compressed FDISK.LNG 12%(!) smaller while compressing it when wmake dist is called.

boeckmann commented 1 day ago

Coming within 12% of ZIP is quite remarkable for such a "simple" algorithm. And a fair bit of it will be attributed to the compression of the dictionary, I think.

boeckmann commented 1 day ago

Regarding the tree makefile: did rm -f actually fail for you under DOS? Because this is a WMake internal and even on DOS should work perfectly fine, according to the manual.

boeckmann commented 1 day ago

Regarding the tree makefile: did rm -f actually fail for you under DOS? Because this is a WMake internal and even on DOS should work perfectly fine, according to the manual.

See https://open-watcom.github.io/open-watcom-v2-wikidocs/ctools.pdf, page 155

mateuszviste commented 1 day ago

Oh, ok then we did it both :D

ooops, I thought you're off for today, and since I wanted to push a new SvarLANG today I added the excref myself, sorry.

Coming within 12% of ZIP is quite remarkable for such a "simple" algorithm.

The inventor must be a genius. :) No, but seriously this was an easy case. As soon as I realized that pretty much 100% of the strings will be compressible (lack of randomness), it became obvious to me that I have to take advantage of this and make the backreference syntax as short as possible, leaving uncompressed data to be the "expensive special case". With a more random data (non-text) MVCOMP behaves poorly and usually makes the output larger than it originally was (with no redundancy at all, every byte of input is written as two bytes). Then the only choice left was how far I need to go with backreference offsets vs how much bytes I can match, and set the cursor in an optimal place. For most human languages it appears that matching repetitions of 16 characters is more than enough, which left 12 bits for the backref offset (ie. a window of 4K). Different ratios were behaving worse, at least when ran against FDISK.LNG.

Shortly said, MVCOMP is good with text, but that's the only thing it can do. ZIP is much, much more capable & universal.

And a fair bit of it will be attributed to the compression of the dictionary, I think.

Yes, the dictionary is most probably what ZIP compresses. MVCOMP is very bad at this, too low redundancy level.

Regarding the tree makefile: did rm -f actually fail for you under DOS?

You are actually right! At some point wmake failed because all the nls/??.txt files were present. I quickly looked into the makefile and saw the rm -f so I though it was the culprit. But I re-tested now and rm -f in the makefile does indeed work in DOS, so my issue must have been different, maybe a failed tlumacz or something. I will revert the universal makefile then. Thanks for noticing!

boeckmann commented 1 day ago

Coming within 12% of ZIP is quite remarkable for such a "simple" algorithm. And a fair bit of it will be attributed to the compression of the dictionary, I think.

Turned out that my conclusion was not valid, as I compared the sizes of the compressed LNG with the "compressed compressed" ZIPPED LNG. But ZIP yields better compression when working on the uncompressed LNG file. The results are:

uncompressed: 90.340 bytes
MVCOMP LNG: 43.944 bytes
uncompressed ZIP: 32.523 bytes

So the ZIP is actually ~25% smaller, but in my opinion not worth the runtime and size overhead it introduces...

boeckmann commented 1 day ago

I think the svarlang.txt should be updated to reflect the changes in the format.

mateuszviste commented 1 day ago

uncompressed ZIP: 32.523 bytes So the ZIP is actually ~25% smaller, but in my opinion not worth the runtime and size overhead it introduces...

Deflate is quite memory hungry, yes.

But your comparison is not exactly valid anyway because you compress the entire LNG as a single stream, while MVCOMP must compress each language separately so the application does not have to process all languages but only its own. For example ZIP starts with EN, and when it gets to DE it already has a big dictionnary that is likely to match DE since the syllabs are mostly the same. It's a big help.

Also, there is a way to improve MVCOMP's efficiency by introducing a "next x bytes are uncompressed" prefix. I have weakly tested such approach but dropped it since it immediately made the depacking routine more complicated, making the svarlang code larger than the 1K limit I'd prefer not to cross so the lib stays "sexy" even for small commandline tools.

boeckmann commented 1 day ago

But your comparison is not exactly valid anyway because you compress the entire LNG as a single stream, while MVCOMP must compress each language separately so the application does not have to process all languages but only its own. For example ZIP starts with EN, and when it gets to DE it already has a big dictionnary that is likely to match DE since the syllabs are mostly the same. It's a big help.

Yes, also the search tables are not compressed by MVCOMP. But I think comparison is still valid in the sense of getting a "lower bound" and how near one can come to that.

I think you should also consider adding MVCOMP to AMB. Another intersting project would be combining MVCOMP with the .PAK file format to generate a truely "8086 compatible" compression archive format for text.

mateuszviste commented 1 day ago

I think you should also consider adding MVCOMP to AMB.

It crossed my mind. I will look into it.

Another intersting project would be combining MVCOMP with the .PAK file format to generate a truely "8086 compatible" compression archive format for text.

Too exotic for me, but maybe I will publish mvcomp as a tiny, standalone tool and an example library. Then whoever wants to use it can do so with all informations at hand. It would also make it easier to compare mvcomp with zip or other solutions, out of curiosity. And maybe I will spot some ways to improve it then, without making it more complex.

Anyway, on the short term I really need to focus on TREE so I can finally push it out of my mind. I'm too easily distracted I guess.

mateuszviste commented 1 day ago

I thought about an mvcomp improvement last night, implemented it for test now. It allows to mark streams of up to 31 bytes as "uncompressed" and takes only a couple of code bytes more in the depacking runtime. It's committed in trunk. The new code is still able to depack earlier mvcomp files.

This mvcomp improvement leads to an about 3-4% compression gain. And, most importantly, it practically eliminates situations where the compressed output is larger than the input (can still happen, but only with very small files). Tested with TREE and FDISK - works. Plus, all the TREE languages are compressible now (earlier half of the TREE languages were uncompressible). FDISK.LNG went from 49.6K to 47.9K.

I have also made a test creating an FDISK.LNG file that contains only DE. I compressed it with mvcomp1 (from yesterday), mvcomp2 (from today) and zip -9. Having only a single language makes the ZIP comparison fair because then ZIP cannot "cheat" by reusing the same dictionary for multiple languages. Results:

FDISK_DE.LNG uncompressed = 14'845 bytes (100%)
FDISK_DE.LNG with mvcomp1 =  7'126 bytes (48%, ie. 52% better than uncompressed)
FDISK_DE.LNG with mvcomp2 =  6'856 bytes (46%, ie. 4% better than mvcomp1)
FDISK_DE.LNG with ZIP -9  =  5'800 bytes (39%, ie. 15% better than mvcomp2)

mvcomp2 seems to work fine, but I do not trust entirely my packing routine yet. I will have to test it more (meaning no SvarLANG update today, and probably not tomorrow).

mateuszviste commented 21 hours ago

Found a quick win to gain an extra percent without changing the MVCOMP format nor the depacker code, just a compressor optimization that leverages the cheapness of the "literal string" feature I've added today. FDISK.LNG is 47.2K now.

Updated "FDISK DE" test results:

FDISK_DE.LNG uncompressed = 14'845 bytes (100%)
FDISK_DE.LNG with mvcomp1 =  7'126 bytes (48%, ie. 52% better than uncompressed)
FDISK_DE.LNG with mvcomp2 =  6'752 bytes (45%, ie. 5% better than mvcomp1)
FDISK_DE.LNG with ZIP -9  =  5'800 bytes (39%, ie. 14% better than mvcomp2)