Add c65 sub for py65 with blockio + make ctests (> 100x faster on current test suite)

patricksurry commented 6 months ago

Here's a c-based substitute for py65mon which also supports blockio and heatmap profiling. First build it (requires gcc):

cd c65
make

Now run the regular interactive taliforth session:

c65/c65 -r taliforth-py65mon.bin

You can run the test suite with make ctests, which produces tests/results.txt in about 1.2s compared to 160s on my macbook (over 100x faster). git diff on the results seems very close to talitest.py, other than the header lines between files and the cycle counting stuff at the end (which could probably be extracted here).

As a bonus you get read/write coverage data dump so you can see hotspots (see c65/profile.ipynb as an example).

The blockio support is explained in the README, but it's very easy. I use it like this: c65/c65 -r taliforth-py65mon.bin -b data.blk, where data.blk is a file with binary data, then define this word:

\ read (action=1) or write (2) one 1024 byte block number blk to/from address buf
: blkrw ( blk buf action -- )
    -rot $f014 ! $f012 ! $f010 !
;
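
Going only by the blkrw word above, the simulator side of this can be pictured roughly as follows. This is a Python sketch, not the c65 implementation: the register layout (action at $f010, block number at $f012, buffer address at $f014) is read off the Forth word, while `do_block_io`, `word_at` and the in-memory `disk` are hypothetical names for illustration. The README is the authority on the real layout.

```python
BLOCK_SIZE = 1024

# Addresses as used by the blkrw word; everything else here is illustrative.
ACTION_ADDR, BLK_ADDR, BUF_ADDR = 0xF010, 0xF012, 0xF014


def word_at(mem, addr):
    """Read a little-endian 16-bit cell, as Forth's ! stores it on a 6502."""
    return mem[addr] | (mem[addr + 1] << 8)


def do_block_io(mem, disk):
    """Copy one block between simulated memory and the block file contents."""
    action = mem[ACTION_ADDR]
    blk = word_at(mem, BLK_ADDR)
    buf = word_at(mem, BUF_ADDR)
    start = blk * BLOCK_SIZE
    if start + BLOCK_SIZE > len(disk) or buf + BLOCK_SIZE > len(mem):
        raise ValueError("block or buffer out of range")
    if action == 1:    # read: block file -> memory
        mem[buf:buf + BLOCK_SIZE] = disk[start:start + BLOCK_SIZE]
    elif action == 2:  # write: memory -> block file
        disk[start:start + BLOCK_SIZE] = mem[buf:buf + BLOCK_SIZE]
    else:
        raise ValueError("unknown action")


mem = bytearray(0x10000)          # 64K of simulated memory
disk = bytearray(4 * BLOCK_SIZE)  # stand-in for the contents of data.blk
disk[2 * BLOCK_SIZE] = 0x42       # mark the first byte of block 2

# Equivalent of the Forth phrase: 2 $1000 1 blkrw
mem[BUF_ADDR:BUF_ADDR + 2] = (0x1000).to_bytes(2, "little")
mem[BLK_ADDR:BLK_ADDR + 2] = (2).to_bytes(2, "little")
mem[ACTION_ADDR:ACTION_ADDR + 2] = (1).to_bytes(2, "little")
do_block_io(mem, disk)
```

Because blkrw stores the action cell last, a simulator that triggers on the write to $f010 would already see the block number and buffer address in place.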

SamCoVT commented 6 months ago

I'll want to see if this passes the Klaus Dormann (https://github.com/Klaus2m5/6502_65C02_functional_tests) test suite, which is a pretty exhaustive test for 65(C)02 simulators, before switching to it, but it looks pretty good. I'm happy to see a compatible license for fake65c02.h (Tali is also CC0/Public Domain). Do you know what the licensing status of c65.c is? Is that something you wrote?

It looks like the cycle counting could be implemented easily to allow the entire test suite to be run. I had to add it in python to py65, and I think I see where it could be added in C.

How did you feed the tests to c65?

In the Makefile, is the executable named c65 or prof65 (or am I missing something there)?

patricksurry commented 6 months ago

Sorry prof65 was a holdover on my part; i renamed as c65 as I added to this PR.

I wrote c65.c, happy to grant whatever license TaliForth likes. The header file is unchanged from the linked repo.

I ran and compared test output like this:

cd c65
make
cd ..
time make ctests
git diff upstream/master-64tass -- tests/results.txt

The diff looks good ignoring some header comments between tests, and the cycle counting.

Cycle count is available as int ticks within c65 as the simulator executes.

It would be trivial to write that to some fixed memory[] location as a double word if you wanted to access it from forth within the simulator, or vice versa you could add a magic memory location which the sim could hit to log the current cycle count. I didn't look at how the cycle counting currently works in the tests.

SamCoVT commented 6 months ago

I don't think I'm going to have enough free time to get to testing this thoroughly this weekend, so this PR will likely stay open for a few weeks. You're welcome to push more updates in the meantime.

This will be adding an extra dependency (A C compiler - which is a given on Linux, but not on Windows and I don't think on Mac (unless you've installed the Xcode tools)), but it may be worth it for the speedup. Tali in py65mon works on Linux, Mac, and Windows (tested native and in Cygwin - haven't tested in WSL yet) on both 2.7 and 3.x flavors of Python. I know Windows, especially, is funny about its console I/O, so I'll want to make sure that works as expected.

What platform are you working on? I have access to Windows 10, 11, and Linux.

SamCoVT commented 6 months ago

The license for Tali is CC0 Public Domain. Ideally all of the software that Tali comes with would have this license, although we do have one item in the forth_examples folder that has a different license. If you are OK with that, then CC0 is the license I'd like to use.

If you want to see how the cycle count works, you can look in tests/talitest.py, which extends py65mon. The cycle counting uses addresses $F006-$F00B (I think it just fits without overlapping your block I/O): a read of $F006 starts the cycle counting, a read of $F007 stops it, and the result can then be read from $F008-$F00B, but it's in NUXI format (you can ask the interwebs about NUXI endianness if you are not familiar). You can look at read_cycle_count, where the comment and code show the byte order - this allows Tali to read it directly out of memory as a Forth double value.
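
To make the byte order concrete, here is a small Python sketch of the start/stop/read protocol described above. It is not the code from talitest.py: `CycleCounter`, `on_read` and `to_nuxi` are hypothetical names, and it assumes NUXI here means high 16-bit word first with each word stored little-endian (the layout of a Forth double in memory on a 6502).

```python
START_ADDR, STOP_ADDR, RESULT_ADDR = 0xF006, 0xF007, 0xF008


def to_nuxi(count):
    """Pack a 32-bit count: high word first, each 16-bit word little-endian."""
    hi, lo = (count >> 16) & 0xFFFF, count & 0xFFFF
    return bytes([hi & 0xFF, hi >> 8, lo & 0xFF, lo >> 8])


class CycleCounter:
    """Tracks cycles between a read of START_ADDR and a read of STOP_ADDR."""

    def __init__(self):
        self.started_at = 0
        self.result = bytes(4)

    def on_read(self, addr, ticks):
        """Called by the (hypothetical) simulator on each memory read."""
        if addr == START_ADDR:
            self.started_at = ticks
        elif addr == STOP_ADDR:
            self.result = to_nuxi(ticks - self.started_at)
        elif RESULT_ADDR <= addr < RESULT_ADDR + 4:
            return self.result[addr - RESULT_ADDR]
        return 0


counter = CycleCounter()
counter.on_read(START_ADDR, 1000)
counter.on_read(STOP_ADDR, 1000 + 0x12345678)
stored = bytes(counter.on_read(RESULT_ADDR + i, 0) for i in range(4))
```

With this layout the count $12345678 sits in memory as $34 $12 $78 $56, which is the same shuffle that turns the bytes of "UNIX" into "NUXI".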

I also overrode the py65mon I/O so I could spoon feed from the test files and capture the results. The tester program checks to see if Tali crashed before reading all of the input and also searches for specific error messages in the output and displays a summary of what went wrong at the end, after running all of the tests.

I think all of this can be rewritten in C, and we can have one version for general use and another that will be augmented for running the tests (perhaps c65 and talitest as the executable names)

patricksurry commented 6 months ago

I'm on mac os x

SamCoVT commented 6 months ago

Oh good - OSX is the only platform I don't have access to.

patricksurry commented 6 months ago

Added test headers plus cycle counting magic for c65. Now make ctests looks v close to make tests - c65 seems to clock slightly higher on some tests. Both seem to be trying to do the right thing for page boundaries and branches taken, not sure which is more accurate.

I looked manually at 0=. The diff says:

-5            ' 0=            cycle_test drop       CYCLES:     50 ok
+5            ' 0=            cycle_test drop       CYCLES:     52 ok

I count 43 (beq taken) or 45 (beq not taken) plus 4 for one of the magic reads (lda $abs), which is 47 or 49. So different from both reports :)

.a790                   xt_zero_equal:       40+3 or 38+7 
.a790   20 25 d8    jsr $d825              10+6  jsr underflow_1        
.a793   b5 00       lda $00,x                 4  lda 0,x
.a795   15 01       ora $01,x                 4  ora 1,x
.a797   f0 04       beq $a79d                 2+  beq _zero
.a799   a9 00       lda #$00                  2  lda #0
.a79b   80 02       bra $a79f                 3+  bra _store
.a79d                   _zero:
.a79d   a9 ff       lda #$ff                  2  lda #$ff
.a79f                   _store:
.a79f   95 00       sta $00,x                 4  sta 0,x
.a7a1   95 01       sta $01,x                 4  sta 1,x
.a7a3   60      rts     z_zero_equal:   rts   6
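
The hand count can be checked mechanically. The Python sketch below just adds up the per-instruction cycles annotated in the listing (with the 16 for jsr taken as annotated, 10+6); the variable names are made up for illustration.

```python
# Cycle counts copied from the annotations in the listing above.
common = {
    "jsr underflow_1": 10 + 6,
    "lda 0,x": 4,
    "ora 1,x": 4,
    "sta 0,x": 4,
    "sta 1,x": 4,
    "rts": 6,
}
# beq taken costs 2+1 and jumps to _zero; beq not taken costs 2 and falls
# through lda #0 and bra (3), skipping lda #$ff.
taken_path = {"beq (taken)": 3, "lda #$ff": 2}
not_taken_path = {"beq (not taken)": 2, "lda #0": 2, "bra _store": 3}

MAGIC_READ = 4  # one lda $abs on a magic address inside the timed region

taken = sum(common.values()) + sum(taken_path.values())
not_taken = sum(common.values()) + sum(not_taken_path.values())

print(taken, not_taken)                            # 43 45
print(taken + MAGIC_READ, not_taken + MAGIC_READ)  # 47 49
```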

It's slightly weird how pymon codes BEQ vs BRA - they should both be 2+branch taken+page cross so 3+ for BRA and 2++ for BEQ?

@instruction(name="BRA", mode="rel", cycles=1, extracycles=1)
@instruction(name="BEQ", mode="rel", cycles=2, extracycles=2)
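
For comparison, the rule described above (a 2-cycle base, plus one if the branch is taken, plus one more if the target is on a different page) can be written out directly. This is a Python sketch of that rule only, not of py65 or c65 internals: `branch_cycles` is a hypothetical helper, and it assumes page crossing is judged against the address of the instruction following the branch.

```python
def branch_cycles(pc, offset, taken=True):
    """Cycles for a 2-byte relative branch located at address pc."""
    cycles = 2
    if taken:
        next_pc = (pc + 2) & 0xFFFF           # address after the branch
        target = (next_pc + offset) & 0xFFFF
        cycles += 1                           # branch taken
        if (target & 0xFF00) != (next_pc & 0xFF00):
            cycles += 1                       # page boundary crossed
    return cycles


# From the listing: beq at $a797 with offset 4, bra at $a79b with offset 2.
beq_taken = branch_cycles(0xA797, 4, taken=True)
beq_not_taken = branch_cycles(0xA797, 4, taken=False)
bra = branch_cycles(0xA79B, 2)  # BRA is always taken
```

Under this rule BRA is simply the taken case of a conditional branch, so it never costs fewer than 3 cycles.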

SamCoVT commented 6 months ago

I agree that the cycle count for BRA looks wrong. You can file that as an issue with py65, but it may take a while for Mike to get to it. It looks like he hasn't been working on py65 recently.

The cycle counts are mainly a double check that things didn't get radically slower (or faster), which might indicate that something was broken while making a change. I'm not concerned with the exact values, as some of them will change any time the code moves around and different words end up crossing a page boundary (as you've already seen).

I've been able to play a bit with c65, and I have the following notes: Input is not handled character by character, but rather line by line. If running it interactively, it prints the line as you type, but then Tali prints it again while processing the characters.

Getting it to work character by character will be a hassle if you want it to work cross platform because Windows and Linux do that fundamentally differently, and OSX has some differences to Linux as well. You end up writing special code for all three platforms. It's doable (I did it for py65mon), but it's a real hassle and requires digging into some nitty gritty details, especially if you want to switch back to line editing mode.

If the input were handled, I could see that it shouldn't take too much effort to bring it up to approximately py65mon levels of functionality.

The part that is lacking is that the current test setup can generate a summary at the end that repeats the errors from all failed tests, as well as telling if Tali did not finish the tests. The former could just look at the output of c65, but the latter requires some method of telling if the tests did not finish. The most common reason for stopping early is hitting a BRK instruction - usually when something horribly breaks and the PC ends up in a place it's not supposed to be.

Your current testing solution uses pipes, but Windows does pipes differently enough that it may not be a good fit here. Adding an option to c65 to feed (multiple) files as input to Tali would solve that issue.

Are you thinking that c65 is a solution to make just the testing for Tali2 go faster, or are you thinking of it as a total replacement for py65mon?

patricksurry commented 6 months ago

I don't think of c65 as a full replacement for py65mon - for example its monitor functionality for examining/changing registers and memory etc is great for debugging. More like an optional add-on which can streamline your workflow, and helpful if you want to play more with block devices.

In my own workflow c65 is great for speeding up iteration experimenting with tali source: super fast to run all tests and get a quick "all success?" indicator. I can always run same tests slowly if I need more granularity. I also find it very useful for my forth dev cycle where I can ingest a large volume of forth code v quickly and experiment with new changes. That gets painful in py65. Also I find the block device useful for loading/dumping code or memory easily from forth.

For these purposes I don't mind the input duplication, and i'm happy to have the line-editing while i'm experimenting. Since it's a tty thing it doesn't affect batch execution when I pipe input into c65 so I find that's fine for checking test output and so forth.

If you think the input duplication is important I could take a look at an option to bypass the terminal line mode; i have access to a windows box so probably doable.

For testing I just quickly hacked up the ctests target. Currently it concats all tests and runs in a single c65/taliforth session but would obviously be simple to have it loop over tests and run a separate session for each group, like talitest is doing.

SamCoVT commented 6 months ago

That sounds fine. Let's plan on leaving make sim using py65mon, make tests using py65mon, and make ctests to use c65 (having make attempt to compile c65 first, if it does not exist). We will need to check for the Windows platform to determine if c65 is named c65 or c65.exe, but I don't think that's problematic.
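
One way to handle the c65 versus c65.exe naming in a test driver is sketched below in Python. It uses only the -r and -b options shown earlier in this thread; `c65_command` is a hypothetical helper, the c65/ path mirrors the build steps above, and only native Windows is assumed to need the .exe suffix.

```python
import os
import sys


def c65_command(rom, block_file=None, platform=sys.platform):
    """Build the argument list for launching c65 on the given platform."""
    exe = "c65.exe" if platform.startswith("win") else "c65"
    cmd = [os.path.join("c65", exe), "-r", rom]
    if block_file is not None:
        cmd += ["-b", block_file]
    return cmd


cmd = c65_command("taliforth-py65mon.bin", "data.blk", platform="linux")
```

The resulting list could be handed to subprocess.run with the test input supplied on stdin, matching the piped workflow described above.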

If we can get c65 to where the output of make tests and make ctests is equivalent, then I have no issues having it as an alternate test platform. You can look at ptests.sh (linux only, but probably works fine on OSX and WSL on Windows as well) to see how I was running the tests in parallel and then collecting the results. c65 is so fast that I see no need to run tests in parallel, but it would be nice if we could make sure it works natively on Windows as well.

If you are interested, I think it's only a medium amount of work to get feature parity with py65mon, at which point we could consider removing py65mon as a requirement for Tali2 and adding a C compiler as a requirement instead. For Windows folk, we could either give cygwin instructions or WSL instructions. The default Ubuntu that Microsoft installs when you turn on WSL might already have make and gcc, in which case that would be the better option - I just installed it on one of my machines, so I'll look into that. At this point, make sim could use c65 and offer simulated block storage - and I'd consider moving the code for block-ramdrive-init into the example_forth directory to reduce Tali's size by over 300 bytes.

Are you interested in going this route? I can help with the I/O. Most of the python code I wrote for py65mon is actually just calling the underlying C functions, so much of that is reusable here. Also, if you are interested in going this route, do you want to set up c65 as a separate project or would you rather leave it here as part of Tali2? If it's here as part of Tali, then we would only need to support it for use with Tali, which might make things a bit easier.

If the remaining utilities were rewritten in C, we could also remove the python requirement altogether. I'm not averse to that.

I don't have a full idea of exactly what I need py65mon to do to get make ctests equivalent to make tests, but basically I just want something that will work on Linux, OSX, and Windows. I think you are most of the way there already with a solution that works on OSX and Linux. I don't really care how it's handled, so you're welcome to check the OS and have different behavior or to try to get something that works on all three platforms.

patricksurry commented 5 months ago

i will probably poke around with this in the next couple of weeks but might not get to it right away

patricksurry commented 5 months ago

@SamCoVT here's a first pass at non-blocking, unbuffered IO for c65. I haven't done extensive testing but it builds and seems to work on my mac without duplicated text or mangling the terminal. I also ssh'd to my windows 10 box and used wsl + ubuntu to build and run there which also seems to work without changes (somewhat to my surprise; see c65/README). Not sure if that's what you intended or looking for a native windows exe.

lmk and we can figure out what makes sense next

SamCoVT commented 5 months ago

Works on Linux as well. I think there is enough functionality now (ability to load binary at arbitrary location and start running at arbitrary location) to run the Dormann suite if I enable the I/O. I'm not sure when I'll have time to do that. It always seems to take me multiple tries to get it assembled and running properly.

patricksurry commented 5 months ago

One extra thing I did here since it's easy with the new IO was add a non-blocking peekc location at $f005, so people could fool around with KEY? if they wanted. I changed to just one address to move the whole IO block since I've never needed to move individual magic locations on their own, but I do move the whole block. See updated README.

: foo begin ." ." 30000 0 do loop $f005 @ ?dup until ; redefined foo  ok
foo ............................................................................................... ok
.s <1> 243  ok

patricksurry commented 5 months ago

also just for amusement here's a proof of concept replacing py65 with c65 within talitest.py. git diff --ignore-all-space results.txt only shows header/footer and the cycle disagreements, along with dos v unix line endings I think.

The tick count at the end claims 147.5M cycles for all tests in 1.1s => 134MHz? I think I had a '286 machine @ 133MHz back in the day :-)

% time python talitest_c65.py

================================================================================
Summary for: core_a core_b core_c string double facility ed asm tali tools block search user cycles
Tali Forth 2 ran all tests requested
All available tests passed
python talitest_c65.py  1.00s user 0.16s system 106% cpu 1.095 total

tail results.txt
...
5            ' 0<            cycle_test drop       CYCLES:     47 ok
5            ' 0<>           cycle_test drop       CYCLES:     48 ok
  ok
$ff $f010 ! c65: PC=a142 A=ff X=74 Y=01 S=f9 FLAGS=<N1 V0 B0 D0 I1 Z0 C0> ticks=147515561
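
The effective clock rate quoted above follows directly from the numbers in this output. A quick Python check, using the tick count from the last line and the wall-clock total reported by time:

```python
ticks = 147_515_561  # ticks= value from the c65 exit line
elapsed_s = 1.095    # "total" from the time output

mhz = ticks / elapsed_s / 1e6
print(f"{mhz:.1f} MHz")
```

This lands between 134 and 135 MHz, in line with the rounded figure in the comment.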

SamCoVT commented 5 months ago

This looks really good. I'm not sure if I'll have time this weekend to get the Dormann tests to run, but that's the last piece before I'm interested in adding this as a complete replacement for the current test system. We will also want the Makefile to build c65 if it doesn't exist, which is slightly complicated by Windows wanting a .exe extension on the end (does compiling under WSL result in an extensionless binary? I don't actually know).

patricksurry commented 5 months ago

ya, no extension in wsl, the existing c65 makefile worked unchanged. so top-level should just need to try a sub-make in the c65 folder

On Fri, Apr 5, 2024 at 11:58 AM SamCoVT wrote:

This looks really good. I'm not sure if I'll have time this weekend to get the Dormann tests to run, but that's the last piece before I'm interested in adding this as a complete replacement for the current test system. We will also want the Makefile to build c65 if it doesn't exist, which is slightly complicated by Windows wanting a .exe extension on the end (does compiling under WSL result in an extensionless binary? I don't actually know).


SamCoVT commented 5 months ago

That's good news. Recommending WSL for windows users who need to run tests is probably the easiest way to get GNU Make and a C compiler on a windows box. Those who have installed make and gcc natively on Windows and have them working from a command prompt can probably also handle adjusting the Makefiles as needed.

SamCoVT / TaliForth2

Add c65 sub for py65 with blockio + make ctests (> 100x faster on current test suite) #37