Erroneous Amstrad CPC BASIC programs conversion

mgesteiro commented 1 year ago

Hello,

First of all, thanks for the project and all the effort: it's been very valuable.

The BASIC programs conversion for the Amstrad CPC platform is not working properly: a program created in the CPC is incorrectly detokenised.

At the beginning it just seemed like a small issue in the token array table and I was about to send a PR, but then I discovered also -at least- a couple more issues:

a ":" character is inserted before an "ELSE"
a couple of 0x00 chars are added before variables names (invisible to the "naked eye")

Here is the BASIC program in case you want to check it:

"14000A00C5204349465241444F20434553415200060014008A000C001E00030000D2EF22220015002800A32022544558544F3A20222C030000D40012003200030000D4EFFF1C28030000D4290018003C009E200D0000E9EF0F20EC20FF0E28030000D429001C0046000D0000C3EFFF0128AC28030000D42C0D0000E92C0F29290044005000A1200D0000C3EE194020FA200D0000C3F1195B20EB200D0000CEEF2828280D0000C3F5194129F4190D29FB20191A29F41941200197200D0000CEEF0D0000C30017005A00030000D2EF030000D2F4FF03280D0000CE29000B006400B0200D0000E9000B006E00BF20030000D2000A007800A0201E1E00"

BTW, I checked this against the output produced by the JavaCPC similar feature, and it works fine there.

Let me know if I can be of any help.

Best regards!

eightbitjim commented 1 year ago

Hi! Thanks for taking the time to look into this and posting the issue. Much appreciated. I'll take a look and push an update once I think it's working. The example file you provided will be useful.

mgesteiro commented 1 year ago

You're wellcome 👌

This is the fixed token array: it only had 126 elements because the "<>" was missing (and an unnecessary last empty byte)

public static String [] lowTokenArray = {
      "AFTER", "AUTO", "BORDER", "CALL", "CAT", "CHAIN", "CLEAR", "CLG", "CLOSEIN", "CLOSEOUT",
            "CLS", "CONT", "DATA", "DEF", "DEFINT", "DEFREAL", "DEFSTR", "DEG", "DELETE", "DIM", "DRAW",
            "DRAWR", "EDIT", "ELSE", "END", "ENT", "ENV", "ERASE", "ERROR", "EVERY", "FOR", "GOSUB", "GOTO",
            "IF", "INK", "INPUT", "KEY", "LET", "LINE", "LIST", "LOAD", "LOCATE", "MEMORY", "MERGE", "MID$",
            "MODE", "MOVE", "MOVER", "NEXT", "NEW", "ON", "ON BREAK", "ON ERROR GOTO", "SQ", "OPENIN", "OPENOUT",
            "ORIGIN", "OUT", "PAPER", "PEN", "PLOT", "PLOTR", "POKE", "PRINT", "'", "RAD", "RANDOMIZE", "READ",
            "RELEASE", "REM", "RENUM", "RESOTRE", "RESUME", "RETURN", "RUN", "SAVE", "SOUND", "SPEED", "STOP",
            "SYMBOL", "TAG", "TAGOFF", "TROFF", "TRON", "WAIT", "WEND", "WHILE", "WIDTH", "WINDOW", "WRITE",
            "ZONE", "DI", "EI", "FILL", "GRAPHICS", "MASK", "FRAME", "CURSOR", "", "ERL", "FN", "SPC",
            "STEP", "SWAP", "", "", "TAB", "THEN", "TO", "USING", ">", "=", ">=", "<", "<>", "<=",
            "+", "-", "*", "/", "^", "\\", "AND", "MOD", "OR", "XOR", "NOT", ""
    };

I'll leave to you the other required fixes 😅

If it helps, I've been using this for comparison/checking: https://sourceforge.net/p/javacpc/svn/HEAD/tree/JavaCPC_Desktop/src/jemu/system/cpc/Baslist.java

Best regards

eightbitjim commented 1 year ago

I've incorporated the updated token array, and also added suppression of the colon : before an ELSE.

The 0-byte characters appear to have been an incorrect decoding of variable names and string literals (particularly detecting the end of a string literal).

Here's a dump of the decoded file, hopefully that's a bit better:

00000000   31 30 20 52  45 4D 20 43  49 46 52 41  44 4F 20 43  45 53 41 52  0A 32 30 20  43 4C 53 0A  33 30 20 52  10 REM CIFRADO CESAR.20 CLS.30 R
00000020   24 3D 22 22  0A 34 30 20  49 4E 50 55  54 20 22 54  45 58 54 4F  3A 20 22 2C  54 24 0A 35  30 20 54 24  $="".40 INPUT "TEXTO: ",T$.50 T$
00000040   3D 55 50 50  45 52 24 28  54 24 29 0A  36 30 20 46  4F 52 20 69  3D 31 20 54  4F 20 4C 45  4E 28 54 24  =UPPER$(T$).60 FOR i=1 TO LEN(T$
00000060   29 0A 37 30  20 43 3D 41  53 43 28 4D  49 44 24 28  54 24 2C 69  2C 31 29 29  0A 38 30 20  49 46 20 43  ).70 C=ASC(MID$(T$,i,1)).80 IF C
00000080   3E 36 34 20  41 4E 44 20  43 3C 39 31  20 54 48 45  4E 20 4E 3D  28 28 28 43  2D 36 35 29  2B 31 33 29  >64 AND C<91 THEN N=(((C-65)+13)
000000A0   4D 4F 44 20  32 36 29 2B  36 35 20 45  4C 53 45 20  4E 3D 43 0A  39 30 20 52  24 3D 52 24  2B 43 48 52  MOD 26)+65 ELSE N=C.90 R$=R$+CHR
000000C0   24 28 4E 29  0A 31 30 30  20 4E 45 58  54 20 69 0A  31 31 30 20  50 52 49 4E  54 20 52 24  0A 31 32 30  $(N).100 NEXT i.110 PRINT R$.120
000000E0   20 47 4F 54  4F 20 33 30  0A                                                                             GOTO 30.

Making the changes has exposed that my Amstrad BASIC decoding is quite messy, split between Java classes in a way that isn't particularly helpful. So, I'll need to do a bit of refactoring before it's ready to push back again, and test with some more BASIC programs (I have a tape!). Will aim to do this at the weekend :-)

eightbitjim commented 1 year ago

An updated version is currently pushed into a branch. It's still not 100% complete, but seems to be producing much better results. Changes are:

Updated token table
(Hopefully) fixed variable name decoding
Detection of end of string literal
Stopped including the 2 checksum bytes after each file section, which were causing additional corruption
Some refactoring of Amstrad BASIC decoding

test with a bit more data
do a bit more research into the BASIC format to identify any remaining changes needed

mgesteiro commented 1 year ago

Thanks for the effort: I'll be monitoring the progression 👌

This is the only guide I could find about Amstrad CPC detokenisation: https://www.cpcwiki.eu/index.php/Technical_information_about_Locomotive_BASIC#Structure_of_a_BASIC_program

Which one(s) are you using?

Best regards

mgesteiro commented 1 year ago

Hi @eightbitjim,

I tried today the new branch (47_erroneous_amstrad_cpc_decoding_candidate) and I get a lot of garbage on the output and a completely invalid result, much worse than before: am I missing something here?

10 REM CIFRADO CESAR
20 CLS
30 $
8943 "
40 INPUT "TEXTO: ",$
4608 
3 
7423 ($
41 10
8350 
4079  TO LEN($
41 &4600
49920 =ASC(MID$($
3372 
3884 ))
80 IF 
6638 @ AND 
6641 [ THEN 
10479 ((
6645 A)+13)MOD 26)+65 ELSE 
3567 
5888 
3 
3 
1023 (
41 
8368 
2816 
8383 $
2560 
8352 30

My test code, just in case:

byte[] data = HexFormat.of().parseHex("14000A00C5204349465241444F20434553415200060014008A000C001E00030000D2EF22220015002800A32022544558544F3A20222C030000D40012003200030000D4EFFF1C28030000D4290018003C009E200D0000E9EF0F20EC20FF0E28030000D429001C0046000D0000C3EFFF0128AC28030000D42C0D0000E92C0F29290044005000A1200D0000C3EE194020FA200D0000C3F1195B20EB200D0000CEEF2828280D0000C3F5194129F4190D29FB20191A29F41941200197200D0000CEEF0D0000C30017005A00030000D2EF030000D2F4FF03280D0000CE29000B006400B0200D0000E9000B006E00BF20030000D2000A007800A0201E1E00");
AmstradBasic ambasic = new AmstradBasic(data, "default");
System.out.println(ambasic);

Best regards

eightbitjim commented 1 year ago

Hi @mgesteiro

Oh no, that's much worse than before! I reckon I had your caesar cipher program displaying apparently correctly based on the earlier data, but I was feeding the bytes into the code in a different way to your example (thanks for sharing the test code). Leave it with me, I'll take another look.

I can't remember exactly what I used as my BASIC reference source originally (it was several years ago), although according to my notes I was using these two books:

The Amstrad CPC-464 Advanced User Guide, by Mark Harrison Amstrad CPCs Advanced Users Guide Book 1, by Daniel Martin

The wiki link is useful, I'll look at that too. I never had an Amstrad, so am only recently starting to learn about them! :-)

eightbitjim commented 1 year ago

Just. quick question if that's ok, about the sequence of bytes you've provided as input. Did you get this from this latest version of the program or the earlier one, or from another source?

One of the changes that I made was that I was (erroneously) including in the output data stream the checksum bytes that periodically occur. So this was causing corruption when they hit bas BASIC detokeniser as it isn't expecting to see these details. I've removed the checksum bytes now at the earlier stage where it parses the file block format (although I still need to implement verification against the checksums). When I tested it, I started again from WAV files and got decent results.

I'll step through tomorrow and see exactly what it's doing with this data. If you have the original audio files then it might be worth feeding them through again to re-generate the binary data without the checksums, but no need as I'll manually remove them from the data and see if that improves the decoding.

Hope that makes sense, and thanks for your help on this!

mgesteiro commented 1 year ago

hi,

about the sequence of bytes you've provided as input. Did you get this from this latest version of the program or the earlier one, or from another source?

The hex data I provided was directly from an audio tape that I recorded (in real hardware), processed and extracted myself: it's the pure and clean data payload of the 2nd part of the block (1st part = header, 2nd part = data), without any previous block type identifier or posterior checksum and trailer. It's just the basic program, clean.

I can provide the wav file if you want, but your tools fail to process it correctly: I planned to ask you about this also... later 😅 (and because I have that part already covered myself...). I just needed a detokeniser and yours seemed a good option (although being in JAVA)

Regards!

eightbitjim commented 1 year ago

Ah, I understand, thanks for info. I'll look at exactly where it's going wrong with the detokenisation of your file, although it might not be until the weekend I can start to make further updates.

Great to hear that you've written the earlier decoding bits. Regards :-)

eightbitjim commented 1 year ago

Hi,

I'm currently not able to reproduce the behaviour you're seeing. I tried cloning a new branch from the repo (branch origin/47_erroneous_amstrad_cpc_decoding_candidate) and then added a test file alongside the Basic decoding classes in Platforms/Amstrad/Format. I pasted in your example code.

I needed to move to JDK 17, as the code is quite old and compiled for JDK8, which doesn't seem to have HexFormat. But once on JDK 17, this is what I get when I run the test class:

% java -classpath ./target/classes com.eightbitjim.cassettenibbler.Platforms.Amstrad.Formats.TestFormat

10 REM CIFRADO CESAR
20 CLS
30 R$=""
40 INPUT "TEXTO: ",T$
50 T$=UPPER$(T$)
60 FOR i=1 TO LEN(T$)
70 C=ASC(MID$(T$,i,1))
80 IF C>64 AND C<91 THEN N=(((C-65)+13)MOD 26)+65 ELSE N=C
90 R$=R$+CHR$(N)
100 NEXT i
110 PRINT R$
120 GOTO 30

Here's the test class:

package com.eightbitjim.cassettenibbler.Platforms.Amstrad.Formats;

import java.util.HexFormat;

public class TestFormat {
    public static final void main(String args[]) {
        byte [] data = HexFormat.of().parseHex("14000A00C5204349465241444F20434553415200060014008A000C001E00030000D2EF22220015002800A32022544558544F3A20222C030000D40012003200030000D4EFFF1C28030000D4290018003C009E200D0000E9EF0F20EC20FF0E28030000D429001C0046000D0000C3EFFF0128AC28030000D42C0D0000E92C0F29290044005000A1200D0000C3EE194020FA200D0000C3F1195B20EB200D0000CEEF2828280D0000C3F5194129F4190D29FB20191A29F41941200197200D0000CEEF0D0000C30017005A00030000D2EF030000D2F4FF03280D0000CE29000B006400B0200D0000E9000B006E00BF20030000D2000A007800A0201E1E00");
        AmstradBasic ambasic = new AmstradBasic(data, "default");
        System.out.println(ambasic);
    }    
}

So I'm not sure why we're seeing something different. I'll think further about it. I'm using OpenJDK 17.

Regards!

mgesteiro commented 1 year ago

Hi @eightbitjim,

Sorry, I messed up: my IDE (Eclipse), was giving me some warnings about The value of the local variable xxx is not used and I corrected them (without paying enough attention) before even running the code once... I tried again without touching anything and it worked.

After a more calm revision I discovered that those variables were populated by calling some methods that read from the stream: when I commented out the lines I changed by accident the logic because although the return value (and the variables) were NOT used, the methods needed to be called for that consumption to take place... 🤦‍♂️

Here is the diff for my revised version of the class in question in case you want to check the modifications:

--- /cassette-nibbler-47_erroneous_amstrad_cpc_decoding_candidate/src/main/java/com/eightbitjim/cassettenibbler/Platforms/Amstrad/Formats/Line.java
+++ /NibblerByteToBasic/src/Line.java
@@ -25,7 +25,7 @@
     private InputStream inputStream;
     private StringBuilder builder;
     private boolean endOfLine = false;
-    private boolean insideRemStatement = false;
+    //private boolean insideRemStatement = false;
     private boolean insideQuote = false;

     private static final int END_OF_LINE_MARKER = 0;
@@ -95,7 +95,7 @@
                 // Don't output a colon until we see the next token, as we don't want
                 // to display colons in front of an ELSE
                 // But this will terminate a REM statement
-                insideRemStatement = false;
+                //insideRemStatement = false;
                 return;
             case 2:
             case 3:
@@ -157,7 +157,7 @@
                 processApostrophe();
                 return;
             case 197: // REM statement
-                insideRemStatement = true;
+                //insideRemStatement = true;
                 break;
             case 245:
                 processMinus();
@@ -234,13 +234,15 @@
     }

     private void processRSXCommand() throws IOException {
-        int valueOffsetToIgnore = getByteValue();
+        //int valueOffsetToIgnore = 
+        getByteValue();
         String commandName = getStringFromInput();
         builder.append("|").append(commandName);
     }

     private void processVariableWithTokenNoSuffix() throws IOException {
-        int offsetToIgnore = getTwoByteValue();
+        //int offsetToIgnore = 
+        getTwoByteValue();
         String variableName = getStringFromInput();
         builder.append(variableName);
     }

Thanks for your patience!

eightbitjim commented 1 year ago

Hi,

No problem, I'm really glad it's working now! I'll add some comments to explain what's going on (and probably just get rid of the unused variables), and also have a look at the insideRemStatement variable, as I must have put it there for a reason but somehow didn't use it.

Once I've gone through a bit more test data, I'll merge it with the main branch.

mgesteiro commented 1 year ago

Thanks for all the effort!

eightbitjim / cassette-nibbler

Erroneous Amstrad CPC BASIC programs conversion #47