Some questions regarding porting Chez Scheme to RISC-V

I'm trying to port CS to RISC-V. So far I've finished most of the primitives; here are some problems I encountered:

(asm-enter) is simply (values) in arm32.ss and ppc32.ss, but in x86.ss and x86_64.ss it adjusts %sp. Why the difference?
Comparison after locks: I noticed that after (%inline locked-decr!) and (%inline locked-incr!) there is info-condition-code, and (%inline lock!) is wrapped in an if-expression, so I have to set the condition code in these three primitives' assembler section? But RISC-V does not have condition flags, so currently I'm using one specific register for condition flags, and write the result of all carry/ovfl and comparisons in this register, and conditional branches are all based on this register. Don't know if it's OK?

What does asm-kill do? It's just

(define asm-kill
(lambda (code* dest)
  code*))

Do I have to worry about the length of the data to be stored when src is `reg in asm-move?
The pause primitive uses pause instruction on x86, isync on ppc, nothing on arm. On RISC-V fence and fence.i seem to satisfy this?
Since RISC-V does not have so many addressing modes, is it OK to simply use ur and some immediate in define-instruction and rely on coerce-opnd to handle all other types of input?
Regarding addresses, I found that the backends either use a register or immediate field, but not both, don't know if it's always so?
Regarding ffi, in x86_64.ss, (asm-foreign-call), we have fill-result-here? as the output of (result-fits-in-registers?), which works on result type. But when it's #t, the 1st argument is stored on stack, and from the context I suppose it's a pointer, because in (add-fill-result) after the c-call finishes, the return value is stored into memory addressed by the pointer. So now the 2nd argument is put into the 1st argument register. How is the C function going to deal with this? In (asm-foreign-callable), fill-result-here? becomes synthesize-first?, the behavior is similar. I don't quite understand the logic behind this.

For background on x86 and x86_64 support vs. arm32 and ppc32: around version 9 of Chez Scheme (plus version 8 starting from 8.9.5, if memory serves), it was rewritten to use the nanopass framework, and the authors at the time decided that x86 and x86_64 support would cover most use cases. With that decision, the first backends using the new compiler are somewhat opinionated toward x86_64. The arm32 backend wasn't written until a few years later around 2014, and even though ARM has its own quirks, the backend retained a number of the x86_64 opinions where they weren't too troublesome.

I haven't worked with RISC-V, so I don't know best practices for some of your questions, but I know the arm32 and ppc32 backends pretty well, so hopefully I can provide some justifications for why they are the way they are.

(asm-enter) is simply (values) in arm32.ss and ppc32.ss, but in x86.ss and x86_64.ss it adjusts %sp. Why the difference?

This appears to be an artefact of x86/x86_64 machine constraints around stack alignment. It was likely simpler to essentially stub the procedure for the RISC platforms than lift the conditional into cpnanopass.ss.

Comparison after locks: I noticed that after (%inline locked-decr!) and (%inline locked-incr!) there is info-condition-code, and (%inline lock!) is wrapped in an if-expression, so I have to set the condition code in these three primitives' assembler section? But RISC-V does not have condition flags, so currently I'm using one specific register for condition flags, and write the result of all carry/ovfl and comparisons in this register, and conditional branches are all based on this register. Don't know if it's OK?

I think there's more than one question here, so hopefully I can address them. For the %inline expressions you're referencing, it seems that those happen in a pass of the compiler prior to instruction selection, so the if that you see in reference to those primitives is not the same type as the if that the RISC-V backend needs to handle.

For the condition flags, I think you can accomplish creating a RISC-V backend with a reserved register for condition codes, but I think it's probably unnecessary, and likely inadvisable. Again, I'm not experienced with RISC-V, so I'm speaking from the Chez Scheme end. Reserving a register for condition codes makes it unavailable to the register allocator, and will likely lead to less optimal code.

Based on my 5 minutes of Googling RISC-V conditional branches, it looks like you'll need to formulate conditional code in terms of one of the conditional jump instructions. Throughout the backend files, you'll see calls to make-tmp--this will essentially reserve a register for the scope of that instruction. We call such reserved registers "unspillable," which refers to how the graph coloring register allocation algorithm works: that location can't be moved to the stack (or spilled) within the context of that instruction (hence the u prefix for the let bindings of those variables). I would recommend using make-tmp for instructions that use condition codes, which will likely allow the register allocator to make better choices.
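To make the "formulate it as a conditional jump" idea a bit more concrete, here is a minimal C sketch (not Chez Scheme backend code) of how a comparison could be mapped straight onto a RISC-V B-type branch without a flags register. The funct3 values are the ones from the base ISA (beq=0, bne=1, blt=4, bge=5, bltu=6, bgeu=7); the names cond_t, branch_t, select_branch, invert_branch, and branch_taken are made up for this illustration, and how the operands actually reach the branch inside the backend is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical condition names for this sketch; not Chez Scheme identifiers. */
typedef enum { C_EQ, C_NE, C_LT, C_GE, C_LTU, C_GEU,
               C_GT, C_LE, C_GTU, C_LEU } cond_t;

typedef struct {
    uint8_t funct3; /* B-type funct3: beq=0 bne=1 blt=4 bge=5 bltu=6 bgeu=7 */
    bool swap;      /* true: emit the branch with rs1 and rs2 exchanged */
} branch_t;

/* Pick a branch for a comparison. RISC-V has no bgt/ble encodings,
   so those are expressed by swapping the two source registers. */
static branch_t select_branch(cond_t c) {
    switch (c) {
    case C_EQ:  return (branch_t){0, false};
    case C_NE:  return (branch_t){1, false};
    case C_LT:  return (branch_t){4, false};
    case C_GE:  return (branch_t){5, false};
    case C_LTU: return (branch_t){6, false};
    case C_GEU: return (branch_t){7, false};
    case C_GT:  return (branch_t){4, true};  /* a > b  is  b < a  */
    case C_LE:  return (branch_t){5, true};  /* a <= b is  b >= a */
    case C_GTU: return (branch_t){6, true};
    case C_LEU: return (branch_t){7, true};
    }
    assert(!"unknown condition");
    return (branch_t){0, false};
}

/* Each branch and its negation differ only in the low bit of funct3. */
static branch_t invert_branch(branch_t b) {
    b.funct3 ^= 1;
    return b;
}

/* Reference semantics of the selected branch, used only for checking. */
static bool branch_taken(branch_t b, int64_t rs1, int64_t rs2) {
    if (b.swap) { int64_t t = rs1; rs1 = rs2; rs2 = t; }
    switch (b.funct3) {
    case 0: return rs1 == rs2;
    case 1: return rs1 != rs2;
    case 4: return rs1 < rs2;
    case 5: return rs1 >= rs2;
    case 6: return (uint64_t)rs1 < (uint64_t)rs2;
    case 7: return (uint64_t)rs1 >= (uint64_t)rs2;
    }
    assert(!"reserved funct3");
    return false;
}
```

For example, select_branch(C_GT) yields blt with the operands swapped, and invert_branch turns it into bge with the same swap. Carry and overflow checks have no direct branch form, so those would presumably still need a temporary to hold an intermediate result.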

What does asm-kill do? It's just
  (define asm-kill
    (lambda (code* dest)
      code*))

This again has to do with the graph coloring register allocator. I started to put together a fuller explanation, but for now, I'll have to leave the explanation at this: the compiler uses asm-kill in places where a register is going to be used in a library call such that it wouldn't be detected in the analysis that determines which registers are available for assignment.

Do I have to worry about the length of the data to be stored when src is `reg in asm-move?

I'm not sure I understand this question, mostly because I'm not sure if what you mean when you say "length" is what I would call "width." If that's the case, then generally speaking, the compiler should take care of fitting things into the width of the platform's registers based on the <machine-type>.def file.

The pause primitive uses pause instruction on x86, isync on ppc, nothing on arm. On RISC-V fence and fence.i seem to satisfy this?

The arm32 backend was written targeting primarily ARMv7 platforms (ARMv8 was in draft, IIRC), and at the time, we didn't find a useful instruction to use for pause. ARMv8 has a yield instruction that seems to fit the bill, but basically anything that's a hint to the processor to say "the current thread needs to wait and likely isn't going to do anything useful" should work. Or else do nothing, a la arm32. I spent a minute or so looking at fence for RISC-V, and it probably works, but I did see some wording that possibly implied some synchronization that may be unnecessary.

Since RISC-V does not have so many addressing modes, is it OK to simply use ur and some immediate in define-instruction and rely on coerce-opnd to handle all other types of input?

Like with the case of reserving a register for condition codes, this would probably work, but will result in less optimal code. That is in fact a strategy I've used for bootstrapping on new systems just to get things running, but I add smarter cases later. Chez Scheme has traditionally not used a peephole optimizer, as I understand it in part because of the way define-instruction allows for specifying special cases that help generate better code sequences ("better" meaning shorter or faster).

Regarding addresses, I found that the backends either use a register or immediate field, but not both, don't know if it's always so?

I'm unsure about this one. My memory is that ARMv6 and ARMv7 (possibly ARMv8) support only one or the other, and not both, so that's why the backend for it is written that way. I don't remember if or why that's the case for ppc32, but it's possible that nobody removed the restriction from arm32, plus it seemed to be that way in the x86 backends, anyway. I don't know why the x86 platforms don't use mixed addressing modes, though.

Regarding ffi, in x86_64.ss, (asm-foreign-call), we have fill-result-here? as the output of (result-fits-in-registers?), which works on result type. But when it's #t, the 1st argument is stored on stack, and from the context I suppose it's a pointer, because in (add-fill-result) after the c-call finishes, the return value is stored into memory addressed by the pointer. So now the 2nd argument is put into the 1st argument register. How is the C function going to deal with this? In (asm-foreign-callable), fill-result-here? becomes synthesize-first?, the behavior is similar. I don't quite understand the logic behind this.

I spent quite a bit of time staring at this, too, but I figured it out. To start, this is explained in the documentation for the (& ftype-name) return type for foreign procedures in section 4.2 of the user's guide:

(& ftype-name): The result is interpreted as a foreign object whose structure is described by the ftype identified by ftype-name, where the foreign procedure returns a ftype-name result, but the caller must provide an extra (* ftype-name) argument before all other arguments to receive the result. An unspecified Scheme object is returned when the foreign procedure is called, since the result is instead written into storage referenced by the extra argument. The ftype-name cannot refer to an array type.

In asm-foreign-call, and out of the context of an actual foreign call, this looks like the compiler munging the arguments to the C function. In fact, the C function isn't expecting that first argument at all--it's for the Scheme runtime's C code to use for returning an ftype object from C back to Scheme. There are some other hints and references to this extra first argument in the definition of $make-foreign-procedure in syntax.ss.
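As a rough illustration of that description, here is a small C sketch of the shape of the call. It is only a model: point_t, make_point, and call_make_point are hypothetical names, and in Chez Scheme the "wrapper" part is generated machine code rather than a C function. The point is that the foreign procedure itself takes only its own arguments, while the caller's side holds on to the extra destination pointer and stores the result through it afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ftype for this sketch: a small struct returned by value. */
typedef struct { long x; long y; } point_t;

/* The foreign procedure as its C author wrote it:
   two arguments, struct result, no destination pointer. */
static point_t make_point(long x, long y) {
    point_t p = { x, y };
    return p;
}

/* Model of the caller's side of a (& point_t) call: the extra first
   argument is set aside, the remaining arguments are passed on in the
   normal argument positions, and the result is written through the
   pointer once the call returns (the "fill result" step). */
static void call_make_point(point_t *dest, long x, long y) {
    assert(dest != NULL);
    point_t result = make_point(x, y); /* make_point never sees dest */
    *dest = result;
}
```

Under this reading, shifting the second Scheme argument into the first argument register is what you would expect: from the C function's point of view it is the first argument.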

I hope that helps answer most of your questions, except where my memory has failed or justifications have been lost to time.

Thanks for your answers. I have now succeeded in compiling the compiler, though FFI is not working (I copied it from x86_64.ss and the ABI logic still needs changes), and some errors occur when the cross-compiler tries to compile the files in examples/. I list the files and the errors they produce below:

most common:
Exception in car: () is not a pair

fact.ss fatfib.ss fft.ss power.ss
Exception: failed assertion (null? unspillable*) at line 15413, char 32 of cpnanopass.ss

edit.ss unify.ss
Exception in bitwise-arithmetic-shift-left: #f is not an exact integer
freq.ss

Exception in car: riscv64-call is not a pair
queue.ss ez-grammar-test.ss
Exception in compiler-internal: find-home!: spilled unspillable #{ura g57a89gqrhvjqoqm0f1zac3s4-1}

This is quite tricky, I wonder if you guys have encountered these before when porting Chez Scheme?

Update: I inserted a bunch of (printf)s in the backend and in cpnanopass.ss; the result is that the error occurs after the select-instruction! pass. Still looking for errors in the instruction definitions...

For the <foo> is not a pair and #f is not an exact integer errors, those can occur when the ABI is incorrect, which you said is likely since it's copied from x86_64.ss.

For the failed (null? unspillable*) assertion and Exception in compiler-internal, that indicates the register allocator is overly constrained. That can occur when there aren't enough registers listed in section 1 of the <backend>.ss file, or too many calls to make-tmp in section 2. It looks like RISC-V has 31 general purpose registers, which should be more than enough. I believe I've seen this happen before, but unfortunately I don't remember the specific cause or solution.

Thanks, now the instruction selection and register allocation can be done, but another problem occurs in the c-faslobj procedure:

Exception in c-faslcode: wrote 232 bytes, expected 216 bytes

At first I thought the error was due to the 'quad I set in asm-rp-header, so I changed it to 'long, but the problem still exists, though the number of bytes written became a little smaller.

The asm-size procedure always outputs 4, except when the input is riscv64-{abs, jump, call}, just like in arm32.ss, and the emit-code procedure always constructs pairs with 'long in the car field.

What else can give rise to the extra bytes?

Unfortunately, the only advice I can come up with is to try to get a trace of what happens in the (let prf0 ...) loop in compile.ss, using either the debugger or prints. I don't have specific evidence for it, but I suspect the value for ptr-bits (defined in <machine-type>.def) might not match your chip's pointer width. If that's true, it could account for the extra bytes. However, I would be a little surprised if you didn't have other errors earlier than this one.

Indeed, this has something to do with ptr-bits, though its value is right. The problem was in asm-size and asm-rp-header. In the former, quad, abs, and code-top-link were not considered and were sized as 4 bytes when they should be 8; in the latter, I used long instead of quad in the output pair. Therefore, in compile.ss more bytes were written than expected.
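For anyone hitting the same "wrote N bytes, expected M bytes" message, the mismatch can be modelled with a tiny C sketch. This is only an illustration of the bookkeeping, not the code in compile.ss: item_kind, emitted_bytes, predicted_bytes_buggy, predicted_bytes_fixed, and total are hypothetical names, with ITEM_LONG and ITEM_QUAD standing in for the 'long and 'quad tags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the 'long and 'quad item tags. */
typedef enum { ITEM_LONG, ITEM_QUAD } item_kind;

/* Bytes actually written for an item when it is emitted. */
static size_t emitted_bytes(item_kind k) {
    return k == ITEM_QUAD ? 8 : 4;
}

/* A size predictor with the bug described above: every item counts as 4. */
static size_t predicted_bytes_buggy(item_kind k) {
    (void)k;
    return 4;
}

/* The fix: the predictor must agree with the emitter for every kind. */
static size_t predicted_bytes_fixed(item_kind k) {
    return emitted_bytes(k);
}

/* Sum a per-item size function over a code sequence. */
static size_t total(const item_kind *items, size_t n,
                    size_t (*size_of)(item_kind)) {
    size_t sum = 0;
    assert(items != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += size_of(items[i]);
    return sum;
}
```

With a sequence of two longs and two quads, the emitter writes 24 bytes while the buggy predictor expects 16, which is the same kind of gap as in the error message; every quad-sized item undercounted as 4 adds 4 bytes to the difference.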

Now it seems the files in examples/ can all be compiled by running make boot XM=trv64le, but make still exits with error:

(time (for-each compile-file ...))
    30 collections
    0.666741002s elapsed cpu time, including 0.117468131s collecting
    0.671828592s elapsed real time, including 0.118290646s collecting
    258236432 bytes allocated, including 246627728 bytes reclaimed
> 
make[2]: *** [Mf-cross:37: xboot] Error 2
make[1]: *** [Mf-boot:22: trv64le.boot] Error 2
make: *** [Makefile:50: boot] Error 2

I searched for errors earlier in the output and found:

Exception in compile-file: compiler for trv64le is not loaded
make[3]: *** [Mf-base:552: bootall] Error 255
make[3]: *** Waiting for unfinished jobs....

saying that the compiler for trv64le is not loaded. But I have set up both Mf-trv64le and Mf-rv64le, for the threaded and unthreaded little-endian RISC-V platforms, and files such as rv64le.def and trv64le.def also look right.

Another question: when running make boot XM=rv64le without the "t", the error becomes

Exception in compiler-internal: find-home!: spilled unspillable #{ura4 bzlsondw1agi216ffb3fyayk4-0}

After some debugging, I found that the error is in asmlibcall. %ra is declared as allocable; below is the code for asmlibcall:

(define-instruction value (asmlibcall)
    [(op (z ur))
     (let ([u (make-precolored-unspillable 'ura4 %ra)])
       (if (info-asmlib-save-ra? info)
           (seq
            `(set! ,(make-live-info) ,u (asm ,null-info ,asm-kill))
            `(set! ,(make-live-info) ,z (asm ,info ,(asm-library-call (info-asmlib-libspec info) #t) ,u ,(info-kill*-live*-live* info) ...)))
           (seq
            `(set! ,(make-live-info) ,u (asm ,null-info ,asm-kill))
            `(set! ,(make-live-info) ,z (asm ,info ,(asm-library-call (info-asmlib-libspec info) #f) ,u ,(info-kill*-live*-live* info) ...)))))])

Good, now the unthreaded version can be compiled. Almost all silly errors were due to incorrect uses of make-tmp, without asm-kill.

However, the threaded version cannot be compiled. Every time the make process gets to compiling examples/, it exits with this error:

Exception in compile-file: compiler for trv64le is not loaded

But the unthreaded version can compile all of them. As far as I know, the only difference in the backend is that get-tc, {activate,deactivate,unactivate}-thread are used in FFI. Though the FFI code was copied from x86_64.ss, I made some changes to make sure the registers used are all RISC-V version and there are no errors in unthreaded version.

Any ideas?

Are you building trv64le in the same workarea as rv64le? I haven't built any threaded version in a while, but I believe that they're typically built separately. There may be some values in Makefiles or definition files that are causing a problem. I would suggest creating a new workarea for trv64le as a separate machine type and copying over files from rv64le piecemeal. The risc-v.ss (or whatever you've named the RISC-V backend file) should be the same, but you might need changes in other files.

Well the error was in the machine description: I forgot to change machine-type in trv64le.def, which was the same as in rv64le.def, so it says compiler for trv64le is not loaded.

Now both versions of the compiler can be compiled. I moved the project into a RISC-V virtual machine running on QEMU; the C runtime compiles there, but loading the boot file segfaults. I debugged the boot file loading process and found that the error occurs in the following way:

During the first call in scheme.c: Sbuild_heap() to load(), after S_G.error_invoke_code_object, S_G.invoke_code_object and S_G.base_rtd obtained their value, there is a while loop that reads objects in the boot file.

Now, on the 3rd iteration (i=3), the predicate Sprocedurep(x) becomes true, and then boot_call() -> S_call_help() -> S_generic_invoke(). Instructions in S_generic_invoke():

=> 0x2aaab11640 <S_generic_invoke+24>:  ld      a5,-32(s0)
   0x2aaab11644 <S_generic_invoke+28>:  addi    a5,a5,65
   0x2aaab11648 <S_generic_invoke+32>:  ld      a0,-24(s0)
   0x2aaab1164c <S_generic_invoke+36>:  jalr    a5

The jalr jumps to S_G.invoke_code_object, whose RISC-V assembly is:

=> 0x3ff7c10160:        add     s3,zero,a0
   0x3ff7c10164:        ld      a7,56(s3)
   0x3ff7c10168:        ld      a6,48(s3)
   0x3ff7c1016c:        ld      a5,40(s3)
   0x3ff7c10170:        ld      a4,32(s3)
   0x3ff7c10174:        ld      a3,24(s3)
   0x3ff7c10178:        ld      a2,16(s3)
   0x3ff7c1017c:        ld      a1,8(s3)
   0x3ff7c10180:        ld      a0,0(s3)
   0x3ff7c10184:        ld      s2,160(s3)
   0x3ff7c10188:        ld      t0,136(s3)
   0x3ff7c1018c:        ld      tp,176(s3)
   0x3ff7c10190:        ld      t3,152(s3)
   0x3ff7c10194:        addi    t3,t3,8
   0x3ff7c10198:        li      t5,8
   0x3ff7c1019c:        sub     t3,t3,t5
   0x3ff7c101a0:        auipc   t5,0x0
   0x3ff7c101a4:        addi    t5,t5,48
   0x3ff7c101a8:        sd      t5,0(t3)
   0x3ff7c101ac:        jr      3(s2)
   0x3ff7c101b0:        0x8
   0x3ff7c101b2:        unimp
   0x3ff7c101b4:        unimp
   0x3ff7c101b6:        unimp

where the lds are, if I'm right, the code for restoring Scheme state. The following auipc and addi are for asm-return-address. Register s2 contains the pointer to the closure obtained back in scheme.c: load() (the x in Sprocedurep(x) above). Then, after jr 3(s2) takes the address of the code from the closure, it jumps to:

=> 0x3ff7c65190:        addi    a2,sp,320
   0x3ff7c65192:        bnez    a5,0x3ff7c6511a

which is garbage.

As a comparison, I ran the x86_64 boot file under GDB. Still in the 3rd iteration of the while loop in scheme.c: load(), control passes to boot_call() -> S_call_help() -> S_generic_invoke(), and the asm when control transfers to Scheme is:

=> 0x40008150:  sub    $0x8,%rsp
   0x40008154:  mov    %rdi,%r14
   0x40008157:  mov    0x10(%r14),%rsi
   0x4000815b:  mov    0x8(%r14),%rdi
   0x4000815f:  mov    (%r14),%r8
   0x40008162:  mov    0x40(%r14),%r15
   0x40008166:  mov    0x28(%r14),%rbp
   0x4000816a:  mov    0x50(%r14),%r9
   0x4000816e:  mov    0x38(%r14),%r13
   0x40008172:  add    $0x8,%r13
   0x40008176:  sub    $0x8,%r13
   0x4000817a:  lea    0x28(%rip),%rcx        # 0x400081a9
   0x40008181:  mov    %rcx,0x0(%r13)
   0x40008185:  jmp    *0x3(%r15)

almost the same. But after the last jmp, the asm is meaningful:

=> 0x40008230:  mov    $0x3e,%rbp
   0x40008237:  jmp    *0x0(%r13)

It just jumps to the addr stored in %sfp. After the jump:

=> 0x400081a9:  mov    $0x1,%r10
   0x400081b0:  mov    %r10,0x30(%r14)
   0x400081b4:  mov    %rbp,0x28(%r14)
   0x400081b8:  mov    %r9,0x50(%r14)
   0x400081bc:  mov    %r13,0x38(%r14)
   0x400081c0:  add    $0x8,%rsp
   0x400081c4:  movabs $0x5555555b6f1f,%rax
   0x400081ce:  jmp    *%rax

the absolute addr is that of S_return(), so control returns back to C. Thus the process is like:

1. S_generic_invoke() ->
2.     code for invoke_code_object, 1st jump->
3.         another jump(to called Scheme function?)->
4.             back to invoke_code_object + jump back to S_return()

So something is wrong in step 2, in the content of register s2. But this value comes from S_boot_read(). I checked my Scheme backend, and nothing there generates the garbage code above (it is disassembled as 2-byte compressed instructions, which I didn't implement; normal instructions are 4 bytes). Even the address stored in %sfp[0] is right: it contains the address of step 4... Help wanted😅 @akeep @cjfrisz

If I'm understanding this correctly, this is the first call into Scheme code from the C runtime. My guess is that you're either jumping to the address of the start of the closure record instead of the code itself (i.e., not adding closure-code-disp to the address) or there's otherwise a bad offset calculation.

That's just off the top of my head, so take that with a grain of salt. I hope that it turns out to be that simple. 😅
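One detail that may be relevant here: on RISC-V, jr 3(s2) is jalr with rd = zero, and jalr computes its target as register plus immediate (with the low bit cleared) without reading memory, whereas the x86_64 jmp *0x3(%r15) loads the target from memory at %r15+3. The nonprocedure-code listing further down uses the two-step form (ld t5,3(s4) followed by jr t5). The C sketch below models the difference under some loud assumptions: a 64-bit host, the tag value 5 and displacement 3 taken from the disassembly in this thread, and a hypothetical record layout (closure_rec, tag_closure, and both target_* functions are made-up names).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values as they appear in the disassembly in this thread. */
enum { TYPE_CLOSURE = 5, CLOSURE_CODE_DISP = 3 };

/* Hypothetical record for this sketch: the first word is the code address. */
typedef struct { uintptr_t code; uintptr_t free_vars[1]; } closure_rec;

/* Assumption: the tagged pointer is chosen so that
   tagged + CLOSURE_CODE_DISP is the address of the code field. */
static uintptr_t tag_closure(const closure_rec *rec) {
    uintptr_t addr = (uintptr_t)rec;
    assert((addr & 7) == 0);          /* record must be 8-byte aligned */
    return addr - 8 + TYPE_CLOSURE;
}

/* Two-step form: load the word at tagged + disp, then jump to that value
   (ld t5,3(s4); jr t5  or  jmp *0x3(%r15) on x86_64). */
static uintptr_t target_load_then_jump(uintptr_t tagged) {
    uintptr_t code;
    memcpy(&code, (const void *)(tagged + CLOSURE_CODE_DISP), sizeof code);
    return code;
}

/* One-step form: jalr zero, 3(s2) jumps to s2 + 3 with bit 0 cleared.
   No load happens, so execution lands inside the record itself. */
static uintptr_t target_jalr_direct(uintptr_t tagged) {
    return (tagged + CLOSURE_CODE_DISP) & ~(uintptr_t)1;
}
```

In this model the one-step form "executes" the bytes of the closure record, which would be consistent with the disassembler showing what looks like stray compressed instructions. Whether that is what actually happens in this port is a guess.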

Firstly, I see in other backends that the implementation of asm-conditional-jump uses a big macro to generate code that dispatches on the different comparisons. However, RISC-V has no condition flags, so I chose a specific register (say %cond) for this, and in the assemblers of <, u<, eq?, logtest, fl<, etc., I manually set %cond to 1 if the condition is true and 0 otherwise. For example:

  (define-instruction pred (logtest log!test)
    [(op (x ur) (y ur)) 
     (values '() `(asm ,info-cc-eq ,(asm-logtest (eq? op 'log!test) info-cc-eq) ,x ,y))])

and asm-logtest:

  (define asm-logtest
    (lambda (i? info)
      (lambda (l1 l2 offset x y)
        (Trivit (x y)
                (values
                 (emit and `(reg . ,%cond) x y
                       (emit sltiu `(reg . ,%cond) `(reg . ,%cond) 1 ;; set less than immediate unsigned; if the last operand is 1, cond is set to 1 iff cond is 0
                             (emit xori `(reg . ,%cond) `(reg . ,%cond) 1 '())))
                 (let-values ([(l1 l2) (if i? (values l2 l1) (values l1 l2))])
                   (asm-conditional-jump info l2 l1 offset)))))))
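As a sanity check on the three emitted instructions, their effect on %cond can be modelled in C. This is only a model of the instruction semantics on 64-bit values; the rv_* helpers and the logtest_cond names are made up for the sketch. It also shows that a single sltu with zero as the first source (the snez pseudo-instruction) gives the same 0/1 result as the sltiu/xori pair.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal models of the instructions involved (RV64 register values). */
static uint64_t rv_and(uint64_t a, uint64_t b)    { return a & b; }
static uint64_t rv_sltiu(uint64_t rs, uint64_t i) { return rs < i ? 1 : 0; }
static uint64_t rv_xori(uint64_t rs, uint64_t i)  { return rs ^ i; }
static uint64_t rv_sltu(uint64_t a, uint64_t b)   { return a < b ? 1 : 0; }

/* and; sltiu cond,cond,1; xori cond,cond,1  as emitted by asm-logtest. */
static uint64_t logtest_cond(uint64_t x, uint64_t y) {
    uint64_t cond = rv_and(x, y);
    cond = rv_sltiu(cond, 1);  /* 1 iff (x & y) == 0 */
    cond = rv_xori(cond, 1);   /* flip: 1 iff (x & y) != 0 */
    assert(cond == 0 || cond == 1);
    return cond;
}

/* Shorter alternative: and; sltu cond,zero,cond  (i.e. snez). */
static uint64_t logtest_cond_snez(uint64_t x, uint64_t y) {
    return rv_sltu(0, rv_and(x, y));
}
```

So %cond ends up as 1 exactly when x and y share a set bit, which matches the intended meaning of logtest.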

My assumption is the newly set %cond is immediately used by a following asm-conditional-jump. From the boot file compiled it seems my hypothesis is right:

0x3ff7c0c6ec       and     s11,s1,s11 # a logtest: (not (zero? (and s1 s11)))
0x3ff7c0c6f0        xor      t6,s11,s10
0x3ff7c0c6f4        seqz    t6,t6
0x3ff7c0c6f8        beqz    t6,0x3ff7c0c710

My understanding of the logic of big macro in asm-conditional-jump is this:

      ;;normally:
      ;; [cond jump l1]
      ;; l2:           # disp2 = 0
      ;;     ...
      ;; l1:
      ;;     ...

      ;;inverted:
      ;; [inverted cond jump l2]
      ;; l1:           # disp1 = 0
      ;;    ...
      ;; l2:
      ;;    ...

      ;; generally:
      ;; [cond jump l1]
      ;; [jmp l2]
      ;; [other instructions]
      ;; l2:
      ;;     ...
      ;; l1:
      ;;     ...

Since the conditional branch depends only on whether %cond is 1 or 0, I just use bne %cond, %zero, target for the normal case and beq %cond, %zero, target for the inverted case.
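
The three layouts described in the comment block above can be sketched as a small C function. This is only a model of that description, not the actual macro: the function name is hypothetical, disp1 and disp2 stand for the displacements to l1 and l2, and RISC-V branch range limits (conditional branches reach roughly ±4 KiB) are ignored.

```c
#include <stddef.h>

/* Hypothetical model: choose the instruction sequence for a conditional
   jump, given the displacements to l1 (taken when the condition holds)
   and l2 (taken otherwise). A displacement of 0 means the label follows
   the jump immediately, so control can simply fall through to it. */
const char *cond_jump_sequence(long disp1, long disp2) {
    if (disp2 == 0) {
        /* normal: l2 is the fall-through */
        return "bne %cond, zero, l1";
    }
    if (disp1 == 0) {
        /* inverted: l1 is the fall-through */
        return "beq %cond, zero, l2";
    }
    /* general: neither label follows directly */
    return "bne %cond, zero, l1\nj l2";
}
```

For example, cond_jump_sequence(24, 0) picks the single bne, while cond_jump_sequence(24, 48) needs the bne followed by an unconditional jump.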

Second, startup now halts while registering foreign entries. I put a printf() in S_foreign_entry(), and the output is:

foreign entry: (cs)sqrt
foreign entry: (cs)atan2
foreign entry: (cs)atan

Breakpoint 1, S_foreign_entry () at foreign.c:243
243     ptr tc = get_thread_context();
(gdb) c
Continuing.
foreign entry: (cs)sinh

Breakpoint 2, S_handle_nonprocedure_symbol () at schsig.c:437
437     ptr tc = get_thread_context();
(gdb) c
Continuing.
Call error: (1 3 $+ +)
[Inferior 1 (process 14653) exited with code 01]

... it seems to be calling +, but the value found there is not a closure, so the hand-coded nonprocedure-code called handle-nonprocedure-symbol. The code for nonprocedure-code is:

=> 0x3ff7c0c6e0:    ld  s1,5(t4) # 5 is symbol-value-disp
   0x3ff7c0c6e4:    li  s11,7    # load imm closure mask
   0x3ff7c0c6e8:    li  s10,5    # closure type
   0x3ff7c0c6ec:    and s11,s1,s11 # 3 instr for type-check
   0x3ff7c0c6f0:    xor t6,s11,s10  
   0x3ff7c0c6f4:    seqz    t6,t6
   0x3ff7c0c6f8:    beqz    t6,0x3ff7c0c710 # the inverted cond jump
   0x3ff7c0c6fc:    mv  s4,s1
   0x3ff7c0c700:    ld  s11,3(s1)
   0x3ff7c0c704:    sd  s11,13(t4)
   0x3ff7c0c708:    ld  t5,3(s4)
   0x3ff7c0c70c:    jr  t5
   0x3ff7c0c710:    sd  a7,56(s0)  # store Scheme states and jump to handle-nonprocedure-symbol
   0x3ff7c0c714:    sd  a6,48(s0)
...
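
To make the type check in this listing easier to reason about, here is a small C model of the and/xor/seqz/beqz instructions. It is a sketch under the assumption that the mask (7) and type tag (5) are exactly the values in the li comments above; the function name is invented for illustration.

```c
#include <stdint.h>

enum { CLOSURE_MASK = 7, CLOSURE_TYPE = 5 };  /* from the li comments above */

/* Hypothetical model of the check in nonprocedure-code.
   Returns 1 when the beqz is taken, i.e. when the symbol's value is NOT
   tagged as a closure and control goes to handle-nonprocedure-symbol. */
int takes_slow_path(uint64_t symbol_value) {
    uint64_t s11 = symbol_value & CLOSURE_MASK;  /* and  s11,s1,s11 */
    uint64_t t6  = s11 ^ CLOSURE_TYPE;           /* xor  t6,s11,s10 */
    t6 = (t6 == 0) ? 1 : 0;                      /* seqz t6,t6      */
    return t6 == 0;                              /* beqz t6,<slow>  */
}
```

Under this model the slow path is taken whenever the low three bits of the loaded value differ from 5, so inspecting s1 right after the ld (for instance with p/x $s1 in gdb) should show whether the symbol's value cell holds a closure pointer at all.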

Since it gets this far and jumps back and forth between C and Scheme, the linker seems to be working and asm-return-address seems to calculate the right address. What could cause the problem?

@cjfrisz BTW, I put a printf() in the linker function riscv64_set_abs(void* address, uptr item) to print the item being relocated. That gives me both the value being relocated and the address of the instructions being relocated. In the following code, stepping through with gdb, the lui, lui, addi, addi, slli, add sequence produces the right value in t4, and the final jr t5 jumps to the nonprocedure-code shown in my previous comment:

   0x3ff7c2e894:    lui t5,0x0        # 0x3ff7c9354b
   0x3ff7c2e898:    lui t4,0xf7c93
   0x3ff7c2e89c:    addi    t5,t5,64
   0x3ff7c2e8a0:    addi    t4,t4,1355
   0x3ff7c2e8a4:    slli    t5,t5,0x20
   0x3ff7c2e8a8:    add t4,t4,t5
   0x3ff7c2e8ac:    ld  s4,5(t4)   
   0x3ff7c2e8b0:    li  t3,3          # t3 is %ac0
   0x3ff7c2e8b4:    ld  t5,13(t4)  # 13 is symbol_pvalue_disp
   0x3ff7c2e8b8:    jr  t5
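
The way the first six instructions rebuild a 64-bit address can be checked with a short C model. This is a sketch, not the code in the linker: all names are hypothetical, and it only assumes the standard RV64 behavior that lui sign-extends its 32-bit result and addi sign-extends its 12-bit immediate.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of v to 64 bits. */
static int64_t sext(uint64_t v, int bits) {
    uint64_t m = 1ULL << (bits - 1);
    v &= (m << 1) - 1;
    return (int64_t)(v ^ m) - (int64_t)m;
}

/* Immediates for one lui/addi pair. */
typedef struct { uint32_t hi20; int32_t lo12; } half_t;

/* Pick hi20/lo12 so that lui+addi reproduces the low 32 bits of v. */
static half_t split32(uint64_t v) {
    half_t h;
    h.lo12 = (int32_t)sext(v, 12);
    h.hi20 = (uint32_t)(((v - (uint64_t)h.lo12) >> 12) & 0xfffff);
    return h;
}

/* Register value after `lui rd, hi20; addi rd, rd, lo12` on RV64. */
static uint64_t build(half_t h) {
    uint64_t lui = (uint64_t)sext((uint64_t)h.hi20 << 12, 32);
    return lui + (uint64_t)h.lo12;
}

/* Model of lui, lui, addi, addi, slli, add. The upper half compensates
   for the sign extension of the lower half. */
uint64_t materialize(uint64_t target, half_t *upper, half_t *lower) {
    *lower = split32(target);
    uint64_t t4 = build(*lower);            /* lui t4 ; addi t4 */
    *upper = split32((target - t4) >> 32);
    uint64_t t5 = build(*upper);            /* lui t5 ; addi t5 */
    t5 <<= 32;                              /* slli t5,t5,0x20  */
    return t4 + t5;                         /* add  t4,t4,t5    */
}
```

For the target 0x3ff7c9354b this model yields lui 0xf7c93 / addi 1355 for the lower half and lui 0x0 / addi 64 for the upper half, matching the listing: the upper half is 0x40 rather than 0x3f because the lower lui produces a negative, sign-extended value.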

I wonder how the right address can give me something so strange.

It's been a little while since I dug around in the linker, so I'd need to study that code before I'd have any deep insight.

The only thing that sticks out to me is this: did you set %cond as a reserved register in define-registers, or is it in the list of allocable registers? If it's allocable, then you'll need to use asm-kill in any define-instruction that uses it. If it's reserved, it probably needs to be treated similarly to %sfp and %ap. I think the latter route may be kind of nasty in cpnanopass.ss, so I'd personally allocate an unspillable for each instruction that uses it.

I think I'm taking a bit of a wild swing, but I think it's possible to get the kind of behavior you're seeing if %cond isn't getting saved and restored properly.

The greeting finally appears:

zachary@debian-rv64 ~/C/r/bin (new-linker)> uname -a
Linux debian-rv64 5.17.0-1-riscv64 #1 SMP Debian 5.17.3-1 (2022-04-18) riscv64 GNU/Linux
zachary@debian-rv64 ~/C/r/bin (new-linker)> ./scheme -b ~/petite.boot ~/scheme.boot
Petite Chez Scheme Version 9.5.7
Copyright 1984-2021 Cisco Systems, Inc.

 (+ 1 2 3 4)
10
 (map (lambda (x) (* x x)) (iota 10))
Exception: illegal instruction.  Some debugging context lost
 (fl+ 1.5 1.5)
0.0

Yet bugs still exist:

floating point ops are wrong
SIGILL when using lambdas
no "> " in the prompt

That’s great! It looks like there are probably some issues with instruction encodings, plus maybe something not quite right in copying to and from floating point registers for floating point ops. Excellent job on getting the REPL to load!

It's working:

zachary@debian-rv64 ~> uname -a
Linux debian-rv64 5.17.0-2-riscv64 #1 SMP Debian 5.17.6-1 (2022-05-15) riscv64 GNU/Linux
zachary@debian-rv64 ~> ChezScheme/rv64le/bin/rv64le/scheme -b ./petite.boot scheme.boot
Petite Chez Scheme Version 9.5.7
Copyright 1984-2021 Cisco Systems, Inc.

> ((lambda (x) 
     ((lambda (y) 
        ((lambda (z) (printf "~a~a~a~n" x y z)) "s")) "e")) "y")
yes
> (fl+ 1.2 1.2)
2.4
> (fl+ 1.2 1.2 3.4 5.12345)
10.923449999999999

BUT I have no idea why. I just gave two machine-dependent registers aliases and used those in the assembler. It still cannot bootstrap; there's an error:

compiling cmacros.ss with output to cmacros.so
compiling ../nanopass/nanopass.ss with output to nanopass.so
compiling ../nanopass/nanopass/language.ss with output to nanopass/language.so
compiling ../nanopass/nanopass/helpers.ss with output to nanopass/helpers.so
compiling ../nanopass/nanopass/implementation-helpers.chezscheme.sls with output to nanopass/implementation-helpers.chezscheme.so
Exception in close-port: failed on #<binary output port nanopass.so>: bad file descriptor

It's bootstrapped (at least on my machine). Repo at: https://github.com/maoif/ChezScheme. Configure using ./configure -m=rv64le.

Thanks for your help!

Currently the FFI functionality is limited.

Please tell me what requirements it needs to meet to be merged into the main branch.

Thanks again, @maoif, for your work on the RISC-V backend — now merged and passing all tests that I've tried on both emulated and real hardware.

cisco / ChezScheme

Some questions regarding porting Chez Scheme to RISC-V #601