[FIX] spike-diff/difftest.cc: flush the tlb before copy memory towards ref

Without flushing the tlb before copy memory towards ref, the copying may result in wrong results on ref.

Why I found this problem: The reason is interesting and can serve as an example of how things go wrong. The issue I met is that when implementing difftest detach & attach in PA3, I copied a series of instructions to RESET_VECTOR like csrw ... to set the CSR on ref properly. I copied momory two times: one for csrw instrucctions, and the other for the whole dut memory. However, my difftest results tell me that the second copy didn't work as expected. And I found that there is an icache in spike, which is not flushed properly in the interface functions provided, thus the latter copy didn't work.

Although your patch is a solution to the issue, I suggest to let Spike execute a fence.i instruction right after the second memory copy. The semantics of diff_memcpy() only cares about behavior in the memory level. Synchronization between memory and the instruction stream should be considered by CPU, which is exactly the aim of fence.i.

Alternatively, you may put the CSR instruction sequence somewhere else to execute.

It is the programmer's responsibility to guarantee that two instruction sequences do not overlap. If they do, use a fence.i before executing the second.

Although your patch is a solution to the issue, I suggest to let Spike execute a fence.i instruction right after the second memory copy. The semantics of diff_memcpy() only cares about behavior in the memory level. Synchronization between memory and the instruction stream should be considered by CPU, which is exactly the aim of fence.i.

Alternatively, you may put the CSR instruction sequence somewhere else to execute.

It is the programmer's responsibility to guarantee that two instruction sequences do not overlap. If they do, use a fence.i before executing the second.

Thanks for your reply! In my opinion, maybe using a fence.i instruction to synchronize from the software side is a more elegant way, WHEN in riscv programs in which the instructions are modified by themselves. However, I think(and I assume) the diff_memcpy serves as a method to modify the inside program(the ref) from the outside(the caller), and the modification is not predictable by the inside. Thus in my implementation the duty of running fence.i shall be done by the outside, who takes the responsibility of modifying the inside.

To make the API diff_memcpy useful, it shall execute fence.i automatically after its call, no matter how it does, maybe using a true instruction fence.i, or simply running mmu->flush_tlb();. If it does not do that, nothing is guarenteed and its functionality is not complete. Thus I do not agree that the semantics of diff_memcpy do not include flushing the icache and pipeline.

Anyway, various methods can be applied based on concrete implementations and the understanding of the project code. And I mark the PR as closed.

Below are some further discussions..

Sometime you can not call something like mmu->flush_tlb() outside the reference design. Consider a real processor designed by Verilog. The simulation environment may not have the control of the components inside the processor.
In a real processor, there may be more components to synchronize, such as branch predictor, instruction TLB, instruction buffer in the pipeline. Anything related to instruction should be under consideration. fence.i is a better choice, since it is implemented inside the processor, which has already considered all of them.
I think what you really want is an API like diff_memcpy_and_sync(). But from the aspect of API design, it is better to split it into two simpler APIs, diff_memcpy() and diff_sync(), for the sake of reusability. diff_memcpy() performs like DMA in hardware, which makes sense. diff_sync() lets the reference design perform the instruction stream synchronization. There exist some circumstances where you are sure that the copy will not overwrite any instructions. If it is the case, calling diff_memcpy() is enough. If you are not sure, call diff_sync() after that. In your case, you may call diff_sync() after the second diff_memcpy().
the modification is not predictable by the inside

You are right. But the point is not about whether diff_memcpy() is useful or not. Even with diff_memcpy_and_sync(), you can not call it casually during the execution of the guest program. That is, it is not diff_memcpy_and_sync() who makes the modification predictable to the reference design. In fact, it is the caller of diff_memcpy_and_sync() who makes things predictable. The caller should choose a right time and a safe memory region to copy to. But it is difficult to perform this choice automatically by augmenting the behavior of the API. Eventually, the choice is made by the programmer of the caller. Therefore, instead of designing an API which will do everything for you, keeping the minimal behavior of each API is a better idea.

Thank you @sashimi-yzh, that makes sense. Still, I have some questions about that.

I think what you really want is an API like diff_memcpy_and_sync(). But from the aspect of API design, it is better to split it into two simpler APIs, diff_memcpy() and diff_sync(), for the sake of reusability.

Yes, I agree with that. What I want to implement is such a diff_sync() interface.
If I flush the icache and other circuits using the fence.i instruction, I have to know where to find such a fence.i instruction. However, I think this is not easy in the implementation layer. I enumerates some possible implementations:
1. The fence.i is added by ref_memcpy to some address and we ensure the cpu fetches and executes that instruction right after the ref_memcpy. However, this falls into a cat-and-mouse game: to ensure the fence.i instruction will be executed properly, it's necessary to flush the icache before its execution, otherwise the machine may execute what is remained in the icache, instead of fence.i. In a nut shell, we have to flush the icache before execute an instruction which is used to flush the icache. This is not reasonable.
2. The fence.i is laid on some specific address in the ref's memory initially, and to synchronize, we change the pc to the hard-coded address and executes fence.i there. However, I think this either increases the complexity of the hardware design and damages the generality of the architecture, if the initialization is implemented by the hardware; or requires the software to meet some extra constraints(e.g. add some extra code before calling main. Note that the constraints are not standardized), if the initialization is done by the software. Moreover, this method needs to change to the cpu state pc from outside, which may be hard in real design, as you mentioned(Also note that this cannot be implemented by a jump instruction, due to the same reasons in (i.)).
3. The fence.i is executed virtually, not from the memory (maybe a external vector instead?), thus the execution do not relate to icache and avoid the issue mentioned in (i.). I think the additional cost is too high. Also, when it comes to spike, I didn't find such a functionality.
So I wonder whether there is any good implementation of such a logic?
Thus, in my opinion, adding a fence.i to flush the icache cannot easily be implemented. Calling the function provided by the simulation framework is one of the most efficient and easiest way to achieve that.
Consider a real processor designed by Verilog. The simulation environment may not have the control of the components inside the processor.

That's true. But due to the reasons I mentioned above, I do not think that such work can be done from the pure software side, affecting only the memory, without telling the cpu such changes directly.

For your information, the RISC-V processor project RocketChip implements a debug module. Inside the debug module, there is a program buffer which is accessed by MMIO. Instructions fetched from the program buffer will not enter icache.

Outside the real chip, we can use the on chip debugger, such as OpenOCD, to communicate with the debug module. This is achieved by eventually sending some signals to the JTAG pins of the chip. These signals will finally activate the debug module inside the chip. For more information, you may refer to the RISC-V Debug Specification.

Note that this is far away from the original issue you have encountered. I suggest putting the CSR instruction sequence somewhere else to execute. For example, the end of the memory may be a good choice, since it will hardly execute instructions at the end of the memory.

NJU-ProjectN / nemu

[FIX] spike-diff/difftest.cc: flush the tlb before copy memory towards ref #80