chipsalliance / VeeRwolf

FuseSoC-based SoC for VeeR EH1 and EL2
268 stars 61 forks source link

nondeterministic and incorrect behavior of code ran over SweRVolf on Nexys-A7 #22

Closed monniaux closed 3 years ago

monniaux commented 3 years ago

SweRVolf release: 904f69e

The following implementation of QuickSort is deterministic and contains no undefined behavior (checked with tis-interpreter and with CompCert interpreter; compiles with no warning and runs perfectly otherwise). It should output sorted = true.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <inttypes.h>

typedef uint32_t data;
void quicksort(data *A, int len);
void data_vec_random(data *a, unsigned len);
bool data_vec_is_sorted(const data *a, unsigned len);

/* Rosetta Code */
void quicksort(data *A, int len) {
  if (len < 2) return;

  data pivot = A[len / 2];

  int i, j;
  for (i = 0, j = len - 1; ; i++, j--) {
    while (A[i] < pivot) i++;
    while (A[j] > pivot) j--;

    if (i >= j) break;

    data temp = A[i];
    A[i]     = A[j];
    A[j]     = temp;
  }

  quicksort(A, i);
  quicksort(A + i, len - i);
}

data data_random(void) {
  static uint64_t next = 1325997111;
  next = next * 1103515249 + 12345;
  return next;
}

void data_vec_random(data *a, unsigned len) {
  for(unsigned i=0; i<len; i++) {
    a[i] = data_random() % 100;
  }
}

bool data_vec_is_sorted(const data *a, unsigned len) {
  for(unsigned i=0; i<len-1; i++) {
    if (a[i] > a[i+1]) return false;
  }
  return true;
}

#if 0
#include "cycles.h"
#else
typedef int cycle_t;
static cycle_t get_cycle(void) { return 0; }
static void cycle_count_config(void) { }
#define PRcycle "d"
#endif

#define len 1000
static data vec[len];

int main (void) {
  //  printf("stack size = %u\n", CONFIG_MAIN_STACK_SIZE);
  cycle_count_config();
  data_vec_random(vec, len);
  cycle_t quicksort_time = get_cycle();
  quicksort(vec, len);
  quicksort_time = get_cycle() - quicksort_time;
  printf("sorted=%s\n"
     "time cycles:%" PRcycle "\n",
     data_vec_is_sorted(vec, len)?"true":"false",
     quicksort_time);
#if 1
  for(int i=0; i<len-1; i++) {
    if (vec[i] > vec[i+1]) {
      printf("%d %d %d\n", i, (int) vec[i], (int) vec[i+1]);
    }
  }
#endif
  return 0;
}

I compiled it and ran it on SweRVolf on Nexys-A7 (tried both with netlist generation with Vivado 2018.2 and Vivado 2020.1). Executions are incorrect and nondeterministic :

*** Booting Zephyr OS build v2.4.0-rc2-106-g25c0810ae2e6  ***
sorted=false
time cycles:0
25 3 1
660 64 32
*** Booting Zephyr OS build v2.4.0-rc2-106-g25c0810ae2e6  ***
sorted=false
time cycles:0
770 74 68
808 79 77

The problem applies whether this program is compiled by Zephyr's default gcc, or compiled with CompCert then assembled by Zephyr's gcc.

I also had weird behavior with other programs.

quicksort.tar.gz

olofk commented 3 years ago

ok, that sounds really bad. Thanks for the detailed analysis. I'm very busy ATM but I'll try to find some time to look at it. First thing I'll do will be to run it in a verilator simulation to see if it has the same behaviour there

monniaux commented 3 years ago

The problem still occurs if the netlist is generated with Vivado 2020.2.

The problem does NOT occur if using Verilator simulation.

Booting Zephyr OS build v2.4.0-rc2-106-g25c0810ae2e6 sorted=true time cycles:0

olofk commented 3 years ago

Hi,

Thanks for the info. It has been high on my priority list to get this sorted out but I haven't been able to make time for it until now. I'll start with some testing to make sure I can reproduce it here. Which release of the SweRVolf rtl are you using?

olofk commented 3 years ago

FYI, I can reproduce the issue here, which is a good start

monniaux commented 3 years ago

fusesoc/fusesoc_libraries/swervolf$ git rev-parse --short HEAD 2e1802e

olofk commented 3 years ago

Some observations so far:

  1. Whenever I recompile the program and load it with openocd for the first time, it will crash before outputting anything. If I then dump the beginning of the memory content with mdw 0 16 I can see that parts of the memory has been overwritten by something. If I load the same program again, it usually works. This makes me suspect some cache issue.
  2. I changed the main loop a bit like this
    while(1) {
        int i = 0;
        do {
          data_vec_random(vec, len);
          quicksort(vec, len);
          i++;
        } while(data_vec_is_sorted(vec, len));
        printf("Error after %d rounds\n", i);
      }
      return 0;

    and I'm varying len to find clues. It's clear that shorter lengths makes the problem less common, but I haven't found if there is some specific value where it disappears completely. At len 500, it runs perhaps on average 4-80 rounds before crashing. At len 460 I see 400-7000 rounds before a crash and with len 450 I have been running ten minutes without crashes.

So not really any new info but wanted to leave a trail for myself and anyone else looking into this

monniaux commented 3 years ago

I'm not getting crashes, just incorrect sorting. Maybe it depends on the compilation chain in use.

olofk commented 3 years ago

I see now that I wasn't very clear. After uploading the first time after recompilation I (almost) always get a crash where part of the memory gets overwritten

On subsequent runs with the same elf I don't get crashes, just incorrect sorts. And the longer len, the more likely that the sort is incorrect

olofk commented 3 years ago

iladata.tar.gz Another update. I have rewritten the test application further to make it easier to find and trigger the value that gets sorted in the wrong bin. I have also added probes inside the FPGA. Yesterday I managed to catch a case where a value was written to memory, but the old value is read back a while later. At this point is very likely an issue with the AXI subsystem or the memory controller and I will add more probes along the way.

Attaching a VCD dump of the Vivado capture for reference. We can see three reads from 0x5020 (initiated at times 180, 325 and 395. Each time we read back 0x5d.

At time 651 we can see that the value 0x5f gets written to address 0x5020. Later however, at time 921 we initiate a read to address 0x5020 and now we get back the old value 0x5d instead.

olofk commented 3 years ago

@monniaux I tracked the issue down to the litedram DDR controller but didn't pursue it further. Instead I generated a new litedram instance since there seems to have beeen a lot of bug fixes since the last one. With the updated version I'm getting consistent results with the sorting and can't reproduce the problem. Would you mind trying it as well to make sure it works on at least two different system. A preliminary version is pushed to the memfix branch for now

monniaux commented 3 years ago

Using the 'memfix' branch:

My 'quicksort' and 'number theoretic transform' examples now run. Quicksort tested with both gcc and CompCert .

olofk commented 3 years ago

Fantastic! Sorry for taking so long. In hindsight I see it would had saved me a lot of work if I had prioritized this bug report

enjoy-digital commented 3 years ago

@olofk: That's great to see LiteDRAM used here. Sorry for the time you spent looking at this if this issue was related to the core. Improvements are still made to the core and that's good to see you've been able to update the core easily and that it fixed your issue. Happy to help on similar issues in the future if you suspect this are related the core.

olofk commented 3 years ago

Hi @enjoy-digital . LiteDRAM is great since it saves me from having to deal with FPGA vendor memory IP which usually is a pain. I was planning to contact you when I realized the bug was in LiteDRAM but thought I'd update first since many things had changed. That seems to have worked. It's good to see also that you have started with tagged releases. That makes it much easier to maintain than relying on a random combination of git SHAs

monniaux commented 3 years ago

ntt.tar.gz The ntt sample benchmark. This is a kind of Fast Fourier Transform in a finite field, using integer arithmetic (modulus, multiplication, addition, subtraction). FFT then reverse-FFT must yield exactly the original array, otherwise it means there was some bug.

agrobman commented 3 years ago

@monniaux , should this ntt program work with reduced buffer? instead of 64KB buffer with 4KB buffer ? I'm getting buffer miscompare : compare = -101

monniaux commented 3 years ago

NTT needs 65536 elements in the buffer, but it should be possible to modify it for 256 elements.

De: "agrobman" notifications@github.com À: "chipsalliance/Cores-SweRVolf" Cores-SweRVolf@noreply.github.com Cc: "DAVID MONNIAUX" David.Monniaux@univ-grenoble-alpes.fr, "Mention" mention@noreply.github.com Envoyé: Vendredi 12 Février 2021 16:12:56 Objet: Re: [chipsalliance/Cores-SweRVolf] nondeterministic and incorrect behavior of code ran over SweRVolf on Nexys-A7 (#22)

[ https://github.com/monniaux | @monniaux ] , should this ntt program work with reduced buffer? instead of 64KB buffer with 4KB buffer ? I'm getting buffer miscompare : compare = -101

— You are receiving this because you were mentioned. Reply to this email directly, [ https://github.com/chipsalliance/Cores-SweRVolf/issues/22#issuecomment-778253005 | view it on GitHub ] , or [ https://github.com/notifications/unsubscribe-auth/AAZZPZWDSRIT7HVMGYU3TZLS6VAPRANCNFSM4RZZO6RQ | unsubscribe ] .

agrobman commented 3 years ago

thanks, but to get maximum performance from SweRV CPU the data/stack should be placed to DCCM. I'm not sure what SweRV CPU config is used in SweRVolf, but to get good numbers with this program, you probably need to increase DCCM size. ( SweRV CPU has only 64KB DCCM with default configuration.)

monniaux commented 3 years ago

This program has no particular interest except for testing scheduling inside our compiler.

For getting it to run with 256 words instead of 65536: replace the appropriate lines in ntt.c

define LOG_LENGTH 8

define LENGTH 256

define MUL_MODULUS 256

define MODULUS 257