firesim / FireMarshal

Software workload management tool for RISC-V based SoC research. This is the default workload management tool for Chipyard and FireSim.
https://docs.fires.im/en/latest/Advanced-Usage/Workloads/index.html

Add baremetal CI #254

Closed abejgonzalez closed 1 year ago

abejgonzalez commented 2 years ago

This adds basic CI testing for FireMarshal using the existing fullTests.py script. For now it uses the same CI machine as FireSim's local FPGA support. The testing can be parallelized in the future.

abejgonzalez commented 2 years ago

Currently, there are issues with running a screen session. See "Must be connected to a terminal" in https://github.com/firesim/FireMarshal/actions/runs/3298904128/jobs/5441742036#step:3:70

abejgonzalez commented 2 years ago

For more context:

This PR adds CI to run FireMarshal's fullTest.py script, which should run the full default FireMarshal test suite. Under the hood, this fullTest script runs `marshal test <...>` for a variety of workloads found in `test/`. The current issue is that this `marshal test <...>` command creates a `screen` session (see ref 1) that can't run in the current GH-A shell. This is the error that is seen: https://github.com/firesim/FireMarshal/pull/254#issuecomment-1287239154. I don't know if this is due to needing to call `subprocess.Popen` with the `shell=True` flag, or if I need to somehow change the default shell of GH-A to allow for `screen` sessions. If this issue is fixed (i.e. `screen` can be called from `marshal test <...>` within a GH-A shell), then I expect fullTest.py to work and for us to have a functioning CI workflow for this repo!

Any help on this would be appreciated. Otherwise I can look into it later (in a month or so).

Ref 1: https://github.com/firesim/FireMarshal/blob/d38234c2e17aaa87d10a10b312b380a1a8e89257/wlutil/launch.py#L142-L145

TLDR: `screen` doesn't seem to work either from within a Python script (using `subprocess.Popen`) or when called directly from the command line. If someone can figure out what the issue is, then this CI should work.
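One possible explanation for the "Must be connected to a terminal" error is that GH-A job steps run without a controlling tty, and `screen` refuses to start in that case. A minimal sketch of a workaround (not FireMarshal's actual code) is to hand the child process a pseudo-terminal via Python's `pty` module; here `tty` stands in for the real `screen -dm ...` invocation:

```python
import os
import pty
import subprocess

# Allocate a pseudo-terminal pair. The child gets the slave end as its
# stdin/stdout/stderr, so isatty() checks inside the child succeed even
# though the CI runner itself has no terminal attached.
master, slave = pty.openpty()

# `tty` is a stand-in for the real command (e.g. `screen -dm ...`); it
# prints the name of the terminal connected to its stdin, or fails if
# there is none -- exactly the check that trips up screen in CI.
proc = subprocess.Popen(
    ["tty"],
    stdin=slave, stdout=slave, stderr=slave,
    close_fds=True,
)
proc.wait()

# Close our copy of the slave end, then drain what the child wrote.
os.close(slave)
output = os.read(master, 1024).decode()
os.close(master)

print("exit code:", proc.returncode)
print("child saw terminal:", output.strip())
```

Whether this is the right fix for `marshal test` depends on where in `wlutil/launch.py` the `screen` session is spawned; the sketch only shows the pty mechanism itself.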

tianrui-wei commented 2 years ago

@abejgonzalez The PR should be working now, thanks for the patches

abejgonzalez commented 2 years ago

> @abejgonzalez The PR should be working now, thanks for the patches

If you have the time, parallelizing the fullTests.py script would be nice

tianrui-wei commented 1 year ago

From what I can see at the moment, parallelization of fullTest.py fails because of the way FireMarshal handles its workdir. Since there's only one workdir, it's very easy for different jobs to conflict when they run in parallel.
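The conflict described above could in principle be avoided by giving each parallel job its own isolated working directory. The sketch below is hypothetical (the `marshal -d <workdir>` flag spelled out in the comment is an assumption, not a confirmed FireMarshal option) and uses a dummy file write in place of the real test invocation:

```python
import concurrent.futures
import pathlib
import tempfile

def run_test(workload: str) -> str:
    # Each job gets a private temporary workdir, so parallel builds
    # can't stomp on each other's state.
    with tempfile.TemporaryDirectory(prefix=f"marshal-{workload}-") as workdir:
        # In the real flow this would be something like
        # `marshal -d <workdir> test <workload>.yaml` (flag name assumed).
        marker = pathlib.Path(workdir) / "result.txt"
        marker.write_text(f"{workload}: pass\n")
        return marker.read_text().strip()

# Example workload names, chosen for illustration only.
workloads = ["br-base", "fedora-base", "bare"]
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(run_test, workloads))
print(results)
```

The trade-off is disk usage and rebuild time: fully isolated workdirs mean shared parent images get rebuilt per job unless there is a separate cache, which is the overbuilding concern raised below in this thread.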

tianrui-wei commented 1 year ago

You can see the changes at https://github.com/firesim/FireMarshal/tree/use-conda-fix

abejgonzalez commented 1 year ago

Are we sure that this doesn't rebuild br-base for every single job that shares the same br-base spec (and likewise for fedora)? I think the current repo setup overbuilds things. Instead, we should do the following:

  1. Read the parent specification of the current workload yaml
  2. If the parent specification image was built, then just copy the image under a new name (adding a hash of the specification contents) and copy the workload files into that image.
  3. If the parent specification image is not built, build it and name the image with a hash of the specification (and maybe of the files used, i.e. hashes of git repos). Then do step 2.
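The steps above amount to content-addressed caching of parent images. A minimal sketch (all names and the on-disk layout are hypothetical, and the "build" is a stand-in file write rather than a real image build):

```python
import hashlib
import json
import pathlib
import shutil
import tempfile

def spec_hash(spec: dict) -> str:
    # Stable hash over the specification contents. Hashes of the files
    # used (e.g. git repo revisions) could be mixed in here as well.
    blob = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def build_image(spec: dict, cache: pathlib.Path) -> pathlib.Path:
    img = cache / f"{spec['name']}-{spec_hash(spec)}.img"
    if img.exists():
        return img  # parent already built: reuse it (step 2)
    if "base" in spec:
        # Build (or fetch) the parent first, then start the child
        # image as a copy of it (step 3 followed by step 2).
        parent = build_image(spec["base"], cache)
        shutil.copy(parent, img)
    else:
        # Stand-in for actually building a base image from scratch.
        img.write_text(f"base image for {spec['name']}\n")
    return img

cache = pathlib.Path(tempfile.mkdtemp())
br_base = {"name": "br-base"}
child_a = {"name": "child-a", "base": br_base}
child_b = {"name": "child-b", "base": br_base}

a = build_image(child_a, cache)
b = build_image(child_b, cache)
# br-base is built once and reused by both children.
print(sorted(p.name for p in cache.glob("*.img")))
```

With this scheme, two jobs whose workloads share the same br-base spec hit the same cache entry instead of each rebuilding the parent.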

Depending on the time I have in the next week, I can take a stab at this / look into whether things are properly cached.

abejgonzalez commented 1 year ago

In the interim, I'll just run the bare-metal tests (as a sanity check) while we figure out the situation with parallelization and the extra builds.