GaloisInc / BESSPIN-FETT-Portal

The web-based portal used by FETT Researchers to manage Target instances.
Apache License 2.0

Implement rebootable Unix in Portal backend and UI #324

Open andrew-bivin opened 4 years ago

andrew-bivin commented 4 years ago

Once the FETT-Target team has tested rebootable Unix system functionality (described in this closed issue: https://github.com/DARPA-SSITH-Demonstrators/SSITH-FETT-Target/issues/526), we need to implement back- and front-end functionality in FETT Portal for researchers to soft-reboot an instance (rather than re-provision an instance from scratch).

mattlebeau-galois commented 4 years ago

Everything stays provisioned; modifications here are above AWS (i.e., no reflash, etc.).

mattlebeau-galois commented 4 years ago

TBD if a UI is needed for this.

mattlebeau-galois commented 4 years ago

Moving back to backlog, based on @rtadros-Galois input.

jrtc27 commented 4 years ago

Could you please summarise the input that led to this decision for those of us not at Galois/Five Talent/etc?

rtadros125 commented 4 years ago

@jrtc27 Given FETT-Target #673 and #674, and given that this is low priority from the point of view of DARPA, which is prioritizing other tasks, we do not have enough cycles to focus on debugging these (basically #673). The interns will be poking at them; we'll include the feature once the bugs are fully solved rather than ship a broken feature.

jrtc27 commented 4 years ago

#673 doesn't matter, you can just disable the button for Firesim (I'm not aware of any Firesim TA-1 teams who need it), no? I also don't see why #674 is an issue; surely rebooting is still faster than provisioning+booting+configuring in the event of, say, an attack causing the kernel to hang (or panic but fail to reboot itself)?

rwatson commented 4 years ago

We view fast recovery from crashes as a key feature to allow FETT researchers to operate efficiently with the forthcoming CHERI-RISC-V Release 2, in which kernel debugging will play an essential role. Could you let us know what current reset vs re-provision times are for CHERI-RISC-V nodes?

rtadros125 commented 4 years ago

@jrtc27 @rwatson I think we might be speaking about two different things, or maybe we're not. Let me elaborate:

  1. Regarding the "rebootable Unix" phrasing:

    • Both CHERI variants (default and purecap) are completely rebootable from the researcher end. So they can su, then reboot, and it will work fine. Or, if they cause a kernel panic or lose network, they can use the UART piping to reboot as well. This ticket is not related to that.
    • Other connectal targets are mostly rebootable too. We haven't had the time and prioritization to work on that aspect, but it's on the horizon.
    • Firesim targets are the problem and the main motivation for (2).
  2. The portal "reset" button:

    • After the instance is running, the researchers can press either terminate or reset target. This would reload the FPGAs and boot the OS again, but using the same filesystem and without any tests.
    • ticket #674: Jessica said "It's not an issue". Well, if you press reset, it wouldn't work because of the timeout. So it is an issue. Trivially fixable, I understand, but the fix won't make it in until next week's release, per the agreed Galois-FTS-Synack release flow.
    • ticket #673: Since the reset button feature is motivated primarily by Firesim, we have to fix this first before releasing the feature. This ticket is very time-consuming to reproduce and debug, which is why it's affected by planning/prioritization.
  3. Answering Robert's specific questions:

    • Kernel debugging is possible and is not related to this ticket. Am I missing something?
    • For Cheri nodes, rough estimates would be:
      • default: reset --> 10-11 minutes. provision --> 16-17 minutes.
      • purecap: reset --> 24-25 minutes (presumably; it currently times out at 21+2=23 minutes). provision --> 32-33 minutes.
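
To make the estimates above easier to compare, here is a minimal Python sketch that computes how much time a reset saves over a re-provision and whether a reset fits inside the 23-minute limit. The numbers are the rough estimates quoted in this comment; the names `ESTIMATES`, `RESET_TIMEOUT_MIN`, and `summarize` are hypothetical, and applying the same 21+2=23 minute timeout to both variants is an assumption (the comment only mentions it for purecap).

```python
# Rough estimates quoted in the comment above, in minutes, as (low, high).
ESTIMATES = {
    "default": {"reset": (10, 11), "provision": (16, 17)},
    "purecap": {"reset": (24, 25), "provision": (32, 33)},
}

# Assumption: the same 21 + 2 = 23 minute limit applies to every variant.
RESET_TIMEOUT_MIN = 21 + 2


def summarize(variant):
    """Return the reset-vs-provision saving and whether a reset fits the timeout."""
    reset_lo, reset_hi = ESTIMATES[variant]["reset"]
    prov_lo, prov_hi = ESTIMATES[variant]["provision"]
    return {
        # Smallest and largest saving from resetting instead of re-provisioning.
        "saved_min": (prov_lo - reset_hi, prov_hi - reset_lo),
        # True only if even the slowest estimated reset finishes before the timeout.
        "fits_timeout": reset_hi <= RESET_TIMEOUT_MIN,
    }


for name in ESTIMATES:
    print(name, summarize(name))
```

Under these estimates a reset saves roughly 5 to 7 minutes on default and 7 to 9 minutes on purecap, and the purecap reset exceeds the 23-minute limit, which matches the timeout described for #674.
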

jrtc27 commented 4 years ago

> Both CHERI variants (default and purecap) are completely rebootable from the researcher end. So they can su, then reboot, and it will work fine. Or, if they cause a kernel panic or lose network, they can use the UART piping to reboot as well. This ticket is not related to that.

Ok, so C-a r will be documented as the supported way to reboot CHERI instances? Button in the UI would of course be nicer, but that approach is sufficient (and they probably will have the UART logs up anyway to see what happened).

rtadros125 commented 4 years ago

> Ok, so C-a r will be documented as the supported way to reboot CHERI instances? Button in the UI would of course be nicer, but that approach is sufficient (and they probably will have the UART logs up anyway to see what happened).

@mattlebeau-galois Can you please add the use of Ctrl+A + r to reboot Cheri to portal learn? I'd say in the same sentence as reboot is mentioned.

rwatson commented 4 years ago

Hi @rtadros-Galois: Yes, I think I misunderstood the title of this issue. As long as the portal reset button works, resetting the state of the FPGA and OS kernel, but leaving the same filesystem in place, I think we are OK.

mattlebeau-galois commented 4 years ago

> > Ok, so C-a r will be documented as the supported way to reboot CHERI instances? Button in the UI would of course be nicer, but that approach is sufficient (and they probably will have the UART logs up anyway to see what happened).
>
> @mattlebeau-galois Can you please add the use of Ctrl+A + r to reboot Cheri to portal learn? I'd say in the same sentence as reboot is mentioned.

@rtadros-Galois - Added a story for this over in #445 - May I ask for your review/edits of the AC on that ticket to verify that I've correctly transposed the instructions, please?