gapisback opened this issue 1 year ago
I tried running `test.sh` on my Linux dev machine and I hit the following issues:

- issues with `pytest` and `protobuf`
- `noenc`; this one is minor, not sure if we should care for openssl v1.x at all
- waiting 120 sec. for sockets to be closed; we should be able to fix this with `SO_REUSEADDR`?
- test script deleting non-git files; this is not ok
- `sudo` during tests; I understand we need this to emulate SEV, but I won't give root permissions to a test script running on my machine

I guess the best way to go here is to provide a Docker container with all of the dependencies that can be used for running tests. The Dockerfile can be used as a reference for what packages and configurations are needed if you want to run this on your physical machine.
The Docker container is a good approach. Let's discuss this next week and prioritize this work item.
@rgerganov -- I investigated one of the points you raised above:

> _waiting 120 sec. for sockets to be closed; we should be able to fix this with `SO_REUSEADDR`?_

The sleep comes from the pytest case in `test_certifier_framework.py`, in between `channel.init_server_ssl()` and `cfm.server_dispatch()`.

In `server_dispatch() -> open_server_socket()`, we are already using the re-use-address semantic, as seen below:
```c
// Reuse addresses and ports
#define REUSE_SOCKETS_AND_PORTS
#ifdef REUSE_SOCKETS_AND_PORTS
  int reuse = 1;
  if (setsockopt(sfd,
                 SOL_SOCKET,
                 SO_REUSEADDR,  // <<<<<
                 (const char *)&reuse,
                 sizeof(reuse))
      < 0) {
    fprintf(stderr, "Can't reuse socket %s\n", __func__);
    return false;
  }
```
Empirically, even though this flag is set, if you re-run the test case repeatedly on a dev machine, the socket is not closed and made reusable fast enough; hence this sleep was added.

The test file `test_certifier_framework.py` has several test cases, each of which establishes an SSL connection, does some work, and then disconnects. When these are run back-to-back in CI, I need a mechanism to ensure that the socket re-use property kicks in "fast enough".

I know this is irritating, but I don't think anything is functionally broken, or that the sleep is masking broken functionality. I can investigate further, but debugging this on CI machines is a bit difficult.
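For reference, one alternative to a fixed 120-second sleep on the test side is a bind-retry loop: keep the same `SO_REUSEADDR` semantic as the C snippet, but retry the bind until the previous socket has actually been released. This is only a sketch under my own naming (`bind_with_retry` is not a framework API):

```python
import socket
import time

def bind_with_retry(port, timeout_s=120.0, interval_s=0.5):
    """Bind a listening socket, retrying until any lingering socket on
    the port has been released, instead of sleeping a fixed 120 sec."""
    deadline = time.monotonic() + timeout_s
    while True:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Same semantic as the setsockopt(SO_REUSEADDR) call in the C code.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind(("127.0.0.1", port))
            s.listen(1)
            return s
        except OSError:
            s.close()
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)
```

A test harness could poll like this instead of sleeping unconditionally, so back-to-back test cases wait only as long as the kernel actually needs.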
@yelvmw -- Here's the trace evidence showing why some of the Python tests in `test_certifier_framework.py` hang when run on the SGX machine.

I think it's because, in the shared library, the code for `ENABLE_SEV=1` is automatically enabled for all hardware platforms. (I had raised this 'issue' in our previous visits to certifier.mak; at the time, it was felt that it's better to have all code `#include`'d by default for all TEEs.) I suspect that is causing the issue, as seen in the execution-flow trace instrumentation:
```
test_certifier_framework.py::test_run_app_as_a_client_init_client_ssl_with_trust_manager
*** cc_trust_manager_get_certified() ...
init_policy_key() ... Starting ...
PublicKeyFromCert() Starting ...
PublicKeyFromCert() ... Starting ...
GetX509FromCert() succeeded.
X509_get_pubkey() succeeded.
X509_NAME_get_text_by_NID() succeeded, len=15.
X509_NAME_get_text_by_NID() succeeded.
PublicKeyFromCert():425 succeeded.
PublicKeyFromCert():432 succeeded.
PublicKeyFromCert():435 Executing ...
get_tcb_version_from_vcek():270 Executing ...
vcek_ext_byte_value():230 Executing ...
```
Basically, the Python method invokes `init_policy_key()`, which runs through until `get_tcb_version_from_vcek() -> vcek_ext_byte_value()`.

I instrumented the latter function, and it hangs on line 232 below:
```c
230   printf("%s():%d Executing ...\n", __func__, __LINE__);
231   // Use OID for both lname and sname so OBJ_create does not fail
232   nid = OBJ_create(oid, oid, oid);
233   printf("%s():%d Executing ...\n", __func__, __LINE__);
```
`OBJ_create()` seems to be a SEV_SNP interface of some sort; I can't locate its sources.
To validate the theory that running SEV_SNP-specific code in this simulated enclave is the problem, I conditionalized the definition of `ENABLE_SEV` in certifier.mak and rebuilt the shared lib:
```make
ifndef NO_ENABLE_SEV
ENABLE_SEV=1
endif
```
This allows the tests to run cleanly without running into this hanging issue.
It's surprising to see `OBJ_create()` hang; it's an OpenSSL function. I checked the BUGS section of its manual page and only see a caveat about concurrent invocation. Is that what is happening here?
@yelvmw - About the concurrent use of `OBJ_create()`: the server & client code do execute concurrently, although there isn't any sort of race condition.

However, in the affected test case(s) [e.g., `test_run_app_as_a_client_init_client_ssl_with_trust_manager()`], we start just the client-app portion on its own, talking to the Certifier Service. So, no concurrent invocation of this OpenSSL interface should come into play.
@rgerganov: I'm back to cleaning up some items you logged here: "test script deleting non-git files; this is not ok".

Firstly, I do need `rm_non_git_files`, or some hook like it, to clean up cruft between runs of `run_example.sh`. Otherwise, I've found that workflow steps for some sample apps may interfere with one another, causing instability. I just need a way to "clean up the entire dev env" at the start of each run of this script.
Second, the command runs `git ls-files` from the certifier root dir. Why would a user's config / env files live inside the `certifier-framework-for-confidential-computing/` sub-dir? Wouldn't we expect these to usually live under `$HOME`?

Are you worried about IDE config files for this Certifier Framework living inside this project's root dir?
What if we exclude dot-files, i.e., filter out `\.*` files from the `rm`? Would that suffice to address your concern?
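To make the dot-file proposal concrete, here is a minimal sketch (`drop_dot_files` is a hypothetical helper, and the file names are made-up examples, not actual repo contents):

```python
def drop_dot_files(paths):
    """Drop any path with a component starting with '.' (dot-files and
    anything under dot-directories) from the deletion-candidate list."""
    return [p for p in paths
            if not any(part.startswith(".") for part in p.split("/"))]

# Hypothetical candidate list produced by a git-ls-files-style command:
candidates = [".vscode/settings.json", "app_data.server/policy.bin",
              ".env", "provisioning/vcek.der"]
print(drop_dot_files(candidates))
# -> ['app_data.server/policy.bin', 'provisioning/vcek.der']
```

Note that a component like `app_data.server` survives: only components that *start* with a dot are treated as user config.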
Alternate solution -- let me know if you think this is good enough: I took a look at the files that need to be cleaned up by this command. It seems that if we filter the output of `git ls-files` down to the files in the `app_data*/` and `provisioning` sub-dirs, and pick up build-output files (e.g. `*.o`, `*.exe`, `*wrap*`, etc.), that may be sufficient.

This kind of filtering is prone to being incomplete, but it may be the only reliable way to do this cleanup without impacting the user's config / env files.
--
[Update] I just discovered that the list of files reported by `git ls-files . --exclude-standard --others --ignored` is probably sufficient. It includes all objects, exes and SWIG-generated files. Would you be willing to go with this approach?

The idea is that we only `rm` files that match our project's `.gitignore`, which is not a bad thing, right? Do you expect a user project's config / env files to match our project's `.gitignore`?

Let me know your thoughts on this approach, and whether we really need to worry about config / env files in the certifier root dir.
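As a sketch of the `git ls-files`-based cleanup described above (the function name `rm_ignored_files` is mine, not existing code; only files git reports as ignored get deleted):

```python
import os
import subprocess

def rm_ignored_files(repo_root="."):
    """Delete only files matched by .gitignore (build output,
    SWIG-generated files, etc.), leaving tracked files and
    untracked-but-not-ignored files alone."""
    out = subprocess.run(
        ["git", "ls-files", ".", "--exclude-standard", "--others", "--ignored"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    ).stdout
    removed = []
    for rel in out.splitlines():
        path = os.path.join(repo_root, rel)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(rel)
    return removed
```

Since the list comes straight from git's ignore machinery, a user's config / env files would only be deleted if they happened to match the project's `.gitignore`.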
The newly added `test.sh` script is not working correctly on engineers' Linux dev envs. There are some hard-coded dependencies in various build scripts that need to be corrected. There are at least a couple of different issues to resolve under this issue:

- `PY_INCLUDE = -I /usr/include/python3.10/` needs to be fixed in all makefiles.
- `LOCAL_LIB=/usr/local/lib64` needs to flow from `test.sh` to sub-scripts / makefiles.

Here are some examples of failures: