ccc-certifier-framework / certifier-framework-for-confidential-computing

The Confidential Computing Certifier Framework consists of a client API, called the Certifier API, and server-based policy evaluation, called the Certifier Service. It simplifies and unifies programming and operations support for multi-vendor Confidential Computing platforms by providing scalable, policy-driven trust management.
Apache License 2.0

Stabilize execution of test.sh on dev Linux environments: certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory #215

Open gapisback opened 1 year ago

gapisback commented 1 year ago

The newly added test.sh script is not working correctly on engineers' Linux dev environments. There are some hard-coded dependencies in various build scripts that need to be corrected.

There are at least a couple of different issues to resolve under this one:

Here are some examples of failures:

483 + make -f certifier_tests.mak --always-make -j8 sharedlib LOCAL_LIB=/usr/local/lib64
484 protoc --cpp_out=. --proto_path ../certifier_service/certprotos ../certifier_service/certprotos/certifier.proto
485
486 compiling support.cc
[...]

550 compiling x509_tests.cc
551 g++ -I ../include -I/usr/local/opt/openssl@1.1/include/ -I ./sev-snp -I ./gramine -g -std=c++17 -D X64 -Wall -Wno-unused-variable -Wno-deprecated-declarations -O3 -fpic -o ./x509_tests.o -c x509_tests.cc
552 compiling certifier_tests_wrap.cc
553 g++ -I ../include -I/usr/local/opt/openssl@1.1/include/ -I ./sev-snp -I ./gramine -g -std=c++17 -D X64 -Wall -Wno-unused-variable -Wno-deprecated-declarations -O3 -fpic -I /usr/include/python3.10/ -o ./certifier_tests_wrap.o -c certifier_tests_wrap.cc
554 certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory
555   178 | # include <Python.h>
556       |           ^~~~~~~~~~
557 compilation terminated.
558 make: *** [certifier_tests.mak:195: certifier_tests_wrap.o] Error 1
559 make: *** Waiting for unfinished jobs....
560 ++ cleanup
561 ++ set +x
562
563 test.sh: **** Error **** Failed command, make -f certifier_tests.mak --always-make -j${NumMakeThreads} sharedlib LOCAL_LIB=/usr/local/lib64, at line 130 ./CI/scripts/test.sh while executing function test-core-certifier-programs
rgerganov commented 1 year ago

I tried running test.sh on my Linux dev machine and I hit the following issues:

I guess the best way to go here is to provide a docker container with all of the dependencies that can be used for running tests. The Dockerfile can be used as a reference for what packages and configurations are needed if you want to run this on your physical machine.

gapisback commented 1 year ago

The Docker container is a good approach. Let's discuss this next week to decide where this work item falls in the priority queue.

gapisback commented 1 year ago

@rgerganov -- I investigated one of the points you raised above:

"waiting 120 sec. for sockets to be closed; we should be able to fix this with SO_REUSEADDR?"

The sleep occurs in the pytest case test_certifier_framework.py, in between channel.init_server_ssl() and cfm.server_dispatch().

In server_dispatch() -> open_server_socket(), we already use the reuse-address semantics, as seen below:

    // Reuse addresses and ports
#define REUSE_SOCKETS_AND_PORTS
#ifdef REUSE_SOCKETS_AND_PORTS
    int reuse = 1;
    if (setsockopt(sfd,
                   SOL_SOCKET,
                   SO_REUSEADDR,      <<<<<
                   (const char *)&reuse,
                   sizeof(reuse))
        < 0) {
      fprintf(stderr, "Can't reuse socket %s\n", __func__);
      return false;
    }

Empirically, testing shows that even though this flag is set, if you re-run the test case repeatedly on a dev machine, the socket is not closed and reused fast enough. Hence, this sleep was added.

The test file test_certifier_framework.py has several test cases, each of which establishes an SSL connection, does some work, and then disconnects. So, when they run back-to-back in CI, I need a mechanism to ensure that the socket-reuse property kicks in "fast enough".

I know this is irritating, but I don't think this is functionally broken, or that the sleep is masking some broken functionality.

I can investigate further, but debugging this is a bit difficult on CI machines.
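To illustrate the SO_REUSEADDR semantics discussed above, here is a minimal, self-contained Python sketch (not the framework's actual open_server_socket(); the helper name here is purely illustrative). With the flag set, a listener can re-bind a port immediately after the previous socket on that port is closed:

```python
import socket

def bind_with_reuse(port: int) -> socket.socket:
    """Open a listening TCP socket with SO_REUSEADDR set (illustrative sketch)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Without SO_REUSEADDR, bind() can fail with EADDRINUSE while an
    # earlier connection on the same port is still lingering in TIME_WAIT.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen(1)
    return s

# Port 0 asks the OS for a free port; re-binding that same port right
# after close() succeeds because SO_REUSEADDR is set.
s1 = bind_with_reuse(0)
port = s1.getsockname()[1]
s1.close()
s2 = bind_with_reuse(port)
s2.close()
```

Note that SO_REUSEADDR only relaxes the bind()-time address check; it does not shorten the TIME_WAIT interval itself, which would explain why back-to-back runs can still need a small delay before the old connection's state fully drains.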

gapisback commented 1 year ago

@yelvmw -- Here's the trace evidence showing why some of the Python tests in test_certifier_framework.py are hanging when run on the SGX machine.

I think it's because in the shared library, the code for ENABLE_SEV=1 is automatically enabled for all hardware platforms. (I had raised this 'issue' in our previous visits to certifier.mak. At that time, it was felt that it's better to have all code #include'ed by default for all TEEs.)

That, I suspect, is what's causing the issue, as seen in the execution-flow trace instrumentation:

test_certifier_framework.py::test_run_app_as_a_client_init_client_ssl_with_trust_manager
*** cc_trust_manager_get_certified() ...

init_policy_key() ... Starting ...
PublicKeyFromCert() Starting ...
PublicKeyFromCert() ... Starting ...
GetX509FromCert() succeeded.
X509_get_pubkey() succeeded.
X509_NAME_get_text_by_NID() succeeded, len=15.
X509_NAME_get_text_by_NID() succeeded.
PublicKeyFromCert():425 succeeded.
PublicKeyFromCert():432 succeeded.
PublicKeyFromCert():435 Executing ...
get_tcb_version_from_vcek():270 Executing ...
vcek_ext_byte_value():230 Executing ...

Basically, the Python method invokes init_policy_key(), which proceeds as far as:

get_tcb_version_from_vcek() -> vcek_ext_byte_value().

I instrumented the last function, and it's hanging on L232 below:

 230   printf("%s():%d Executing ...\n", __func__, __LINE__);
 231   // Use OID for both lname and sname so OBJ_create does not fail
 232   nid = OBJ_create(oid, oid, oid);
 233   printf("%s():%d Executing ...\n", __func__, __LINE__);

OBJ_create() seems to be a SEV_SNP interface of some sort. I can't locate its sources.


To validate the theory that running SEV_SNP-specific code in this simulated enclave is the cause, I conditionalized the definition of ENABLE_SEV in certifier.mak and rebuilt the shared lib:

ifndef NO_ENABLE_SEV
    ENABLE_SEV=1
endif

This allows the tests to run cleanly without running into this hanging issue.

yelvmw commented 1 year ago

It's surprising to see OBJ_create() hanging. It's an OpenSSL function. I checked the BUGS section of the manual and only see a caveat about concurrent invocation. Is this the case here?

gapisback commented 1 year ago

@yelvmw - About the concurrent use of OBJ_create(): the server & client code do execute concurrently, although there isn't any sort of race condition.

However in the affected test case(s) [e.g., test_run_app_as_a_client_init_client_ssl_with_trust_manager()], we just start the client-app portion on its own, talking to the Certifier Service. So, no concurrent invocation of said SSL interface should come into play.

gapisback commented 1 year ago

@rgerganov: I'm back to cleaning up some items you logged here: "test script deleting non-git files; this is not ok".

Firstly, I do need rm_non_git_files, or some hook like it, to clean up cruft between runs of run_example.sh.

Otherwise, I've found that some workflow steps for some sample-apps may interfere with each other, causing instability. I just need a way to "clean up the entire dev env" at the start of each run of this script.

Second, the command runs git ls-files from the certifier root dir. Why would a user's config / env-files be living inside the certifier-framework-for-confidential-computing/ sub-dir? Wouldn't we expect these to usually live under $HOME? Are you worried about IDE config files for this Certifier Framework living inside this project's root dir?

What if we exclude dot-files, i.e., filter out \.* files from the rm? Would that suffice to address your concern?

Alternate Solution: Let me know if you think this is good enough:

I took a look at the files that need to be cleaned up by this command. It seems that if we filter the output of git ls-files to delete only the files in the app_data*/ and provisioning sub-dirs, plus the build-output files (e.g., *.o, *.exe, *wrap*, etc.), that may be sufficient.

This kind of filtering is prone to being incomplete, but it may be the only reliable way to do this cleanup without impacting the user's config / env-files.

-- [Update] I just discovered that the list of files reported by git ls-files . --exclude-standard --others --ignored is probably sufficient. It includes all objects, exes, and SWIG-generated files. Would you be willing to go with this approach?

The idea is that we only rm files that are in our project's .gitignore, ... which is not a bad thing, right?

Do you expect a user project's config / env-files to match patterns in our project's .gitignore?


Let me know your thoughts on this approach, and whether we really need to worry about config / env-files in the certifier root dir.
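The [Update] approach above could be sketched as follows. This is a minimal sketch, not the actual run_example.sh hook; rm_git_ignored_files is a hypothetical helper name. Only files matched by .gitignore (plus git's standard excludes) are removed, so tracked files and untracked-but-not-ignored user files are left alone:

```python
import os
import subprocess

def rm_git_ignored_files(repo_root: str = ".") -> list:
    """Delete files that git already ignores; return their relative paths."""
    # -z gives NUL-separated output, robust against spaces in filenames.
    out = subprocess.run(
        ["git", "ls-files", ".", "--exclude-standard", "--others",
         "--ignored", "-z"],
        cwd=repo_root, capture_output=True, check=True).stdout
    removed = []
    for rel in filter(None, out.decode().split("\0")):
        path = os.path.join(repo_root, rel)
        if os.path.isfile(path):  # only plain files; leave directories alone
            os.remove(path)
            removed.append(rel)
    return removed
```

Since the listing is driven entirely by .gitignore, a user's config / env-files would only be touched if they happen to match an ignore pattern, which keeps the cleanup conservative by construction.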