ge-high-assurance / RACK

DARPA's Automated Rapid Certification of Software (ARCOS) project called Rapid Assurance Curation Kit (RACK)

RACK docker image doesn't work on Apple Silicon #407

Open Robert-Adelard opened 3 years ago

Robert-Adelard commented 3 years ago

If I try to run the RACK docker image on my MacBook M1, which has an ARM processor, it fails silently but stays running.

In particular, Docker Desktop shows the Docker image running, but it is unresponsive.

I would be happy to try and debug this if you can tell me how to proceed.

In the meantime, is it possible to build a version of the RACK image for ARM64? Or could you tell me how to do this for myself?

As a workaround, I can run the RACK on my MacBook by adding --platform linux/amd64 to the docker command, which runs the Intel image under emulation, as per the instructions here:

https://docs.docker.com/docker-for-mac/apple-silicon/

However, this is obviously less efficient, so an ARM64 build would be preferable.
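
As a quick check of whether an image will run natively or under emulation, the host architecture can be compared with the image's platform. This is only a sketch: `platform_for_arch` and `needs_emulation` are hypothetical helper names, and the commented-out usage assumes Docker is installed and that `IMAGE` is replaced with the image you actually pulled.

```shell
# Map a CPU architecture name (as printed by `uname -m`) to a Docker platform string.
platform_for_arch() {
    case "$1" in
        x86_64|amd64)  echo "linux/amd64" ;;
        arm64|aarch64) echo "linux/arm64" ;;
        *)             echo "unknown" ;;
    esac
}

# Report whether an image built for platform $2 runs natively on host architecture $1.
needs_emulation() (
    host_platform=$(platform_for_arch "$1")
    if [ "$host_platform" = "$2" ]; then
        echo "native"
    else
        echo "emulated"
    fi
)

# Usage (requires Docker; IMAGE is a placeholder):
#   image_platform=$(docker image inspect --format '{{.Os}}/{{.Architecture}}' IMAGE)
#   needs_emulation "$(uname -m)" "$image_platform"
```

On an M1 MacBook `uname -m` prints `arm64`, so an image reported as `linux/amd64` would come back as `emulated`.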

Thank you.

Robert-Adelard commented 3 years ago

With further experience, I'm not sure that adding --platform linux/amd64 is sufficient - in particular, although I can visit the home page of the RACK running on my MacBook, if I try to contact the NodeGroupService via the REST API, I get an error message to say that the connection was closed unexpectedly.

I would be happy to try and debug this, but I need some advice on how to proceed.

glguy commented 3 years ago

@tuxji Is an ARM64 build an option for us?

tuxji commented 3 years ago

I did some googling on building multi-architecture Docker images with GitHub Actions. GitHub Actions CI doesn't have any Apple M1 Silicon MacOS runners at this time, only Intel x64 MacOS runners. We can't use "docker buildx" to build a multi-architecture image either because we actually build the RACK box with Packer's Docker builder, not with docker itself. The only remaining way I can see to build a linux/arm64 image is for @Robert-Adelard to download Packer and all the necessary files (into RACK/rack-box/files) and run "packer build rack-box-docker.json" in RACK/rack-box manually on his own MacBook M1 as described in rack-box's README. Robert, I think Packer's Docker builder would build a linux/arm64 image on your MacBook M1 for you and then you could simply run that image in Docker.

We also build a rack-box VirtualBox image, but it doesn't look like there are any plans to make VirtualBox work on the MacBook M1. You could install Ubuntu x64 in Parallels and try to run either Docker or VirtualBox within that, but I'm not sure whether that doubly nested virtualization would work or run fast enough.

Robert-Adelard commented 3 years ago

Hi John,

Thanks - that's helpful. Unfortunately, when I tried to build a Docker image for linux/arm64 following the instructions, the build process hung at the point where it starts the nodeGroupService.

Here is the relevant part of the output:

    [...]
    docker: Creating home directory `/home/ubuntu' ...
    docker: Copying files from `/etc/skel' ...
    docker: Adding system user `fuseki' (UID 105) ...
    docker: Adding new group `fuseki' (GID 106) ...
    docker: Adding new user `fuseki' (UID 105) with group `fuseki' ...
    docker: Not creating home directory `/home/fuseki'.
    docker: /home/ubuntu/semtk-opensource /home/ubuntu/semtk-opensource
    docker: Returning (already ran .env)
    docker: /home/ubuntu/semtk-opensource
    docker: sparqlGraph/main-oss/sparqlgraphconfigOss.js
    docker: sparqlGraph/main-oss/KDLEasyLoggerConfigOss.js
    docker: sparqlForm/main-oss/sparqlformconfig.js
    docker: sparqlForm/main-oss/KDLEasyLoggerConfig.js
    docker: ./updateWebapps.sh done
    docker: Waiting for Fuseki at http://localhost:3030...
    docker: Waiting for nodeGroupService at http://localhost:12059...
    docker: Error: Took longer than 600 seconds to start nodeGroupService
==> docker: Provisioning step had errors: Running the cleanup provisioner, if present...
==> docker: Killing the container: 4e3e98c11428fad90a24f197e8e535deef733d7d19009985ee78c7e807a22e5e
Build 'docker' errored after 11 minutes 40 seconds: Script exited with non-zero exit status: 1.Allowed exit codes are: [0]

==> Wait completed after 11 minutes 40 seconds

==> Some builds didn't complete successfully and had errors:
--> docker: Script exited with non-zero exit status: 1.Allowed exit codes are: [0]

==> Builds finished but no artifacts were created.

I was also unable to build a binary for the RACK CLI - I wasn't sure why this step was necessary, but I've been having problems installing the relevant Python packages, so if the build process uses the RACK CLI internally, this might explain why it failed.

Robert-Adelard commented 3 years ago

I can certainly try running the RACK image on a virtual Ubuntu or Windows system, but I'm not sure what effect the virtualisation would have on performance.

Do you have any suggestions for debugging either the RACK build for ARM or why the x86 version of the RACK doesn't work if I use --platform linux/amd64 to run the Intel image under emulation?

If I log on to the x86 version of the RACK, ps ax gives the following output - does this look normal?

  PID TTY      STAT   TIME COMMAND
    1 pts/0    Ssl+  17:06 /usr/bin/qemu-x86_64 /usr/bin/python3 /usr/bin/systemctl
  588 ?        Ssl    0:00 /usr/bin/qemu-x86_64 /usr/sbin/nginx -g daemon on; master_process on;
  591 ?        Sl     0:00 /usr/bin/qemu-x86_64 /usr/sbin/nginx -g daemon on; master_process on;
  592 ?        Sl     0:00 /usr/bin/qemu-x86_64 /usr/sbin/nginx -g daemon on; master_process on;
  593 ?        Sl     0:00 /usr/bin/qemu-x86_64 /usr/sbin/nginx -g daemon on; master_process on;
  594 ?        Sl     0:00 /usr/bin/qemu-x86_64 /usr/sbin/nginx -g daemon on; master_process on;
  668 ?        Ssl  120:08 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
  705 ?        Ssl    1:27 /usr/bin/qemu-x86_64 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
 6430 ?        Ssl   95:10 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
 6611 ?        Ssl   94:33 /usr/bin/qemu-x86_64 /usr/bin/java -Xmx4G -cp /opt/fuseki/fuseki-server.jar org.apache.jena.fuseki.cmd.FusekiCmd
19124 ?        Ssl    1:10 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19149 ?        Ssl    0:34 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19169 ?        Ssl    0:28 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19194 ?        Ssl    0:21 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19217 ?        Ssl    0:16 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19237 ?        Ssl    0:18 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19261 ?        Ssl    0:04 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19283 pts/1    Ssl    0:00 /usr/bin/qemu-x86_64 /bin/sh
19294 ?        Ssl    0:00 /usr/bin/qemu-x86_64 /usr/bin/java org.springframework.boot.loader.JarLauncher
19315 ?        Rl+    0:00 /usr/bin/ps ax

Are there any log files I can check to see what is going on?

Thanks again for your suggestions.

tuxji commented 3 years ago

So, it sounds like the following problems may be occurring:

  1. A possible networking problem preventing two Java processes (Fuseki and nodeGroupService) from being able to connect to each other or some other reason why nodeGroupService didn't come up and start listening on its port. Our rack-box install script would have called the RACK CLI at a later step, but you didn't get that far yet.
  2. Your problem installing the relevant Python packages in order to build the RACK CLI is another obstacle. I don't know if that's because some Python packages lack some necessary arm64 or Apple M1 Silicon support (a few Python packages need to include native code binaries which may be what's causing the problems) or because you need to use some special flags. A further complication is that when you run Python under MacOS on your MacBook M1, you run/build darwin/arm64 native code, but when you run Python under Docker on your MacBook M1, you run/build linux/arm64 native code. These might not be compatible with each other anyway.

The rack-box Docker image redirects each service's logging to individual files in /var/log/journal named after each service. Your ps ax output shows that the x86 version of RACK was able to start the Docker container's first or "init" process (python systemctl) which was able to start the nginx web server, Fuseki, and the SemTK services. In the same CLI terminal window that you already ran ps ax, please cd into /var/log/journal and look for error messages in the service log files.
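
To make that search less manual, here is a minimal sketch that counts matching lines per service log. It assumes the `*.service.log` file naming that appears in the listings in this thread and a POSIX shell with `grep` inside the container. `count_log_matches` is a hypothetical helper name, not something shipped with rack-box.

```shell
# Print "<count> <file>" for every *.service.log in directory $1 that contains
# lines matching the case-insensitive extended regex $2 (default: exception or error).
count_log_matches() (
    log_dir=$1
    pattern=${2:-"exception|error"}
    for f in "$log_dir"/*.service.log; do
        [ -f "$f" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
        n=$(grep -E -i -c "$pattern" "$f" || true)
        if [ "$n" -gt 0 ]; then
            echo "$n $(basename "$f")"
        fi
    done
)

# Usage inside the container:
#   count_log_matches /var/log/journal
#   count_log_matches /var/log/journal 'warn'
```

Files with no matching lines are omitted, so an empty result means none of the logs contain the pattern.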

You also can examine the same log files in the Packer build by running "packer build -on-error=ask rack-box-docker.json". That option tells Packer to leave the Docker container running while it waits for you to type your reply. That gives you as much time as you need to get a CLI inside the container, cd into /var/log/journal, and look at the log files.

You also may need to run an Ubuntu 20.04 container in Docker, get a CLI inside the container, clone the RACK repository, and run the python commands to build the RACK CLI inside the container so that Python runs and builds linux/arm64 native code instead of darwin/arm64 native code. Then you can tar up the RACK CLI and copy it to your MacBook's RACK/rack-box/files directory so Packer's Docker builder can unpack it in the new image it builds. If that solves one of your problems, we'll have to automate building the RACK CLI with a single command you can run on the MacBook.

Robert-Adelard commented 3 years ago

Hi John,

Thanks for your analysis and suggestions. Here is some information about what I found when I looked in /var/log/journal on my x86 version of the RACK:

Firstly, here is the output of ls -tl to give you an idea of the size of the log files:

-rw-r--r-- 1 root systemd-journal  400519 May 27 13:30 ontologyInfoService.service.log
-rw-r--r-- 1 root systemd-journal 7156696 May 27 13:10 nodeGroupStoreService.service.log
-rw-r--r-- 1 root systemd-journal 1556558 May 27 12:50 utilityService.service.log
-rw-r--r-- 1 root systemd-journal   65379 May 27 12:49 nodeGroupService.service.log
-rw-r--r-- 1 root systemd-journal   59034 May 27 12:35 sparqlQueryService.service.log
-rw-r--r-- 1 root systemd-journal   65602 May 27 12:30 sparqlExtDispatchService.service.log
-rw-r--r-- 1 root systemd-journal   87908 May 27 11:21 sparqlGraphStatusService.service.log
-rw-r--r-- 1 root systemd-journal   72263 May 27 11:20 sparqlGraphIngestionService.service.log
-rw-r--r-- 1 root systemd-journal   15022 May 26 17:54 nodeGroupExecutionService.service.log
-rw-r--r-- 1 root systemd-journal     554 May 26 17:52 configSemTK.service.log
-rw-r--r-- 1 root systemd-journal   14636 May 21 17:09 sparqlGraphResultsService.service.log
-rw-r--r-- 1 root systemd-journal       0 May 21 17:08 unattended-upgrades.service.log
-rw-r--r-- 1 root systemd-journal       0 May 21 17:08 nginx.service.log
-rw-r--r-- 1 root systemd-journal  702554 May  3 17:05 fuseki.service.log

Reviewing the log files, it looks as though the services are starting up repeatedly, often without error - I see lots of entries like this:


  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.2.6.RELEASE)

2021-05-27 11:15:33.528  INFO 18525 --- [           main] c.g.r.s.s.dispatch.ServiceApplication    : Starting ServiceApplication on 33e90726251c with PID 18525 (/home/ubuntu/semtk-opensource/sparqlExtDispatchService/BOOT-INF/classes started by ubuntu in /home/ubuntu/semtk-opensource/sparqlExtDispatchService)
2021-05-27 11:15:33.735  INFO 18525 --- [           main] c.g.r.s.s.dispatch.ServiceApplication    : No active profile set, falling back to default profiles: default

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.2.6.RELEASE)

2021-05-27 11:21:46.551  INFO 19077 --- [           main] c.g.r.s.s.dispatch.ServiceApplication    : Starting ServiceApplication on 33e90726251c with PID 19077 (/home/ubuntu/semtk-opensource/sparqlExtDispatchService/BOOT-INF/classes started by ubuntu in /home/ubuntu/semtk-opensource/sparqlExtDispatchService)
2021-05-27 11:21:46.847  INFO 19077 --- [           main] c.g.r.s.s.dispatch.ServiceApplication    : No active profile set, falling back to default profiles: default

However, sometimes there is an error message.

I don't know if this is normal behaviour, but it seems odd.

Looking at the log files in reverse order, sparqlGraphResultsService occasionally falls over because of a missing configuration file:

org.springframework.beans.factory.BeanDefinitionStoreException: Failed to parse configuration class [com.ge.research.semtk.services.results.ServiceApplication]; 
Caused by: java.io.FileNotFoundException: class path resource [com/ge/research/semtk/logging/easyLogger/EasyLogEnabledConfigProperties.class] cannot be opened because it does not exist

configSemTK and nodeGroupExecutionService look fine.

sparqlGraphIngestionService fails because of a problem with a configuration file:

Error creating bean with name 'apiDocumentationScanner' defined in URL [jar:file:/home/ubuntu/semtk-opensource/sparqlGraphIngestionService/BOOT-INF/lib/springfox-spring-web-2.9.2.jar!/springfox/documentation/spring/web/scanners/ApiDocumentationScanner.class]

Error creating bean with name 'apiListingScanner' defined in URL [jar:file:/home/ubuntu/semtk-opensource/sparqlGraphIngestionService/BOOT-INF/lib/springfox-spring-web-2.9.2.jar!/springfox/documentation/spring/web/scanners/ApiListingScanner.class]

Error creating bean with name 'apiModelReader' defined in URL [jar:file:/home/ubuntu/semtk-opensource/sparqlGraphIngestionService/BOOT-INF/lib/springfox-spring-web-2.9.2.jar!/springfox/documentation/spring/web/scanners/ApiModelReader.class]

Error creating bean with name 'cachingModelProvider' defined in URL [jar:file:/home/ubuntu/semtk-opensource/sparqlGraphIngestionService/BOOT-INF/lib/springfox-schema-2.9.2.jar!/springfox/documentation/schema/CachingModelProvider.class]

Error creating bean with name 'typeResolver' defined in springfox.documentation.schema.configuration.ModelsConfiguration

Failed to instantiate [com.fasterxml.classmate.TypeResolver]: Factory method 'typeResolver' threw exception

Caused by: java.lang.ClassFormatError: Illegal class name "java/lang/Class[]" in class file springfox/documentation/schema/configuration/ModelsConfiguration$$FastClassBySpringCGLIB$$9f784d07

There are similar error messages in sparqlGraphStatusService.service.log and sparqlExtDispatchService.service.log.

sparqlQueryService fails for a different reason:

Exception encountered during context initialization - cancelling refresh attempt
I/O failure during classpath scanning
Caused by: java.net.MalformedURLException: Jar URL does not contain !/ separator

This is followed by

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000401b4e2670, pid=21855, tid=21923
#
# JRE version: OpenJDK Runtime Environment (11.0.11+9) (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 1607 c2 java.lang.AbstractStringBuilder.append(I)Ljava/lang/AbstractStringBuilder;[thread 21858 also had an error]
 java.base@11.0.11 (55 bytes) @ 0x000000401b4e2670 [0x000000401b4e2640+0x0000000000000030]
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/ubuntu/semtk-opensource/sparqlQueryService/hs_err_pid21855.log

which clearly doesn't look good!

nodeGroupService fails for the same reason (Jar URL does not contain !/ separator), but without the fatal error.

utilityService fails for a different reason:

Exception encountered during context initialization - cancelling refresh attempt
Failed to parse configuration class [com.ge.research.semtk.services.utility.ServiceApplication]
Caused by: java.io.FileNotFoundException: class path resource [org/springframework/context/ApplicationListener.class] cannot be opened because it does not exist

nodeGroupStoreService complains that it can't find a logging implementation, but seems OK otherwise:

ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...

Finally, ontologyInfoService fails with a variation of the Jar URL problem:

java.lang.IllegalArgumentException: Unable to load @ConditionalOnClass location [META-INF/spring-autoconfigure-metadata.properties]
Caused by: java.net.MalformedURLException: Jar URL does not contain !/ separator

Any thoughts on what all of this might mean? It looks like there might be some kind of Java incompatibility - I've installed various versions of OpenJDK on my MacBook, although I would expect the Docker image to be using its own version of Java.

For what it's worth, my MacBook reports:

$ java --version
openjdk 16.0.1 2021-04-20
OpenJDK Runtime Environment Zulu16.30+15-CA (build 16.0.1+9)
OpenJDK 64-Bit Server VM Zulu16.30+15-CA (build 16.0.1+9, mixed mode)

The RACK reports a different version:

# java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)

Thanks for your help.

tuxji commented 3 years ago

Hi Robert, the version of Java on your MacBook doesn't matter. When you use Packer's Docker builder, it doesn't compile any Java source files or run any Java programs on your MacBook. Rather, it starts up an Ubuntu container and runs an install script to turn the original container into a rack-box image by unpacking various files including already built tar.gz files and jar files and initializing the RACK database. The container, whether it's the original container started by Packer or a new container you start from the rack-box image, uses its own version of java already installed inside the container which is Java 11 (why 11? It's the most recent long term stable version of Java).

When I look at your output, the first thing I notice is how old some of your log files are. Regardless of whether I run a gehighassurance/rack-box:v6.0 or dev image, all my log files have today's date (May 28). Your log files have dates of May 3, 21, and 26 as well as May 27. When troubleshooting, you want to start up a fresh new container so you avoid looking at outdated logs. Remember you can run only one rack-box at a time because each container tries to bind to the same port numbers.

Second, you can expect to see two startups in each log file even after a fresh start up. The install script run by Packer's Docker build started up everything the first time in order to initialize the database before Packer shut down the container and created the rack-box image.

Third, the last two lines of most services' normal error-free startups should look like this:

root@fd1a0d45aec8:/var/log/journal# tail -n 2 *
==> configSemTK.service.log <==
sparqlForm/main-oss/sparqlformconfig.js
sparqlForm/main-oss/KDLEasyLoggerConfig.js

==> fuseki.service.log <==
[2021-05-28 15:55:01] Fuseki     WARN  SPARQL Update: Unrecognized request parameter (ignored): format
[2021-05-28 15:55:01] Fuseki     INFO  [49] 200 OK (84 ms)

==> nginx.service.log <==

==> nodeGroupExecutionService.service.log <==
2021-05-28 15:54:52.671  INFO 253 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 12058 (http) with context path ''
2021-05-28 15:54:52.679  INFO 253 --- [           main] c.g.r.s.s.n.ServiceApplication           : Started ServiceApplication in 9.56 seconds (JVM running for 10.562)

==> nodeGroupService.service.log <==
2021-05-28 15:54:56.415  INFO 295 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 12059 (http) with context path ''
2021-05-28 15:54:56.432  INFO 295 --- [           main] c.g.r.s.s.n.ServiceApplication           : Started ServiceApplication in 12.316 seconds (JVM running for 13.807)

==> nodeGroupStoreService.service.log <==
 (DONE)
2021-05-28 15:55:00 loading demo...

==> ontologyInfoService.service.log <==
2021-05-28 15:54:50.332  INFO 211 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 12057 (http) with context path ''
2021-05-28 15:54:50.339  INFO 211 --- [           main] c.g.r.s.s.o.ServiceApplication           : Started ServiceApplication in 7.966 seconds (JVM running for 8.759)

==> sparqlExtDispatchService.service.log <==
2021-05-28 15:54:58.891  INFO 344 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 12053 (http) with context path ''
2021-05-28 15:54:58.940  INFO 344 --- [           main] c.g.r.s.s.dispatch.ServiceApplication    : Started ServiceApplication in 13.188 seconds (JVM running for 15.3)

==> sparqlGraphIngestionService.service.log <==
2021-05-28 15:55:00.169  INFO 380 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 12091 (http) with context path ''
2021-05-28 15:55:00.185  INFO 380 --- [           main] c.g.r.s.s.ingestion.ServiceApplication   : Started ServiceApplication in 13.795 seconds (JVM running for 16.03)

==> sparqlGraphResultsService.service.log <==
2021-05-28 15:55:01 Deleted jobs from triplestore: FILTER(?creationTime < '2021-05-28T07:55:01'^^<http://www.w3.org/2001/XMLSchema#dateTime>)
2021-05-28 15:55:01 Clean up about to sleep for 480.0 minutes.

==> sparqlGraphStatusService.service.log <==
2021-05-28 15:55:01.400  INFO 437 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 12051 (http) with context path ''
2021-05-28 15:55:01.405  INFO 437 --- [           main] c.g.r.s.s.status.ServiceApplication      : Started ServiceApplication in 12.752 seconds (JVM running for 16.226)

==> sparqlQueryService.service.log <==
2021-05-28 15:55:01.369  INFO 467 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 12050 (http) with context path ''
2021-05-28 15:55:01.375  INFO 467 --- [           main] c.g.r.s.s.sparql.ServiceApplication      : Started ServiceApplication in 11.921 seconds (JVM running for 15.691)

==> unattended-upgrades.service.log <==

==> utilityService.service.log <==
2021-05-28 15:55:02 utility.sparqlServiceServer: 172.17.0.2
2021-05-28 15:55:02 --------------------------------------
root@fd1a0d45aec8:/var/log/journal#

Fourth, I don't have any exceptions or errors in my logs. I only have some warnings in fuseki's log file after starting my v6.0 image; all my other log files are clean.

root@fd1a0d45aec8:/var/log/journal# grep -i exception *
root@fd1a0d45aec8:/var/log/journal# grep -i error *
root@fd1a0d45aec8:/var/log/journal# grep -i warn *
fuseki.service.log:[2021-05-03 17:04:03] Fuseki     WARN  SPARQL Update: Unrecognized request parameter (ignored): format
... (lots of similar lines elided)
fuseki.service.log:[2021-05-28 15:55:01] Fuseki     WARN  SPARQL Update: Unrecognized request parameter (ignored): format
root@fd1a0d45aec8:/var/log/journal#
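
The startup check described above can be partly automated. The sketch below reports, for each service log, whether the most recent `Starting ServiceApplication` line was followed by a `Started ServiceApplication` line, using the message text that appears in the logs quoted in this thread. Logs that contain neither message are reported as `n/a`. The helper names are hypothetical, and the sketch assumes a POSIX shell with `awk` inside the container.

```shell
# Print "started", "incomplete", or "n/a" for a single log file, based on
# whether the last start attempt reached the "Started ServiceApplication" line.
last_startup_status() (
    awk '
        /Starting ServiceApplication/ { state = "incomplete" }
        /Started ServiceApplication/  { state = "started" }
        END { print (state == "" ? "n/a" : state) }
    ' "$1"
)

# Print one "<file>: <status>" line for every *.service.log in directory $1.
report_startups() (
    for f in "$1"/*.service.log; do
        [ -f "$f" ] || continue
        echo "$(basename "$f"): $(last_startup_status "$f")"
    done
)

# Usage inside the container:
#   report_startups /var/log/journal
```

A status of `incomplete` only means the last start attempt never logged its completion message. The cause still has to be read from the log itself.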

It really looks like trying to run the x86 version of RACK on your MacBook simply won't work. I would concentrate on trying to build an arm64 version of RACK with Packer's Docker builder. When you run into problems, you can use packer build --on-error=ask and type CLI commands inside the container to troubleshoot problems, retry failed commands with different arguments, etc. Any changes should go back into RACK/rack-box/script/install.sh so all of us can build the rack-box image reproducibly.

Robert-Adelard commented 3 years ago

Hi John,

Thanks for your comments and the additional information. I will restart my Docker image and get a clean set of logs, but I suspect you are right about the x86 version of the RACK simply not working on my MacBook. I read somewhere that OpenJDK and Rosetta (the x86 translation process) do not work well together - if this is the case, then attempts to get the x86 version working are doomed, so it is best to concentrate on building an arm64 version of the RACK as you suggest.

I'll let you know how I get on.

Robert-Adelard commented 3 years ago

Hi John,

I've managed to build a version of the RACK that runs natively on my MacBook M1. However, I ran into a few issues along the way:

  1. I needed to increase the resources allocated to Docker, otherwise, I was getting intermittent "Connection refused" exceptions from localhost:3030 (Fuseki)
  2. I needed to delete the else branch of the if statement in scripts/clean.sh (lines 22-24) because linux-cloud-tools-virtual was not installed (contrary to the comment at lines 13-14)
  3. The instructions to build the RACK documentation do not work, so I have no documentation, just a file that contains 1.0.1.

Please could you check the instructions for building the documentation - I have cloned RACK.wiki, but there is no Welcome.md at the top level, just a _Welcome.md, which doesn't work either. The markdown command runs without error, but the output is a file containing the string 1.0.1, which isn't very helpful!

However, despite the lack of an index page with helpful links, I can run Swagger UI and SPARQLgraph, so my RACK seems to be working.

Thanks.

Robert-Adelard commented 3 years ago

P.S. I note your suggestion from an earlier message:

> You also may need to run an Ubuntu 20.04 container in Docker, get a CLI inside the container, clone the RACK repository, and run the python commands to build the RACK CLI inside the container so that Python runs and builds linux/arm64 native code instead of darwin/arm64 native code. Then you can tar up the RACK CLI and copy it to your MacBook's RACK/rack-box/files directory so Packer's Docker builder can unpack it in the new image it builds. If that solves one of your problems, we'll have to automate building the RACK CLI with a single command you can run on the MacBook.

Given that the binary version of the RACK CLI needs to run on a Linux box, I think this step is probably essential. The documentation implicitly assumes that you are building your RACK Box on Linux rather than Windows or MacOS X.

Would it be possible to build the binary version of the RACK CLI on the Linux docker image itself? This would avoid the cross-platform issue.

Robert-Adelard commented 3 years ago

I've also had another look at why the standard x86 RACK image fails if I run it with --platform linux/amd64 and I'm getting a whole series of weird errors during Spring Boot initialisation, for example:

nodeGroupExecutionService.service.log

Exception encountered during context initialization - cancelling refresh attempt: 
Failed to parse configuration class [com.ge.research.semtk.services.nodeGroupExecution.ServiceApplication]; 
class path resource [com/ge/research/semtk/properties/SemtkEndpointProperties.class] cannot be opened because it does not exist

nodeGroupService

Error starting ApplicationContext. 
Application run failed
I/O failure during classpath scanning
Jar URL does not contain !/ separator

nodeGroupStoreService

Error starting ApplicationContext. 
Application run failed
Error creating bean with name 'apiDocumentationScanner' [...]
Error creating bean with name 'apiListingScanner' [...]
Error creating bean with name 'apiModelReader' [...]
Error creating bean with name 'cachingModelProvider' [...]
Error creating bean with name 'defaultModelProvider' [...]
Error creating bean with name 'typeResolver' [...]
Failed to instantiate [com.fasterxml.classmate.TypeResolver]: Factory method 'typeResolver' threw exception; 
java.lang.ClassFormatError-->Illegal class name "java/lang/Class[]" 

I'm aware that Spring Boot does all sorts of weird and wonderful things to bootstrap Java applications - it looks as though these go beyond the scope of what Apple's Rosetta emulation layer can handle.

So it looks as though running an x86 version of the RACK box under emulation is not going to work, and the best solution is to build an ARM version of the RACK box.

tuxji commented 3 years ago

Hi Robert,

  1. How much did you need to increase the resources allocated to Docker?

  2. I've changed the else branch to an elif branch so it will uninstall linux-cloud-tools-virtual only if it was already installed.

  3. I've improved the instructions on how to build the RACK documentation - please see the "Package RACK documentation" section.

Given the next release is only a day away, I didn't make more sweeping changes like building the RACK CLI using a Docker image instead of directly on Linux.

Robert-Adelard commented 3 years ago

> How much did you need to increase the resources allocated to Docker?

My laptop is relatively new, so I hadn't changed the default settings, which are something like 2 CPUs, 2GB. I increased them to 4 CPUs, 6GB.

> I've changed the else branch to an elif branch so it will uninstall linux-cloud-tools-virtual only if it was already installed.

Thanks - that seems to have cured the problem.

> I've improved the instructions on how to build the RACK documentation - please see the "Package RACK documentation" section.

These now work, although you need to add a -g flag to install the markdown command, as per instructions here:

https://github.com/cwjohan/markdown-to-html#Installation

> Given the next release is only a day away, I didn't make more sweeping changes like building the RACK CLI using a Docker image instead of directly on Linux.

I understand, but I don't think this would be a difficult change to make. I'll create a separate issue to track it.

I also think you could automate the creation of the RACK tar files.

Robert-Adelard commented 3 years ago

Oh dear - I spoke too soon. I have an ARM version of the RACK box running on my MacBook, and everything appears to be working as it should, but most of the data is missing.

The failure is very specific and might provide a clue about what has gone wrong - if I ask SPARQL Graph to tell me about all the THINGs in the database, I only get THINGs with a specified object reference, not THINGs with an auto-generated object reference.

For example, I see...

http://arcos.rack/SOFTWARE#SourceFunction

but not

http://semtk.research.ge.com/generated#707c7037-c8b5-4f48-99a6-b31f80b4d700

Any thoughts on what might be wrong? Which log file should provide a clue?

Thanks

tuxji commented 3 years ago

These resource increases are in line with what rack-box/Docker-Hub-README.md recommends, then.

These npm install commands do have -g options:

.github/workflows/actions/download/action.yml
60:        sudo npm install -g github-wikito-converter markdown-to-html

rack-box/README.md
93:    sudo npm install -g github-wikito-converter markdown-to-html

I've added a comment to your issue (thanks for creating it).

tuxji commented 3 years ago

I don't know what might be causing the absence of the auto-generated object references but I hope people more familiar with that part of rack-box like @cuddihyge may have thoughts. Can you get a good set of log files from running an Intel64 rack-box image on a Windows or Linux box and then compare these "known good" log files with your ARM64 rack-box image's log files? I don't know how useful the Fuseki log file is but it might have more details than the other log files. Also, we haven't upgraded Fuseki for a while; it's now 4.0.0 and we are still using 3.16.0. An upgrade could break things even worse, but might help.
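
One way to make that comparison is to list the Java exception and error class names that occur in the ARM64 logs but not in the "known good" logs. The sketch below assumes both sets of `*.service.log` files have been copied into two local directories, and that `grep -o`, `sort`, `comm`, and `mktemp` are available. `exception_names` and `new_exceptions` are hypothetical helper names.

```shell
# Print the distinct Java exception/error class names found in the logs under $1.
exception_names() (
    cat "$1"/*.service.log 2>/dev/null |
        grep -E -o '[A-Za-z_$][A-Za-z0-9_$.]*(Exception|Error)' |
        sort -u
)

# Print class names that occur in the suspect logs ($2) but not in the known-good logs ($1).
new_exceptions() (
    good=$(mktemp) && suspect=$(mktemp) || exit 1
    exception_names "$1" > "$good"
    exception_names "$2" > "$suspect"
    # comm -13 keeps only the lines unique to the second (suspect) file.
    comm -13 "$good" "$suspect"
    rm -f "$good" "$suspect"
)

# Usage (directory names are placeholders):
#   new_exceptions ./logs-intel64 ./logs-arm64
```

This only compares which class names appear, so it will not catch a service that fails quietly without logging an exception. Missing data of the kind described above may need a closer look at the Fuseki and ingestion logs.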