DataONEorg / slinky

Slinky, the DataONE Graph Store
Apache License 2.0
4 stars 4 forks source link

Feature deployment ordering & Environmental Variables #54

Closed ThomasThelen closed 2 years ago

ThomasThelen commented 2 years ago

Fixes #53 Fixes #51

Deployment Startup Order

This PR should fix any issues around startup order, meaning that we can start the scheduler, workers, database, etc in any order and the services that need particular services to come online will wait. What this looks like is the scheduler and workers both waiting for redis to come online before executing any job logic. The workers also wait for Virtuoso to come online once a connection to Redis is established to prevent any premature SPARQL inserts.

This allows us to decouple startup order from the kubernetes deployment, resulting in a few changes to the deployment files. I left the redis and virtuoso readiness probes in. I don't think there's much harm in letting other containers or developers know that they're not ready for interaction (note that the pods are in the Ready state while the code is waiting on redis).

The deployment order is invariant with respect to readiness probes. The initContainer on the scheduler however, does dictate startup behavior (it won't run the scheduler ENTRYPOINT until redis comes online). Since this logic is now in the scheduler itself, I removed the initContainer.

Environmental Variables

This PR also has the addition of a few environmental variables for specifying hosts and ports. These are stored in the settings and was done in ea6feb4fc07183e2a3a6b7a6b1e0a8d9356d5801.

Production/Development/CLI Environments

The ENVIRONMENTS dict was partially deprecated in the changes above and is fully removed in 3ff13e3e802bc5e8576ccd6317faa824fab1c0da. The SlinkyClient constructor default parameters now come from the environmental variables. To match this change, the calls to SlinkyClient in the cli were refactored.

Changes to Testing

I think that the ideal testing workflow is achieved through setting up pipenv before running pytest to isolate all of the dependancies. The RDF module requirement forces us to break this model because of the manual building & installing that needs to happen for its dependencies. This PR includes an additional container running the d1lod image alongside the graph store in the docker-compose file. This container has all of the test dependencies and can run the unit tests. The pain point is that you have to docker exec in to run them. I could just call pytest in the entrypoint - but the container will exit before the test logs can be checked.

It's not a perfect system, but it at least provides a reliable way to run the tests on any system and I'm open to other options/routes. There are some changes in the unit test fixtures that change the default blazegraph and virtuoso endpoints to the docker service name rather than localhost.

Testing

Testing for this mostly revolved around deploying things at different orders and making sure that the job system still functioned.

Scheduler, Redis, & Worker Tests

I did some manual testing, bringing the three services up in different orders which are bulleted below. In each case I confirmed with the logs that services waited for redis to come online. Because the retry timeout is set for 30 seconds, give the tests at least a minute before the thumbs up/down.

Virtuoso Startup Tests

Remaining Testing

amoeba commented 2 years ago

Thanks @ThomasThelen! I'll have a look this week.

ThomasThelen commented 2 years ago

Running the tests in a container is kludgy-but I don't see any other alternative for anyone that can't get the RDF package installed locally. I think that it should be runnable in both the container and outside of the container, given that the REDIS_HOST, REDIS_PORT, VIRTUOSO_HOST, and VIRTUOSO_PORT environmental variables are set to the right locations.

From the docker image,

Screen Shot 2021-12-08 at 1 04 25 PM
ThomasThelen commented 2 years ago

Actually I take that back. d1lod/tests/conftest.py has hardcoded service names that need to get changed to env vars

amoeba commented 2 years ago

Right, if you can't get the deps installed you can't run the tests. That's fine. To your last statement, I'd really like to be able to run the tests without having to set environmental variables. It makes me think we're missing a default or two somewhere. Is that what you're having to do?

ThomasThelen commented 2 years ago

The defaults are set to production values, which are service names. I left the redis default to None, which is the local graph store though.

One problem I'm seeing is in d1lod/tests/conftest.py with hardcoded service names. The issue is that we run two graphs in tests, but only have one environmental variable for the graph host (and if we want to run these on localhost we want to say the the host is on localhost not at http://blazegraph). I can add a second environmental variable, BLAZEGRAPH_HOST - but then we have unittest requirements trickling up into the deployment logic.

amoeba commented 2 years ago

Gotcha. Thanks for explaining.

I think it generally works better and follows norms if defaults are for local development. Reasoning being that local development is the same for everyone and its production that varies. Also makes it faster to get started and easier to do various tasks if development setup steps are minimal (e.g., I open up ~6 separate shells while developing and setting global env vars or per-shell env vars is just more work).

Re: Helping the tests pick the right graph store: It doesn't feel like we'd need another env var so I agree with you there. Both Stores could use the same env var, have appropriate sensible defaults, and have their tests could manage getting them the environment variables they need (which I think should be zero).

ThomasThelen commented 2 years ago

Re: Helping the tests pick the right graph store: It doesn't feel like we'd need another env var so I agree with you there. Both Stores could use the same env var, have appropriate sensible defaults, and have their tests could manage getting them the environment variables they need (which I think should be zero).

The issue is that both stores need their own environmental variable for tests. For production we only run one graph, so we have the single environmental variable. This causes problems in testing because we run two graphs-and since the second graph can be either at localhost:1234 or http://blazegraph:1234 we need a way to specify. Since they both run under the same instance, they can't share the same address. I can set the default to localhost so that it works for local tests, but then dockerized tests will fail.

ThomasThelen commented 2 years ago

In ebf2c1f I changed the default network location in settings.py to what I think they should be for running things locally. In the docker compose and kubernetes deployment files, they're set to the networked version (http://virtuoso, http://blazegraph, etc).

ThomasThelen commented 2 years ago

@amoeba Commit e3a8e62 should remove the failing unit test that we encountered earlier with the Graph class.

amoeba commented 2 years ago

Great, thanks @ThomasThelen. Do you want me review again?

ThomasThelen commented 2 years ago

Great, thanks @ThomasThelen. Do you want me review again?

I'd appreciate it if you would.

amoeba commented 2 years ago

I ran into two things:

ThomasThelen commented 2 years ago
  1. I have all the tests working while containerized-I'll try and take a look to see what's not working locally.

  2. I can adda flag to the SlinkyClient to use the in-memory store rather than a running service. This can be used by the cli get method to use the local store.

ThomasThelen commented 2 years ago

The redis issue should be resolved in e6a62be

The default store for the cli is now LocalStore, but can be overridden to use the environmental variables with --local False. I ran across a few issues while testing with the LocalStore; I think(I cant test previous versions without the dockerization) that these were pre-existing conditions; I can file issues around each of these if they're in fact bugs that .

To reproduce,

  1. Clone & checkout
  2. Build the d1lod image
  3. Modify the docker-compose file to use the local image build in step 2
  4. run docker-compose up
  5. docker exec into the image
  6. Run the commands below
  7. Run them again with --local false ie slinky insert --local false xyz and see them work

slinky insert urn:uuid:0910e8cf-9a67-43e1-b257-ebd4e3d66bca

root@ce762e064de4:/tmp# slinky insert urn:uuid:0910e8cf-9a67-43e1-b257-ebd4e3d66bca
Traceback (most recent call last):
  File "/usr/local/bin/slinky", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/cli.py", line 221, in insert
    client.process_dataset(id)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/client.py", line 57, in process_dataset
    self.store.insert_model(model)
TypeError: insert_model() missing 1 required positional argument: 'model'

slinky clear

root@ce762e064de4:/tmp# slinky clear
Traceback (most recent call last):
  File "/usr/local/bin/slinky", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/cli.py", line 142, in clear
    old_size = client.store.count()
AttributeError: type object 'LocalStore' has no attribute 'count'

slinky count will throw the same

Redis<ConnectionPool<Connection<host=redis,port=6379,db=0>>>
Traceback (most recent call last):
  File "/usr/local/bin/slinky", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/d1lod/cli.py", line 159, in count
    print(client.store.count())
AttributeError: type object 'LocalStore' has no attribute 'count'

slinky insertall (scroll right)

root@ce762e064de4:/tmp# slinky insertall
  0%|                                                                                                                                                                                                                 | 0/130 [00:00<?, ?it/s]Failed to insert doi:10.5063/F16971VP due to query() missing 1 required positional argument: 'query_text'
  1%|█▌                                                                                                                                                                                                       | 1/130 [00:00<00:54,  2.38it/s]Failed to insert doi:10.5063/F1XS5SP4 due to query() missing 1 required positional argument: 'query_text'
  2%|███                                                                                                                                                                                                      | 2/130 [00:00<00:54,  2.36it/s]Failed to insert doi:10.5063/F1T151XN due to query() missing 1 required positional argument: 'query_text'
  2%|████▋                                                                                                                                                                                                    | 3/130 [00:01<00:58,  2.15it/s]Failed to insert urn:uuid:230592c4-cf17-46e7-834b-26fd32731f56 due to query() missing 1 required positional argument: 'query_text'
  3%|██████▏                                                                                                                                                                                                  | 4/130 [00:01<00:47,  2.63it/s]Failed to insert doi:10.5063/F1R20ZMM due to query() missing 1 required positional argument: 'query_text'
  4%|███████▋                                                                                                                                                                                                 | 5/130 [00:02<01:03,  1.98it/s]Failed to insert doi:10.5063/F1BV7DWF due to query() missing 1 required positional argument: 'query_text'
  5%|█████████▎                                                                                                                                                                                               | 6/130 [00:02<00:59,  2.10it/s]Failed to insert doi:10.5063/F1736P5F due to query() missing 1 required positional argument: 'query_text'
  5%|██████████▊                                                                                                                                                                                              | 7/130 [00:03<01:04,  1.92it/s]Failed to insert doi:10.5063/F13B5XD8 due to query() missing 1 required positional argument: 'query_text'
  6%|████████████▎                                                                                                                                                                                            | 8/130 [00:03<01:04,  1.90it/s]Failed to insert doi:10.5063/F1ZK5DXZ due to query() missing 1 required positional argument: 'query_text'
amoeba commented 2 years ago

Thanks for the continued work on this @ThomasThelen.

For commands like clear, count, insert, and insertall, using a the LocalStore doesn't really make sense to me since LocalStore is ephemeral (i.e., thrown away when the command exits). Their intended target is going to be some persistent triple store. On the other hand, get only makes sense with a LocalStore because it's generating triples on the fly.

If the above commands all produce stack traces without --local false, I think we're making the CLI harder to use. Do you think the flag is necessary or could the code just follow the logic in my first paragraph here? Is the explicitness of the flag better? If you do think a flag is necessary, maybe it could be --remote instead of --local false since I think the former pattern is more common for boolean flags.

PS: I made one PR comment that might be a culprit. Does that change any of the above output?

ThomasThelen commented 2 years ago

Alright with 57dd95e, using get will always use the local store. All other commands no longer have the option for using LocalStore. If there's still issues, maybe a complete documentation of what each command is and how it's supposed to be used will help.

Tested using docker-compose and deployed to dev.

amoeba commented 2 years ago

Thanks @ThomasThelen. I tested this out and I had two make two changes, https://github.com/DataONEorg/slinky/pull/54/commits/d6b8067a1036ced78e361c4c13c5fc6bcc8cd050 and https://github.com/DataONEorg/slinky/pull/54/commits/7033442916608d20ea0e70f7ca8cad2e7c8cb36b to make the tests pass locally. They look minor and it doesn't look like they'll affect anything but local development.