dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems
Other
187 stars 61 forks source link

Triage database provisioner #840

Closed thcrock closed 3 years ago

thcrock commented 3 years ago

Not quite ready for merge (I would add some documentation around this and maybe clean up the code a bit), but I wanted to get thoughts on this interface for a local Dockerized database provisioner. The idea is Triage users could use this to easily set up a database outside of a Dirtyduck context but without having to do Postgres on their own. With just docker as a dependency, they can:

  1. Bring up the thing from scratch; build the image, run Postgres through the image, and write a database.yaml file using the credentials created by the script.
  2. Start the container if if stopped
  3. Rebuild the container if it was removed

All using the command 'triage db up'.

Note: I don't know what's going on with the Crosstabs CLI but I couldn't get this to run while using those imports. Did that crosstab code get removed? It's obviously unrelated to this PR but if I can't run the CLI with those lines in there I had to take them out.

nanounanue commented 3 years ago

Thanks for this @thcrock

  1. The idea here is to create just a dockerized DB or create the dockerized dirtyduck DB?
  2. I am not a big fan of database.yaml thing (is another file, prone to errors, security risk, etc) but if we want to keep that interface we should support the environment variables (DATABASE_URL or the PG_... ones)
nanounanue commented 3 years ago

@thcrock weird the crosstabs thingy, last week I ran it from the repo without problems. I will check it to track the bug

thcrock commented 3 years ago

It does appear that I'm the only one who's had the crosstabs problem so I will try building my environment again.

As for environment variables: we do support environment variables for database credentials in Triage, but this database provisioner is just creating the credentials, not using them. Any way of attempting to automatically set the environment variables for the user would be shell-specific and OS-specific, and not work until they restart the shell. Do you have something in mind? Just print the credentials to screen and tell the user to set those environment variables?

thcrock commented 3 years ago

@nanounanue about the first question: This came up at the meeting. Ultimately it seems the team wants to use it for Dirty duck (as in, remove the docker-compose and bastion dependencies from Dirty Duck). That would require much more work than is present here. However, we did decide that we can merge this PR, as currently scoped, to just be an option for regular Triage users for now. So you might do this once you finish the tutorial and want to start running Triage on your own data.

shaycrk commented 3 years ago

Yeah, just installed triage from master in a new environment and didn't run into issues with crosstabs in the CLI either.

In terms of testing out here, I did a pip install git+https://github.com/dssg/triage.git@triage_db in a new environment and then tried running triage db up but hit this error:

(temp-3.7.2) code $ triage db up
[ProcessExecutionError] Command line: ['/usr/local/bin/docker', 'build', './dirtyduck/food_db/', '-t', 'triage_db']
Exit code: 1
Stderr:  | unable to prepare context: path "./dirtyduck/food_db/" not found

I think we might need to change it to look for food_db relative to the code path?

thcrock commented 3 years ago

@shaycrk So far I've been unable to get the contents of the dirtyduck folder to show up in the installed package, which is necessary for that docker image to be buildable.

But maybe I don't need to. For this purpose, are PostGIS and the dirtyduck data necessary? I could more easily just have the triage db command use a vanilla Postgres image, at least for now.

nanounanue commented 3 years ago

Manifests are a big problem, I solved this issue in the pastñet me check my notes ...

On Sat, Apr 10, 2021, at 16:07, Tristan Crockett wrote:

@shaycrk https://github.com/shaycrk So far I've been unable to get the contents of the dirtyduck folder to show up in the installed package, which is necessary for that docker image to be buildable.

But maybe I don't need to. For this purpose, are PostGIS and the dirtyduck data necessary? I could more easily just have the triage db command use a vanilla Postgres image, at least for now.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dssg/triage/pull/840#issuecomment-817202257, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADYXQFGQVSASH2JLV3G6P3TIC4XXANCNFSM42DCBEOA.

thcrock commented 3 years ago

@nanounanue I spent some more time on the Manifest. Apparently the 'dirtyduck' directory is successfully getting into the 'sdist' (what the Manifest.in is supposed to do) but is not getting copied to the site-packages/triage directory. This is despite 'include_package_data' being set to True in setup.py

thcrock commented 3 years ago

@nanounanue Here is the deal with the manifest. The 'dirtyduck' directory wasn't getting into the build directory despite the MANIFEST.in and include_package_data=True combo because it's not within a package directory. If I move it to src/triage, it works.

I tried this based on this reply on a related issue: https://github.com/pypa/sampleproject/issues/30#issuecomment-426993883

Moving it into that directory may change some other things. Before deciding to do this, I think we should confirm that we want the database provisioner to actually include the dirtyduck data scripts. If the user types triage db up, should they get a vanilla working postgres database or should they get the food database sample data (including the one or two minutes it might take for the system to download the inspections data and load it)? @shaycrk

shaycrk commented 3 years ago

Thanks @thcrock! I've definitely been thinking about this mostly as motivated by making the dirty duck tutorial a little more seamless (as well as making it possible to run after a pip install triage without having to clone the repo), though I can certainly see some utility in providing an empty database, too. My sense is it might be nice to support both unless that seems unreasonable (e.g., maybe triage tutorial up vs triage db up?). Anyway, we also have our planning meeting tomorrow, so might be good to talk it through then if you can both make it?

thcrock commented 3 years ago

Hey @shaycrk this is ready for review again, with just the vanilla db.

One thing you'll note is that the docs/sources/index.md changed a bunch. This is because I ran 'manage docs', which copies README.md to docs/sources/index.md. This is something I put in a while ago to solve the problem of either 1. having content split between the documentation site and the README or 2. having the README be a stub and punt the user to the documentation site.

We can iterate on this process depending on what the team wants, but for now it looks README and docs/sources/index.md were out of sync. I'm not sure which version of this page correct. README was modified much more recently than index.md, so I'm guessing README is the right version to use for now, and if that's true this PR syncs them up again. But check it out in case anything is wrong.

shaycrk commented 3 years ago

Thanks @thcrock! Added a few inline comments, mostly around things like docs and naming, but looks good to me overall

thcrock commented 3 years ago

Thanks @shaycrk I just uploaded some changes based on your review

shaycrk commented 3 years ago

Thanks @thcrock -- looks good to me! Feel free to merge if you're happy with it