NYUCCL / psiTurk

An open platform for science on Amazon Mechanical Turk.
https://psiturk.org
MIT License
277 stars 140 forks source link

Why always the need to set DATABASE_URL env var when running on heroku? #391

Closed dhalpern closed 4 years ago

dhalpern commented 4 years ago

Currently, psiturk_config.py looks for os.environ['DATABASE_URL'] if using Heroku and crashes if it can't find it. This variable doesn't get created unless you create a database when setting up your project which you may not realize if you plan on using your own db. The error thrown is fairly cryptic (just something like DATABASE_URL key can't be found) so it took me a while to figure out what was going on. I think ideally this would just throw a warning (e.g. "you haven't created a Heroku Postgres database for this project, please call heroku addons:create heroku-postgresql unless you intend on using your own db") rather than crash the whole thing?

deargle commented 4 years ago

That only gets checked if the local ON_HEROKU is set. That shouldn't be set on the local development environment. It's intended to be an env var signal for if psiturk is actually running on heroku. https://github.com/NYUCCL/psiTurk/blob/578aa1950762d3b220da3a1036ecdeb715e7fca5/psiturk/psiturk_config.py#L71-L74

If you're saying it crashes on heroku, then yeah it'd be better to have a more descriptive error message. But the default sqlite must not be used on heroku because dynos are not persistent. A persistent database must be set.

I suppose that psiturk could rely on the database url set in config.txt, instead of using an env var as long as it wasn't sqlite or localhost. Ostensibly someone would want to override the database url set by heroku, I guess. Something like prefer env var first for DATABASE_URL, and failing that, check if an okay databaseurl set in config.txt. All (most?) of the config.txt settings should be overridable via env var.

On a bigger level, psiturk needs more user-friendly heroku integration.

dhalpern commented 4 years ago

Yeah, I had ON_HEROKU set and was trying to use a preexisting mysql table so it felt like a case where it shouldn't throw an error. The solution of checking env vars first and then checking the config file sounds good to me, can try to make a pull request at some point.

dhalpern commented 4 years ago

Actually, maybe I didn't understand this before but it actually uses the heroku database even if the database in config is different. It seems to me that the one in config should be prioritized...

deargle commented 4 years ago

As an example of best practice, boto3 prefers env vars over config file: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html

Also, with pull requests, please follow the GitHub Flow, where you create a separate branch off of master on your fork for each PR

On Thu, Jan 23, 2020, 6:13 AM dhalpern notifications@github.com wrote:

Actually, maybe I didn't understand this before but it actually uses the heroku database even if the database in config is different. It seems to me that the one in config should be prioritized...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/NYUCCL/psiTurk/issues/391?email_source=notifications&email_token=AAI6Y7O66MXY65CR7UD4TZ3Q7GJZBA5CNFSM4KKNWLM2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJXKHFQ#issuecomment-577676182, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI6Y7NC7XQGOJOP4M4LDS3Q7GJZBANCNFSM4KKNWLMQ .

dhalpern commented 4 years ago

Hmm this seems a bit unintuitive though if you are coming from using psiturk through the ad server where the database in the config file is the one that gets used. In addition, you still need to use a config file even if you set the database in env vars so it seems more straightforward to me to prioritize the config file?

And sorry about screwing up GitHub Flow before, just resubmitted the PR as a separate branch to keep things cleaner.

deargle commented 4 years ago

In addition, you still need to use a config file even if you set the database in env vars so it seems more straightforward to me to prioritize the config file?

Not necessarily -- in theory, the psiturk config file could be present, but completely empty, and psiturk would run. Psiturk uses ConfigParser library, which will fall back on defaults for any entry not present in the user's config.txt. https://github.com/NYUCCL/psiTurk/blob/8530909c75de4a9b3c0d5969dc5452500b2969c7/psiturk/psiturk_config.py#L60-L61

This is mostly how other software packages that I am aware of work also. Most config settings have defaults which can be overridden in a user config.

env variables are more deliberately set, whereas a conf file could have an inadvertent default setting. The master branch of psiturk is the WIP for v3.0, which will have big big changes, including dropping support for the ad server. It's been possible with psiturk v2 for a while now to not use the psiturk ad server. I want to (1) make it easier to use psiturk on other cloud providers, and (2) generalize it away from heroku. e.g., remove logic flags in the code like if ON_HEROKU and replace with a generalized flag to stream error logs, which is the only other place that flag gets used besides the DATABASE_URL check . Without ON_HEROKU, the value on config.txt gets used, which is almost certainly the default server.log, but which no one is going to know how to read because they won't know how to read the server.log sitting on the heorku dyno. running heroku logs is really the only sane option. https://github.com/NYUCCL/psiTurk/blob/05a20329b27e1e677413f689f56e1e0e607019bc/psiturk/experiment_server.py#L96-L100

https://github.com/NYUCCL/psiTurk/blob/05a20329b27e1e677413f689f56e1e0e607019bc/psiturk/experiment_server.py#L78

My purpose with that logic was to be able to handle logs differently from what's in config.txt when running on heroku. That's one of the great uses of env vars -- differential configs depending on running environment. That whole logic in the example above can be replaced with an env_var that is only set on the heroku dyno which specifies the log location. Could be made part of the set-heroku-vars whatever script (which I'm going to move in to be a shell command instead of a standalone py script).

Back to your comment though,

Hmm this seems a bit unintuitive though if you are coming from using psiturk through the ad server where the database in the config file is the one that gets used.

More generally, the logic with the best practice is: if a user set an env var, then they really must want it, so use it. In the specific case of heroku, if a user installed the postgresql addon (which sets DATABASE_URL env var), then certainly they must intend to use it. I think for your use case of not wanting to use postgresql, but rather a mysql url, we can _remove the check to see if ON_HEROKU is set before overriding the database setting with the env var. Rather, if the env var is set for DATABASE_URL, then use it. Otherwise, use config.txt. But again, I am fearing the case where users forget to change away from sqlite, and then they are unable to access their data. I'm not sure how to write logic in psiturk to check "am I running in an ephemeral container, if so, don't use sqlite or localhost database."

tldr; I argue for removing heroku-specific ON_HEROKU env-var checking, and instead, using DATABASE_URL if it's present, but otherwise, fall back on config.txt setting.

again, this is the master branch for v3, and if you're wanting to work around this problem now, you could just do heroku config:set DATABASE_URL=<your mysql url>. I'm hesitant to make a big code change like this on the v2 psiturk branch.

deargle commented 4 years ago

One additional thought in favor of moving away from database_url in config.txt -- it's ideal to not include the database_url in config.txt since it contains username/password, which would end up getting committed to source control. Env vars are better for that sort of thing.

dhalpern commented 4 years ago

ok understood, I'll close the relevant pull request as well.