You can see jobs and filter by status here. The default filters aren't useful, and setting them apparently doesn't affect the URL, so you have to select the queue and change the filters every time. The job list has a "total run time" column, which is useful for isolating jobs that ran for a long time but ultimately failed.
When you click into the details for a job, there's a "Status reason" field in the "Job information" box. Some of those say "AnalysisJob terminated via API by rebecca@peopleforbikes.org", so those don't count as failures.
Jobs with status FAILED and a status reason of "Essential container in task exited", like this one, are the interesting ones for identifying problems.
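Since the console filters reset every time, the same triage could be scripted. This is only a rough sketch, assuming boto3 and AWS credentials are available: real_failures and fetch_failed_jobs are hypothetical helper names, the queue name is whatever you would pick in the console, and the cancelled-job prefix is copied from the status reason text quoted above.

```python
from datetime import timedelta

# Status reason prefix for jobs someone cancelled on purpose
# (copied from the console text; adjust if the wording differs).
CANCELLED_PREFIX = "AnalysisJob terminated via API"


def real_failures(jobs):
    """Return FAILED jobs that were not cancelled via the API, longest run time first.

    `jobs` is a list of dicts shaped like the entries in a Batch
    describe_jobs response ("jobId", "status", "statusReason",
    "startedAt", "stoppedAt").
    """
    rows = []
    for job in jobs:
        reason = job.get("statusReason", "")
        if job.get("status") != "FAILED" or reason.startswith(CANCELLED_PREFIX):
            continue
        # startedAt / stoppedAt are epoch milliseconds; either can be
        # missing if the job never started.
        started, stopped = job.get("startedAt"), job.get("stoppedAt")
        if started is not None and stopped is not None:
            run_time = timedelta(milliseconds=stopped - started)
        else:
            run_time = timedelta(0)
        rows.append({"jobId": job["jobId"], "reason": reason, "run_time": run_time})
    return sorted(rows, key=lambda row: row["run_time"], reverse=True)


def fetch_failed_jobs(queue_name):
    """Fetch full details for every FAILED job in a queue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so real_failures() works without it

    batch = boto3.client("batch")
    job_ids = []
    paginator = batch.get_paginator("list_jobs")
    for page in paginator.paginate(jobQueue=queue_name, jobStatus="FAILED"):
        job_ids += [summary["jobId"] for summary in page["jobSummaryList"]]

    jobs = []
    # describe_jobs accepts at most 100 job IDs per call.
    for start in range(0, len(job_ids), 100):
        jobs += batch.describe_jobs(jobs=job_ids[start:start + 100])["jobs"]
    return jobs
```

Calling real_failures(fetch_failed_jobs(queue_name)) should give roughly the list you get by filtering on FAILED and sorting by total run time in the console, minus the jobs that were cancelled on purpose.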
That same "Job information" panel on the detail view has a "Log stream name" field that links to the logs for that task. If you click through to the logs for that failed one, the spot where things went wrong says:
psql:../features/paths.sql:59: server closed the connection unexpectedly
This probably means the server terminated abnormally
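To check how many failed jobs die at the same spot, the log stream could be scanned for that message instead of clicking through each one. A minimal sketch, with some assumptions: find_connection_drops and fetch_log_messages are hypothetical helper names, "/aws/batch/job" is the default Batch log group and may not be the one in use here, and the pattern only matches the exact psql wording shown above.

```python
import re

# Matches the psql error shown above and captures the SQL file and line number.
DROP_PATTERN = re.compile(
    r"psql:(?P<file>[^:]+):(?P<line>\d+): server closed the connection unexpectedly"
)


def find_connection_drops(messages):
    """Return (sql_file, line_number) for each log message where psql lost its connection."""
    hits = []
    for message in messages:
        match = DROP_PATTERN.search(message)
        if match:
            hits.append((match.group("file"), int(match.group("line"))))
    return hits


def fetch_log_messages(log_stream_name, log_group="/aws/batch/job"):
    """Read every message from one log stream (needs boto3 and AWS credentials).

    The log group default is an assumption; check the job's log configuration.
    """
    import boto3  # imported here so find_connection_drops() works without it

    logs = boto3.client("logs")
    messages, token = [], None
    while True:
        kwargs = {
            "logGroupName": log_group,
            "logStreamName": log_stream_name,
            "startFromHead": True,
        }
        if token:
            kwargs["nextToken"] = token
        page = logs.get_log_events(**kwargs)
        messages += [event["message"] for event in page["events"]]
        # The end of the stream is reached when the same token comes back.
        if page["nextForwardToken"] == token:
            break
        token = page["nextForwardToken"]
    return messages
```

Running find_connection_drops over the messages from each failed job's "Log stream name" would show whether the failures cluster on the same SQL file and line.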
We may need to get Postgres to log to stdout in order to see a better error message, since the logs are in the Postgres container, which doesn't get persisted.
This may be difficult because the pg container runs as a background process.
Notes from Klaas: