jdalt commented 4 years ago

To better serve our users and our Lizards, we need a richer way to create, update, and publish data.

Background

Our backend is currently Google Sheets. It actually works pretty well. It’s perfect for hackathons. However we’re outgrowing it.

Google Sheets Problems

Editing data in a massive single row is hard on the eyes and the brain.
The sheet is fragile - we’ve broke the public sheet several times. Migrations require down time.
Keeping certain data private to Lizards is tricky. The published Spread Sheet requires multiple import statements to handle gaps in the columns.
We want to model more things than we’re currently able to model - sites hours for each day of the week, services offered, a validated list items offered and accepted, and site contact information for internal use.
Providing self serve data updates by site operators/representatives would be difficult. It would probably require and fairly complex API shim do it in Sheets. The Twilio system gets us closer to self serve, but we could take that even further with a real backend.

These problems are a big deal and they limit what we can do to serve the community. For this reason we need a real backend.

However I also want us to take a moment to remember what Google Sheets provides a solution for:

Google Sheets Solutions

Query API that scales (web server, caching)
Collaborative data editing (interface + database)
Authentication

These solutions are a big deal and any backend technology that can replace Sheets has to solve 1-3 at a high level of quality before we can even begin to solve for the problems. Notably we need a backend that can absorb a fairly large amount of traffic. We need backend that we can train a lot of volunteers on fairly fast.

In imagining our backend we also need to think about how we'll maintain it, where we'll host it, how we'll pay for that hosting, and how we'll teach and mentor new developers as we build it.

Any option needs to answer for what google sheets solves, the new problems that we want to solve, and the operational concerns of running it.

Not let's talk about our options.

JamesMGreene commented 4 years ago

I don't have firsthand experience with it but I've heard good things about using [Airtable]() as a relational database with a spreadsheet-like UI, an API with Node and browser SDKs, and apps for Windows, Mac, iOS, and Android.

This project would probably also be fine with the free plan as far as I can tell (max 1200 records per database, unlimited databases), though they do at least offer a 50% discount for Plus/Pro plans for nonprofits if that is deemed necessary. 🤔 🤷‍♂️

Some advantages and disadvantages to kick discussion off:

Advantages:

Relational database will allow for easily building linked relationships between separate "sheets"
Spreadsheet-like UI will hopefully keep data entry easy
Has an API
Has an SDK for both Node.js and browser JS
Free plan will probably suffice; a 50% discount is also available to nonprofits for Plus/Pro plans
Available apps for Windows, Mac, iOS, and Android to make administration even more convenient

Disadvantages:

Relies on a database hosted by Airtable
- Cannot setup a local database for testing
- Cannot be hosted independently
If the free plan does not suffice, the per-user cost could add up quickly despite the available discount 😬
Vendor lock-in, can only be exported to CSV

ThePrismSystem commented 4 years ago

I wanted to throw out another option that I think is worth exploring. Directus is an open source headless CMS with both an admin interface and an API "out-of-the-box". The feature set that it comes with is very in-depth, and it supports multi-tenant user permissions with publishing restrictions (rare for an open-source headless CMS from my experience). Docs are here. It even supports adding custom field types and interfaces with a well thought-out methodology.

I spun up a small DigitalOcean Droplet, and created collections that replicate the current Google Sheet structure fairly well. Added 4 of the locations as examples. Details are below for anyone to review:

Admin Interface Login

Owned Location(s) Admin Credentials - This user role is set up to only be able to see and edit their own location(s), along with creating new locations.

Email: single-location@user.com
Password: gFNxBv4$OMYSyd25Yg

All Locations Admin Credentials - This user role is set up to be able to see and edit any location, along with creating new locations.

Email: admin@user.com
Password: gbDTF#bwZ60Tu28#hk

With the 4 locations that are currently added, an example API output from the RESTful API is here.

I feel like this product is "fully-baked" enough to satisfy a lot of the basic requirements (security, user permissions, DB structure, RESTful API, etc.), but extensible enough to allow the project to customize as much as needed. I have some ideas related to hosting as well, but will need to detail those out tomorrow.

Interested to hear others' thoughts after looking at the docs + the example above.

jdalt commented 3 years ago

Proposed technology:

PostGraphile

Approach: I think we should use PostGraphile to build a thin GraphQL application layer in front of a PostgreSQL database. In the near term we can migrate the data from the GoogleSheet to the PostgreSQL db on a 1-10 minute basis. We should build a Lizard edit interface for the new Backend in React to generate our preferred Frontend architecture. Once the new Lizard edit interface is ready, we should decommission the Google Sheet and run completely on top of the new backend.

Advantages:

High data modeling flexibility with low application layer complexity. We need a backend for data modeling and PostgreSQL is quite possibly the most flexible and robust Open Source data store in the world.
We can use Netlify's lambda functions to deploy the application layer and keep our monitoring and operational complexity low.
The project will be easier to fork to other cities while running "mostly" on Netlify. We could make this extremely forkable if we made the database multi-tenant.
Building skills in PostgreSQL and deferring more logic to the database allows us to turn some of our Lizards into PostgreSQL DBAs. We have a lot of skills in Google Sheets and future of the backend should continue developing those skills. Skills in PostgreSQL are very valuable general skills that should benefit anyone that develops them.
Delegating as much work as possible to the database matches our current skills better than a thick application layer. Help at the application layer is spiky whereas Lizard help is fairly constant.

Disadvantages

We need to operationalize our own PostgreSQL database. There will be additional monthly cost.
This is a new technology and we don't have a huge skill base in it.
We'll need to eventually build an edit interface for Lizards (long term advantage, short/medium term disadvantage).
We'll need a data migration that can run to constantly to fill the db until we have a Lizard edit interface. There's going to be a lot discovery in what the nuts and bolts of the Lizard process is in Google Sheets and we need time to develop it.

I think the advantages of flexibility, mentorship, operational costs, and the potential forkability are a pretty good set of tradeoffs. My preferences bias me towards data modeling. I think this solution has higher up front costs, but a really good long term set of costs tradeoffs. I also think it's very important to consider that the technologies we choose are the skills that we'll be building and mentoring in the collective. These skills should be flexible and valuable in all our future professional, open source, and civic hacking endeavors. That's why I'm willing to sacrifice higher up front costs for long term flexibility at what I hope is a low long term maintenance cost. Relative to the other proposals this is more databasey, less cmsy, and more open source. There's more initial hand rolling (database provisioning, Netlify work) but I think that will be buried rather than enduring complexity.

adyates commented 3 years ago

There's a lot of interesting research here, but I think the discussion has started to drift away from what looks like the central problem which seems like it's this:

How do we make a flexible data layer while net positively impacting both Lizards and site stability, given tight resource constraints?

So far, most of these are talking about technology against abstract requirements without deeply connecting to the systems that exist now.

To that end, your costs (e.g. time, human capital, because it's what exists) probably get grouped to something like this:

Hosting: If you need to transition to something hand rolled, Hosting isn't as big of an issue. Heroku's free tier should be more than enough to handle immediate needs on a regional basis, and Teams is free for up to 5 admin users with any number of collaborators (if needed). New Relic also now has a Free Forever option if you want an actual alerting stack (for one user, but I suspect there is some amount of password sharing going on).
Datastore Development: Transitioning/Syncing from a CSV to a database is not particularly difficult. This is junior data engineering work, and there are a number of tools that could do this, some even for free, based on usage
API Development: This will go without saying as a cost if you choose to adopt a new database (otherwise you can't interact with it), but I'm sure there are ways to minimize it. Moreover, the API itself can (and should) be done independently of any of the Sheets migration efforts. This should consider both the time to generate the API as well as the time it takes to document and use it well.
UI Development: You're going against the collaborative abilities of G Suite. It's not best in class, but you're going up against the engineering efforts of a dedicated product team if you want to enable in-app collaboration. Viewing information is fairly okay, but collaborative editing is rough.
Retraining: Transitioning the Lizards to any new UI will take time, regardless of what the UI is. I can't comment on "churn" of Lizards vs Devs, but you will be throwing out all of the knowledge built up around Sheets in order to do this as well.

There's also three systems to consider (as I've been able to derive so far from @mc-funk and the OP):

Master Table (for Lizards)
The derived Public Table (for the site)
(TBD) Self-management of profiles

Any solution that requires swapping out both at the same time is probably a not a good idea.

I generally have no real opinion about GraphQL vs REST/JSON/HTTP APIs, except for this: GraphQL solutions inherently pass the responsibility of data and schema to the client and require a silly amount of operational overhead on the backend to monitor on an operational level because you're effectively passing the responsibility to the UI to make sane queries through to a datastore layer. Standard JSON APIs, especially if you already know what you're returning and to who, are far easier to reason about, document, and monitor.

Lastly, although I appreciate the skills growth aspect of the overall effort, I don't think "opportunity for growth" is a reasonable justification for operational complexity or using a technology. There's a reason people still hack out marketing sites with nothing more than jQuery. So I'm also discounting that when I do my analysis below.

So, that said, I'll post some up front thoughts about each system as it relates to each solution for level-of-effort vs value, then I'll voice some opinions on what has been proposed so far on a per-layer basis:

Master Table

Airtable For the most collaborative part of the process, with multiple editors, Airtable honestly beats out the other options in terms of configurable views and the ability to limit certain forms of access.

It can strip out the UI clutter and hosting concerns, preserves most of the expected functionality of working in spreadsheets, while also eliminating some of the major pain sources derived from needing two sheets.

Although Directus is pretty legit, the one thing I worry about is the sheer complexity in how the data layer itself operates.

Handrolling a DB, API, and UI to serve as a master profile where we have extremely limited numbers of rows and there is a concern about developer availability / support seems a bit much.

Both of non-Airtable options also add actual operational overhead to deployments that can likely only be taken on by developers. That's a real risk.

Public Table

Airtable Airtable has the native ability to make this as a published custom View where all of the unnecessary fields are hidden. If you also want this to be locked from only the most Special of collaborators, it does require a Free version.

If you want the flexibility of a custom UI, you can still make the API layer without a DB simply by shimming against Airtable (Shims are not inherently bad, otherwise many public GraphQL implementations would be bad by default).

Both Postgraphile and Directus would require you to build the entire UI from scratch up front.

Unless I'm missing something, the API for any public version would generally not need to change too much and consists of only minor variations on the endpoint (assuming client-side filtering).

Self-editing Profiles

Handroll It's probably easiest to simply make a simple app that allows someone to "claim" a row, and use that app to make API calls that modify the particular row they have access to. Authentication doesn't have to be tied to the main data source, and it simply becomes a question of what fields to expose and how, and this definitely doesn't need to store state.

If you wanted to, you could go the Postgraphile approach. However, the data structure isn't complicated and you would still need to do some form of authentication, so I'd actually recommend (in rare form to be frank) just building on top of Firebase (or, if you still wanted Postgres, Auth0 + Heroku's free tier).

I don't think using Airtable's User management makes too much sense in actual application. It would be awkward to set up user-specific custom views to restrict to a particular row and particular fields in that row.

It seems possible in Directus based on what @Jaszzz posted above, but I wasn't able to see exactly how that was configured on the backend to enable that. Trying to figure it out from the public demo was less than illuminating, and I wonder if it would be as complicated to do as it would be via Airtable.

Airtable

Airtable probably solves a lot of the underlying about Sheets (data protection and integrity via enforced schemas, protected data) while also preserving a lot of the basic things that Sheet's provides by the nature of what it is (collaborative editing, auth, "can look like a spreadsheet"). However, mitigating the cost is probably the largest factor, as you have to stay within the feature set of Basic unless we can get an exemption (with the arrival of Free Pro for COVID-19 uses, there may be a way to negotiate this).

If we somehow get Pro, the feature set on this becomes overwhelming in power and possibility, but I'm only considering this from the Basic version for now.

UI

It has a number of views, in addition to spreadsheet view, that make cutting out excessive clutter for certain groups of related fields easier. For Lizards, this seems like it would be a UI win. If you're willing to take the risk of anyone with edit permissions on the Base being able to modify it, you can also create an embedded view for the site under the Basic plan.

API

It has SDKs already and the API (and associated documentation) is generated by using the field names in the column for any Bases you provision with nominal PATCH support.

Datastore and Modeling

It adds more structure while also allowing for typed fields. At the end of the day, it's a database. It is proprietary and doesn't lend itself to the same level of backup/restore/migration, but CSV export is plenty.

It also has a basic 2 weeks of snapshot/revision history, which shield us from accidentally deleting data or doing anything crazy.

It's worth pointing out that the use of Custom Views in the UI lends this most easily to visually "making domain specific-tables" more so than other options, even if under the hood that is decidedly not the case.

Major downside is that migration/syncing would take some effort in this case, because the only thing we have access to is the API. It's not the worst thing, but bears consideration.

Hosting

It's a SaaS product, so hosting and uptime is as stable as Airtable.

Retraining

Of all the options, this is the closest to Google Sheets with the option for alternative views. Most of the issues will likely revolve around the rows are now connected to each other (as opposed to just being organized text in Sheets in most cases) and the UI being so drastically different.

Directus

It actually looks fairly compelling. Much like Airtable, it has a pre-built UI that would allow Lizards to do the work of editing and creating new locations. Functionally, it would look more like a directory entry, but that isn't all bad as form entry is fairly intuitive as a concept.

UI

You get a default UI out of the box for all the administration on Collections that you need. However, you still end up with the same issue as sheets: Expanding dataset means expanded page. It may not be a problem in itself, but it is a thing.

There is a Layouts concept in the application that can change macro level formatting, but you cannot save a Layout that only contains a subset of information for later use (i.e. only view location hours for all sites).

There is also no direct embedding option.

API

It doesn't have an SDK, however the API is generated by using the field names for the Collections and supports both GraphQL and RESTful style interacing.

Enabling self-management outside of Pro would likely need a separate API layer. But a dedicated API for something that focused (and low traffic) is easily hosted for free just about anywhere.

Datastore and Modeling

It's actually running on top of MySQL / ZendDB, but generally the Collections are what they are. There's a staggeringly deep type system on the fields (you can define password hash fields, for example), but the documentation on how to use the fields (e.g. Translation) is not particularly documented and may require a fair amount of knowledge to figure out what some of the options even mean, much less how to handle them.

I don't know how any concepts like data rollback would work if someone makes a whoopsie and deletes an entire field off a Collection or we screw up a schema change. I suspect you have to do much of the same backup/rollback work at the DB level. No idea how this would work if it was hosted by Directus itself.

Because the DB schema is fairly exposed and may have underlying complexity, migrating/syncing from the sheets into the DB on an ongoing basis may be a problem if you miss hidden connectivity. That may be relegated to an API-based activity, much like Airtable.

Hosting

Either self-hosting or paid. Self-hosting would require the normal DB + Web server provisioning. As a result, you would want to run some additional monitoring on top of it.

Retraining

I suspect that the basic data entry part of this is okay. After all, it's essentially a CMS.

However, administering the Collections and adding new fields may become an adventure in documentation and community support, since the admin interface makes no bones about exposing the complexity of the underlying DB mechanics from the user. That creates a reasonably high bar for accessibility when you have to figure out what the one-to-many relationship between Collection fields can or should be, much less try to onboard someone into that.

Postgraphile

This is probably one of the most hand-rolled options available, even though it's effectively driven by code gen. Although some of the technology is "provided", all the the underlying work needs to be done by devs.

UI

It doesn't exist, so we would need to develop everything from user flows to expected interactions from scratch. Given that we do have notions around Dates, Times, and two current views (with a look at a third), there's a lot of surface area to cover to make this on part with any other solution.

Specific frameworks (e.g. React, jQuery, Vue) aren't particularly important for the purposes of this. But building and maintaining a low-issue, cross-browser UI is a non-trivial amount of effort. Regardless of which path you choose, you're looking at at least three separate views.

API

Although we get a GraphQL API out of the box, it's being generated from our DB schema. I'm not sure that you want to couple the schema directly to the API without abstraction, just as a general rule. I'm also not totally sure if

However, the API generated is of the form of a web app, not of the form of lambdas, based on what I can tell. If that's the case, using Netlify seems like it would be a bit of a hack, so we would want to use a different hosting option if we want to minimize the surface area of code we're responsible for maintaining.

Also, you still need to roll an entire authentication option here. Even if you pick something like Cloud Identity or Auth0, that's still a user admin interface and an RBAC that needs to be stood up. This is non-trivial if you already know you need row-specific permissions (e.g. to enable self management).

Datastore and Modeling

It's PostgresQL, so there's a lot of room for types and some interesting capabilities with full-text search and PostGIS enabled. However, it's entirely reliant on our ability to model the existing data based on what we have done so far. We would also be responsible for figuring out how schema migrations, backups, and rollbacks would work.

On the flip side, because we control the schema, the initial migration/syncing is probably the most transparent and reasonable for basic ETL tools and scripted dump/restore compared to the SaaS offerings above.

Hosting

As mentioned above, you can do much of this with Heroku's free tier at present. It still requires hosting, however. Heroku does have some basic monitoring out of the box, but nothing at an endpoint-specific level.

Retraining

Rebuilding a spreadsheet UI is likely a non-starter question (i.e. if we need to do this, this is probably a key indicator we are doing the wrong thing), and there is a lot of discovery work that would need to be done if we deviate too drastically from a basic directory view and start trying to codify workflow in the UI.

AndrewReitz commented 3 years ago

I would like to suggest squarespace. https://www.squarespace.com/

It's the easiest way I know of to make a website, and uptime and hosting is all on them so we wouldn't have to worry about all of that. It's can be bit of a pain if you want to do very customized stuff, but I think it might work well for what we are looking to do.

Twin-Cities-Mutual-Aid / twin-cities-aid-distribution-locations

Technical Evolution - Why We Need a Backend #174

Background

Google Sheets Problems

Google Sheets Solutions

PostGraphile

Master Table

Public Table

Self-editing Profiles

Airtable

UI

API

Datastore and Modeling

Hosting

Retraining

Directus

UI

API

Datastore and Modeling

Hosting

Retraining

Postgraphile

UI

API

Datastore and Modeling

Hosting

Retraining