datamade / how-to

📚 Doing all sorts of things, the DataMade way

Alternatives to custom data management interfaces #35

Open jeancochrane opened 4 years ago

jeancochrane commented 4 years ago

Clients often ask us to build them a custom data management interface (CPS SSCE, SFM, Erikson, BGA, etc.). Very rarely does the client wind up using the interface successfully without substantial help from DataMade.

Investigate alternative approaches to this problem: choose one, and refactor an existing project to use a different data CMS. Some possible approaches might be:

fgregg commented 4 years ago

LOVE IT!

hancush commented 4 years ago

I stumbled across this one time and was intrigued. https://github.com/wq/django-data-wizard

Django Data Wizard is an interactive tool for mapping tabular data (e.g. Excel, CSV, XML, JSON) into a normalized database structure via Django REST Framework and wq.io.

jeancochrane commented 4 years ago

I took a quick look at django-data-wizard in the UofM case. I didn't have much success, but it's possible I'm just not understanding it.

The major problem I ran into is a non-intuitive workflow coupled with a lack of docs. The only documentation I could find is the README of the repo; it provides some screenshots of the app under an "API documentation" section, but it doesn't actually explain what the major entities are or how to configure them to perform an upload from start to finish. There are a couple of different entities available for editing, including Identifiers, Runs, Serializers, and Loaders, and after reading the docs and poking around for half an hour I still don't understand how they relate to one another.

Another issue is that the docs appear to imply that the data wizard assumes a 1:1 relationship between a spreadsheet and a particular model instance:

The default FileLoader can be used with any Django model with a FileField named file.

This would mean that we can't upload additional spreadsheets and use them to create new entities. It also would mean that a spreadsheet has to fully encompass the attributes of the model instance. We might be able to get around these requirements in the UofM case, but it seems restrictive and non-intuitive to me.

One thing I did really like is that the uploader can use Celery as a task queue backend, but by default it actually just uses Python threads. This is pretty compelling to me, particularly in a containerized deployment context where we'd need entirely separate Celery and Redis containers if we wanted to run a queue. Multiprocessing may actually be a better choice than threads if we can pull off the complexity, but I'm definitely inspired to try this approach for UofM.
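To make the threads-instead-of-Celery idea concrete, here's a minimal stdlib-only sketch of the general pattern (this is not django-data-wizard's actual internals; all names here are hypothetical): kick off the import in a background thread, hand back an id, and let the frontend poll a status registry.

```python
import threading
import uuid

# Hypothetical in-process task registry. A real app would persist
# status somewhere visible to every web worker, not a module global.
TASKS = {}

def run_in_background(func, *args):
    """Run func in a daemon thread and return a task id for polling."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "running", "result": None}

    def wrapper():
        try:
            TASKS[task_id]["result"] = func(*args)
            TASKS[task_id]["status"] = "done"
        except Exception as exc:
            TASKS[task_id]["status"] = "failed"
            TASKS[task_id]["result"] = str(exc)

    threading.Thread(target=wrapper, daemon=True).start()
    return task_id

def import_rows(rows):
    # Stand-in for a slow spreadsheet import.
    return len(rows)
```

The obvious trade-off versus Celery is durability: these threads die with the web process, so a long import can be lost on deploy or restart, whereas a queue with a broker survives.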

jeancochrane commented 4 years ago

Note that we're trialling Airflow for LA Metro; we may have different thoughts about this after that project is complete.

hancush commented 4 years ago

Barring an all-in-one solution, I think it would be useful to chunk off work from this issue based on pain points from past projects. One big one is schema validation and, beyond that, useful error messages for users when their data doesn't meet expectations. This was part of the IHS data import, and it will play a role in the Dedupe.io API refactor and BGA Salaries Phase II.

I'd like to spend a day working with the schema validation libraries I outline here [internal link], selecting the one that's most comfortable, and setting out a few patterns for common operations, e.g., defining a schema and handling exceptions when expectations are violated.
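Whichever library we land on, the pattern could look something like this stdlib-only sketch (the schema and field names are made up for illustration): declare expected types per column, validate every row, and collect messages a non-technical user can act on instead of raising on the first failure.

```python
def validate_rows(rows, schema):
    """Check each row dict against a {column: type} schema and return
    a list of user-facing error messages (empty list means valid)."""
    errors = []
    for i, row in enumerate(rows, start=2):  # row 1 is the header
        for column, expected in schema.items():
            value = row.get(column)
            if value is None or value == "":
                errors.append(f"Row {i}: missing value for '{column}'")
                continue
            try:
                expected(value)
            except (TypeError, ValueError):
                errors.append(
                    f"Row {i}: '{value}' in column '{column}' "
                    f"should be a {expected.__name__}"
                )
    return errors

# Hypothetical schema for a payroll-style upload
SCHEMA = {"name": str, "salary": float, "start_year": int}
```

The useful part is the shape of the return value: a flat list of plain-English messages keyed to spreadsheet row numbers, which is exactly what we'd surface in an upload UI.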


Separate, but related: @jeancochrane mentioned schema locking and foreign key import as functionality to investigate in Airtable, potentially as a replacement for the Google Sheets workflow we've used a couple of times.

hancush commented 4 years ago

There are a lot of really smart things going on in @jeancochrane's importer for the Minnesota Elections Archive: https://github.com/datamade/mn-election-archive

Chief among them is a nice mechanism for logging and reporting errors to the user. I also like that they defined extract, transform, and load management commands, so the processing steps remain separate, but we can still tackle them in Python rather than teeing off a Make process.
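The separation described above could be sketched like this (pure-Python stand-ins, not the actual mn-election-archive commands): each step is its own function, so a management command can invoke any one of them independently, and bad rows are logged and skipped rather than halting the run.

```python
import csv
import io
import logging

logger = logging.getLogger("import")

def extract(raw_csv):
    """Parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Normalize types, logging and skipping bad rows instead of failing."""
    clean = []
    for i, row in enumerate(rows, start=2):  # row 1 is the header
        try:
            clean.append({"office": row["office"].strip().title(),
                          "votes": int(row["votes"])})
        except (KeyError, ValueError) as exc:
            logger.warning("Skipping row %s: %s", i, exc)
    return clean

def load(rows, db):
    """Write rows into a stand-in store; a real command would hit the ORM."""
    db.extend(rows)
    return len(rows)
```

Because the steps only pass plain data structures between them, each one is trivially testable on its own, which is a big part of what makes this style easier to follow than a monolithic import script or a Make pipeline.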

Overall, the code is simple, elegant, well-organized, and, as a result, easy to follow and use!

I'm curious how the client will find this interface. If he likes it, I'd be interested in using some research time to spike out some of these patterns in the payroll import. It'd be super high value to the client to be able to manage that data themselves on a recurring and reliable basis.

jeancochrane commented 4 years ago

We might consider taking a look at https://django-import-export.readthedocs.io/en/latest/.

jeancochrane commented 4 years ago

@fgregg will write up some thoughts on splitting this up into the key requirements for a system.

fgregg commented 4 years ago

So: the client wants a custom data management system. This has typically meant two things.

  1. The client wants a process for staff to capture and edit small pieces of structured information.
  2. The client wants to be able to bulk add or update information, typically from a spreadsheet.

What other modes are there?

Let's first talk about 2.

Here's my imagination of what the client wants.

TK

beamalsky commented 4 years ago

I had a case matching scenario 1 from @fgregg's comment above with Lugar. We have a Congress model in the app with inactive_days and footnote attributes, and I wanted to give Lugar staff a way to directly edit both. This was for existing Congress objects, but would also work for adding and deleting them one at a time.

This access was quick and easy to set up with Wagtail's native ModelAdmin module. You can see that implementation here. I added a section on Editing Congresses to the existing CMS instructions, and they were able to get it working with no further support. It was great to be able to expand an editing interface that Lugar is already comfortable using, and ModelAdmin blends in well with the rest of the CMS.

If they'd needed a bulk import process of any kind, this would have been more complex. It looks like ModelAdmin can support foreign key relationships, but I haven't tried to set that up. For the relatively simple need of editing a couple fields on a single model for a Wagtail site, this approach worked great.

jeancochrane commented 4 years ago

@beamalsky is investigating Airtable with South Side Weekly; if it happens, we'll see if there are any lessons for this issue.

beamalsky commented 4 years ago

I made SSW a Gatsby site with an Airtable backend and it worked beautifully! Repo here and final site here.

This was a case that falls under category #1 of Forest's comment above.

The Weekly is already very comfortable using Airtable, so this stack was ideal for them. gatsby-source-airtable is a sophisticated Gatsby plugin that worked pretty much out of the box for pulling data in from the Airtable API, and I was able to get photo upload and rich text columns (!) working smoothly. Airtable's method of storing photos as attachments in a spreadsheet cell is kind of a mind bender, but it worked well with gatsby-image to take advantage of built-in photo features like auto-resizing and loading animations. The rich text processing was also simple to set up with gatsby-transformer-remark but is finicky: a misplaced space character can easily break the formatting.

This is what the backend Airtable looks like:

[Screenshot: the backend Airtable base, 2020-06-25]

Unlike a CMS like Wagtail, there's no easy way for users to update the Airtable and have the change reflected immediately on the site. This is also a limitation of something like Google Sheets; for the CPS SSCE dashboard, we've just set up a cron task that pulls in new data from the sheet every night. While the Weekly was actively editing the Airtable with new content, I set up a GitHub Action to rebuild the site every 5 minutes, but that would run through Netlify's free build minutes very quickly if left on all the time. In retrospect, I could have given them access to the Netlify account and shown them how to rebuild the site whenever they wanted to see an update. I think that approach would be worth trying for a relatively tech-comfortable client but would probably be alienating to others.
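For reference, the rebuild-on-a-schedule setup can be a very small workflow. This is a sketch, not the workflow actually used for the Weekly: the secret name is hypothetical, and the 5-minute cron matches the schedule described above.

```yaml
# Hypothetical workflow: trigger a Netlify build hook every 5 minutes.
# The hook URL would be stored as a repository secret.
name: rebuild-site
on:
  schedule:
    - cron: '*/5 * * * *'
jobs:
  rebuild:
    runs-on: ubuntu-latest
    steps:
      - run: curl -X POST -d '{}' "${{ secrets.NETLIFY_BUILD_HOOK }}"
```

Swapping the cron expression for a nightly schedule (or deleting the workflow) is how you'd throttle it once active editing stops.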

An attractive part of Airtable is that it has a concept of foreign keys for linking multiple tables. That seems better for us than something like Google Sheets, which has no built-in way to link records. However, the Airtable API can't currently return true nested GraphQL queries: the relationships get lost, and the foreign key field just returns an ID. I didn't need multiple tables for this project and haven't explored this deeply, but it would be key to adopting Airtable in a project with a more complex data structure. David Eads is currently exploring BaseQL to improve on the Airtable API's handling of foreign keys.