rye commented 6 years ago

Premise

Our current data entry style for data sucks. (Okay, it's not that bad, but I have a love-hate relationship with it which is leaning more towards the "hate" side of that spectrum.) Other contributors have voiced their opinions about it, which have ranged from neutral responses to fiery contempt.

What is "data"?

Our data is broken down into two categories: live and not-live. Examples of "live" data include: bus schedules, (and, potentially, bus locations) and building hours. These are generally updated in batches with no clear diff to be followed and are a pain in the ass to update by hand. Not-live data are more nuanced, such as pictures of objects or places, definitions, and the like.

Current data management strategy

Data exist as .yaml, .md, etc. in a folder called data/ and are bundled whenever the app is built. They are also fetched from GitHub Pages, but the fallback is to whatever was bundled with the app. This means that a User may end up getting data that is up to six months old if, for instance, they are using v2.5.2 and in a zone with poor network performance.

In order to get updated in production, data files must be changed through the normal PR approval process, which consistently takes >10 minutes to build, and probably longer to get approved.

Proposal

I propose the following changes to the data strategy for this app.

First, we build an external data source which is completely decoupled from this app repository. One candidate is api.frogpond.tech, which already exists.
- As part of this change, we adjust the functionality of the "suggest building hours" interface to instead send requests to this external data source.
- If the external data source is a server, (which I prefer) we should send a POST request.
- If the external data source is still a Git repository, we still need a seamless way of having suggestions be queued for merging.
  - One option is to build a GitHub/GitLab bot which automatically takes POST requests and turns them into PRs.
We update the built-in in-app fetching to persist newer data and have it replace older data. This way, if a client loses connectivity, they still have somewhat up-to-date hours.
We update all urls in the app to point to this new data source and verify that it is end-to-end functionally identical.
We release v2.6, making these changes widespread. (Timeline: by the beginning of 2018 Q4, i.e. September.)
- -

Questions? Comments? I would like to start moving forward with this idea beyond the nice-to-have phase.

hawkrives commented 6 years ago

Before we get into the technical discussions, I’d like to outline what I’d like in a server solution (ignoring how the data is actually stored on said server; it could just as easily be SQLite, Postgres, or flat yaml files in a git repo).

Web-based editor for the data, with structured data input (text fields, repeatable chunks of forms [e.g., building schedules])
Historical archive of all changes
Queue for proposed changes, so that the app can submit them (anonymously?) and we can approve them
To go along with that, notifications of additions to the queue, so we can approve them promptly
Public API to query the current version of data and the historical diffs
Versioned data? I.e., I’d like to version the REST endpoints so that I can write a converter from the internal data structures that may change over time, to the historical data formats that may be expected by old clients. This doesn’t need to be part of thing, I suppose, but it’d be nice if it was capable of versioning data. I assume most server solutions would allow us to set up custom routes.
Open-source server (of course)

Our current solution, while not great (to say the least) gets us a historical archive, a queue (and could be automated with a bot), email notifications (thanks be to GitHub), and technically a web-based editor for the data.

Oh, and it’d be really nice if there was a way for someone to just download “everything” at once, instead of querying by data type? I suppose that could just be a new endpoint, for a server/db solution.

(Most of my concerns around historical data and data export are with regards to people who might want to see, say, the history of Holland’s open times, to link it with their project on … er, declining usage of the study spaces. Idk. I generally feel that if you have easy ways for the data to be spit out, and historical versions of the data too, then interested researchers can potentially get it and massage it for whatever their project is. Idk.)

Got a little rambly. Sorry.

rye commented 6 years ago

SQLite, Postgres, or flat yaml files in a git repo

Any one of these backing storage media sound fine to me.

Web-based editor for the data, with structured data input (text fields, repeatable chunks of forms [e.g., building schedules])

This is a must-have for any solution, since I think that's the biggest problem with our existing solution. We have no easy way of turning "table of hours" into "yaml file that is ready to ship." Once we solve that, everything else is just implementation details.

Historical archive of all changes

I do see merits to this. I wonder if we could export these from whatever storage to a backing store? For instance, store the live data in a database which represents the tip-of-tree of the history, but then we just commit the changes in a Git repository whenever they get made?

Queue for proposed changes, so that the app can submit them (anonymously?) and we can approve them

I want this feature, too. This would encourage people to actually report problems, and anonymity and reduction of friction would help us with that.

To go along with that, notifications of additions to the queue, so we can approve them promptly

Postfix is easy, and we could even set up our server as a send/receive SMTP server.

Public API to query the current version of data and the historical diffs

Could these be separate stories? I'm thinking of the current version of the data and the history having very different use cases. If our backing storage for history is a Git repository, I could see us wrapping that in an API front-end.

Versioned data? I.e., I’d like to version the REST endpoints so that I can write a converter from the internal data structures that may change over time, to the historical data formats that may be expected by old clients. [...]

I will leave that to you. I don't know exactly what we would need this for besides rigid backporting.

Open-source server (of course)

Naturally.

You know, the more I think about this, the more I think that our solution here could just be something that fronts a Git repository with a bit of translational magic; there's not a significant need for a database, which would really just be acting as a cache. The big thing I don't want to lose, though, is any functionality that may provide.

I also think the new implementation should serve all data needs. It should handle everything, from building hours to dictionaries, all under one unified umbrella. Clients should only submit the new versions of each thing, and I think that change proposals should be batched somehow.

Here's an idea for handling data submissions:

Client submits POST /proposals to get a new proposal ID.
Client submits POST /proposals/:id/:path/:to/:data: to add changes to their proposal.
Client submits POST /proposals/:id/commit when they have made their changes. At this point in time, Client's changes are committed into a Git commit on a unique branch for their proposal.

Heck. This could get done with a GitHub bot, even. It'd just need to have push access to branches and the ability to open PRs containing summaries. We still would need a remote server to do this, but this would cover submissions.

Then, to handle fetching data from the repository, a similar schema would be exposed: GET /:path/:to/:data. This way, /proposals mirrors normal data endpoints and is easier to understand. We would keep a server-side copy of the repository which is where the data that would get served would come from. Also, this would make potentially exposing endpoints for gathering historical data easier.

We could then use Redis with locking to keep a cache of data so we're not touching disk for every request.

Thoughts?

untrivializer[bot] commented 6 years ago

is easy

Did you mean, "might be straightforward, but could have unforseen complexities that would completely change the prioritization of the issue, so we should let it go through the normal planning and estimation process"?

hawkrives commented 6 years ago

I do think we should distinguish, real quick (I think we're on the same page, but I just want to say it anyway) to distinguish between "owned" data and "proxied" data.

That is, this is our Owned data:

hours
contacts
dictionary
webcams list
radio urls
map data (maybe?)
bus schedules
help tools

This is our Proxied data:

menus
calendar
streams
orgs
news
directory (if we make a custom view, maybe not even then, not sure if we want to be able to listen to that)
student orgs
student work positions
course data??? (It's nice having this offline)

So, I concur that all of our Owned data should be managed through this tool.

I'm good with figuring out some sort of caching, but that doesn't necessarily mean we need it for the first take.

I think we could very easily do a git repo as a backing store. That sounds fine to me.

rye commented 6 years ago

As it turns out, ccc-server has endpoints which proxy data requests, which solves the initial AC of getting data fetching untied from this repository. The near term step is going to be to rewrite all of our URLs to match this new scheme, then we can look into splitting this data out into a separate server.

StoDevX / AAO-React-Native

New data fetching strategy #2735

Premise

What is "data"?

Current data management strategy

Proposal