code-corps / code-corps-api

Elixir/Phoenix API for Code Corps.
https://www.codecorps.org
MIT License
235 stars 86 forks source link

Epic: Implement GitHub integration sync #1368

Open joshsmith opened 6 years ago

joshsmith commented 6 years ago

Problem

We need to reliably and concurrently sync tasks and comments to and from GitHub in a non-blocking way.

By providing a timestamp-based concurrency control system we can use a known algorithm to make our GitHub integration more robust.

More importantly, we will be able to unblock our other objectives. We cannot proceed with onboarding projects or volunteers unless GitHub sync is stable, since our overall strategy depends on us connecting volunteers to tasks.

Tasks

In scope

Out of scope

Outline

We would have a sync operation for each type of internal record we want to sync. For example:

Every sync operation record, regardless of type, would have a:

Then each type would have type-specific fields, e.g. a CommentSyncOperation would have:

If the event is due to the resource being created, there will not be a conflict. If the resource was created from our own clients, then there is no external GitHub ID yet. The same is true of events coming in from external providers and there is no internal record yet. I'm not yet clear as to whether we should conduct any conflict checking on these event types, but my guess is no. It should likely jump straight to :processing.

When an event comes in from GitHub we should (using a github_comment as our example):

We would also need within the logic for updating the given record to check whether the record's updated timestamp is after the operation's timestamp. If it is, then we need to bubble the changeset validation error and mark the operation as :dropped per the above.

Some upsides of the approaches above that I wanted to document, in no particular order:

joshsmith commented 6 years ago

A thought on the queue: perhaps we could spawn a task every time a operation record is enqueued or processed. The task looks for operation records to process and exits if none are currently able to be processed.

We would want to process every operation that does not have a operation for the same record being processed currently.

By way of example, assume the following operations:

  1. github_issue_github_id: 1 - processing
  2. github_issue_github_id: 1 - queued
  3. github_issue_github_id: 2 - queued
  4. github_issue_github_id: 2 - queued

If we were to spawn a task now, we would get the following:

  1. github_issue_github_id: 1 - processing
  2. github_issue_github_id: 1 - queued
  3. github_issue_github_id: 2 - processing
  4. github_issue_github_id: 2 - queued

Assume 1 now moves to processed. Then our task would we would now move 2 to processing while 4 still remains queued:

  1. github_issue_github_id: 1 - processed
  2. github_issue_github_id: 1 - processing
  3. github_issue_github_id: 2 - processing
  4. github_issue_github_id: 2 - queued

Later still, 3 moves to processed, so then 4 moves to processing:

  1. github_issue_github_id: 1 - processed
  2. github_issue_github_id: 1 - processing
  3. github_issue_github_id: 2 - processed
  4. github_issue_github_id: 2 - processing

And so on.

The queue is therefore asynchronous across resources but synchronous per resource.

joshsmith commented 6 years ago

Tightly grouped events happen fairly frequently, particularly when issues/pull requests are labeled, assigned, etc at time of creation:

The image above shows multiple "simultaneous" events occurring on the same pull request.

We'll likely need to consider some type of ordering-system that is not timestamp-based solely, but some combination of the timestamp, event, and data that changed. In the short-term, it can probably be resolved by dropping those events and simply using the fetch of the remote data to make our changes.

joshsmith commented 6 years ago

Might I suggest a quick name change here: instead of TaskSyncTransaction we could call this a TaskSyncOperation. Thoughts?

begedin commented 6 years ago

Operation means less overlap and confusion when discussing actual transactions, so I'm all for it.

joshsmith commented 6 years ago

I've updated the bits above to use operation over transaction everywhere, except the Ecto.Multi case.