balanced / balanced-api

Balanced API specification.
Transaction export #498

Open ljank opened 10 years ago

ljank commented 10 years ago

Hello,

we'd like to run periodic transaction imports (see https://github.com/balanced/balanced-api/issues/491#issuecomment-33150111 for more details) and currently there are only 2 options:

  1. CSV export, which is e-mailed with a link to S3. Would be super cool if the e-mail part could be skipped :)
  2. API (/transactions or all transaction related endpoints separately). This option is super slow and thus paging isn't very handy (hard to know where to start in order to import only missing transactions).

Are there any plans to improve any of these options? Or maybe there will be some other options? Thanks!

steveklabnik commented 10 years ago

We don't generally have much accounting, though we do have a really good integration with SubLedger: http://subledger.com/blog/rent-my-bikes-demo/

I'm totally open to improving these kinds of things at some point, but I need to understand what the primary use-cases are.

Basically, "synch up my local transactions with all the ones I have on Balanced," right?

ljank commented 10 years ago

We already have almost everything SubLedger does, but we need to be sure we're on the same page on both sides. CSV export fits our requirements very well, except for the e-mailing part and the inability to request a partial export (for some specific date range). An API endpoint with CSV output would be super cool :)

mjallday commented 10 years ago

@ljank would a JSON dump be more suitable? CSV seems unwieldy to me (but that may just be the engineer in me talking, not the get shit done side).

The current issue with the API access (at this point in time) is that getting you all those transactions on large marketplaces takes too long to be an online request. That's why we opted to do it as an offline operation and email it through.

While I'm thinking of quick fixes to your immediate problem: what if we had the CSV renderer callback to your app with the data once it's ready to be consumed? That would remove the need for email and make it easier to programmatically consume the data.

ljank commented 10 years ago

@mjallday sure thing, JSON would be even better (and consistent), just thought about CSV because of current implementation (implementation bias?).

A callback sounds reasonable because of the potentially long data preparation time. However, what if you allowed requesting data for specific date ranges and limited that range to one month (or, depending on marketplace size, to a week/day/hour, etc.)? That way I believe it could be made online without the need for a webhook.
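A minimal client-side sketch of that idea, assuming the export could be requested one bounded window at a time. The function name `split_date_range` and the `max_days` limit are hypothetical, not part of the Balanced API:

```python
from datetime import date, timedelta


def split_date_range(start, end, max_days):
    """Yield (window_start, window_end) pairs covering start..end inclusive.

    Each window spans at most max_days days, so every request stays small
    enough to be answered online (the actual limit is an assumption).
    """
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    if end < start:
        raise ValueError("end must not be before start")
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


# Example: fifteen days split into windows of at most one week.
for window_start, window_end in split_date_range(date(2014, 1, 1), date(2014, 1, 15), 7):
    print(window_start.isoformat(), window_end.isoformat())
```

Each window would then map to one export request, so an import job only needs to remember the last window it finished.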

Also, if the data were selected from a DB replica (designed for random access only), it wouldn't have any effect on production performance. However, I'm pretty sure you already have something like that or even better, so please ignore :)

mjallday commented 10 years ago

We're already on a read-replica for data generation in this case; the main issue with doing it inline is that we can't make guarantees about the render time. For example, we can't say "if this request is constrained by dates and will take 10 seconds to render then do it inline, else call back after the operation is complete".

In terms of ease of implementation my guess at the order would be:

  1. Date constraints
  2. API endpoint (i.e. an HTTP request instead of using the dashboard)
  3. Callbacks
  4. JSON rendering

These sound like good bootcamp tasks for a new engineer!

ljank commented 10 years ago

How about avoiding callbacks altogether? Not sure if it's a good practice, but e.g. the Google Analytics API returns HTTP 503 Backend Error if the response takes more than X seconds and suggests retrying once or simplifying the request query (decreasing the date range in this case). Callbacks, IMHO, are more complicated on both sides, and even more complicated when an alternative scenario needs to be handled ("oops, takes too long, callback later").

ljank commented 10 years ago

Or alternatively you could implement a simplified version of the callback: simply return a URI to the future result, which could then be used to fetch the results when they are ready, e.g.:

  1. POST /v1/marketplaces/<...>/transactions?period_from=2014-01-01&period_to=2014-01-15 → 200 OK, {"result_uri": "/v1/results/<UNIQUE_RESULT_ID>"}
  2. GET /v1/results/<UNIQUE_RESULT_ID> → 503 Service Unavailable with header Retry-After: <estimated_delay>
  3. (retry after estimated delay) GET /v1/results/<UNIQUE_RESULT_ID> → 200 OK, {<transactions>}
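The client side of the steps above could be sketched roughly as follows. This is only an illustration of the proposal, not an existing Balanced endpoint: `fetch` is a hypothetical callable that performs the GET and returns `(status, headers, body)`, and the sketch assumes `Retry-After` carries a number of seconds rather than an HTTP date.

```python
import time


def fetch_when_ready(fetch, result_uri, max_attempts=10, default_delay=1.0,
                     sleep=time.sleep):
    """Poll result_uri until it answers 200, honouring Retry-After on 503.

    fetch(uri) is a caller-supplied function returning (status, headers, body).
    """
    for _ in range(max_attempts):
        status, headers, body = fetch(result_uri)
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"unexpected status {status} for {result_uri}")
        # Assumption: Retry-After is given in seconds; fall back to a default.
        delay = float(headers.get("Retry-After", default_delay))
        sleep(delay)
    raise TimeoutError(f"result not ready after {max_attempts} attempts")
```

Passing `fetch` and `sleep` in as parameters keeps the sketch independent of any particular HTTP library and makes it easy to exercise without a network.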

What do you think?

steveklabnik commented 10 years ago

@ljank yeah, that general idea is good, but it's usually a 202 plus polling:

POST /.....transactions?...

202 Accepted
Location: /blahblahblah

GET /blahblahblah
200 OK
{"status":"pending"}

..... wait....

GET /blahblahblah
200 OK
{ data }
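
A rough client-side sketch of the 202-plus-polling flow above, under stated assumptions: `post` and `get` are hypothetical callables returning `(status, headers, body)` with the body already parsed from JSON, and a body of `{"status": "pending"}` is the only signal that the export is not ready yet.

```python
import time


def export_transactions(post, get, export_url, poll_interval=2.0,
                        max_polls=30, sleep=time.sleep):
    """Start an export, then poll the Location URL until the data is ready."""
    status, headers, _ = post(export_url)
    if status != 202:
        raise RuntimeError(f"expected 202 Accepted, got {status}")
    status_url = headers["Location"]

    for _ in range(max_polls):
        status, _, body = get(status_url)
        if status != 200:
            raise RuntimeError(f"unexpected status {status} while polling")
        # Assumption: anything other than a pending marker is the final data.
        pending = isinstance(body, dict) and body.get("status") == "pending"
        if not pending:
            return body
        sleep(poll_interval)
    raise TimeoutError(f"export not ready after {max_polls} polls")
```

Compared with the 503 variant, every poll here succeeds with 200, so intermediaries and monitoring never see the waiting period as an error.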

ljank commented 10 years ago

ping? :)

ljank commented 10 years ago

anyone? :)