Merging Metaphysics and plans for the future

craigspaeth commented 9 years ago

This plan of having a web-serving API already solved a big problem (sitemaps) and has alleviated throughput to the formerly most time consuming endpoint on Gravity—which makes the main API overall healthier. So I think those two reasons alone are good enough evidence that these projects are good ideas.

I say let's consolidate Fusion and Metaphysics and lay out some plans for how to make this legit. Here we go...

Name of project & open source

Personally I like the name Fusion better and I think the repo should be open source :smile_cat:. But not strongly opinionated here, leave your comments if you feel strongly opinionated.

ES6 vs. ES5 vs. Coffeescript

Personally would generally like to move web towards ES6 just b/c new hires will likely be more familiar with it than coffeescript and I can't live without destructuring assignment and arrow functions. I think this project can be a good place to start with that. But we're a democracy and would love to hear other's thoughts on that.

Promises vs. Callbacks

I'm really on the fence with this one b/c I love that promises can bubble up errors and provide better stack traces, but I've found the learning curve steep (for myself and new hires) and interoperability really awkward. Also I don't know how practically beneficial the stack traces are when Bluebird mentions they come at a substantial performance penalty when turned on (which we learned is unfortunately not ignorable given Q was the primary cause of memory leak). Without long stack traces I find dropping if (err) return callback(err) all over the place slightly nicer/more obvious than wrapping APIs in promises. That's just where I'm at right now though, totally willing to be convinced of promises.

Database

I love Redis, but I think this project should have a database meant to write to disk b/c 1. I'd like it to solve the sitemaps concern and that involves having GBs of artwork data persisted, and 2. I think ideally it should be able to performantly join a bunch of data without having to hit the API first (e.g. this could mean map-reducing common joined views or application/RDBMS joins across collections/tables). Either way this would be unnecessarily costly on Redis. I also think Postgres/Mongo/some-other-thing are capable of serving fast enough responses for a good while in the beginning (but not apposed to introducing a cache db later).

Replacing application-side caching

Given the point about persisting a lot of stale data, I think we could also remove our Force/MG-side caching in favor of this layer. That would not only cost less :moneybag: it would allow us to consolidate caching logic and allow for building smarter cache invalidation tooling than telling someone to just blow up the cache once and a while. One tradeoff I can see here is that we would hit a perf. penalty simply by transporting over HTTP while a Redis GET is always going to be a lot faster. That said, if this layer can serve one bundled up response per page fast enough than I don't think 100ms http request vs 10ms Redis GET is going to matter.

Authorized views of data

I think initially we should probably just attempt to store public views of data and this layer can join that with stateless responses from Gravity (passing through an access token to request/superagent). Down the line maybe we could store user data bcrypting their access token for lookup, and/or store an admin access token on the app, and attempt to do some stateful authorized data logic in this layer—although that does get sticky w/ duplicating visibility logic b/t Gravity and this. Would love to hear thoughts.

Concluding

That's all I can think of for now, please leave comments/edit this with more thoughts. :boom: :confetti_ball:

dzucconi commented 9 years ago

ES6 vs. JS vs. Coffeescript

ES6 no question

Promises vs. Callbacks

If we’re doing ES6, why not generators?

I think callbacks are fine as long as we deal with them in a sane manner: adhere to the whole Node continuation passing style of (err, callback). + heavy reliance on the async library.

We’d definitely want to build up some toolkit or DSL for dealing with building aggregate responses.

(That said I think the promises code in Force is soooooooo much easier to understand, reason about, and maintain.)

Database

This issue makes me think I don't understand this project actually.

The sitemaps are traversing a flat data model in a controlled fashion (?) But our real issue is gluing together a set of complex + dependent APIs into a single response.

Why we would ever need to store more than a blob of data and query using a flat key: we just want to glue some responses together, store them, then make it easy to refresh their cache in an async fashion.

I don’t want to be writing more database queries. And at that point I'm not sure what the difference is between just setting up a read only Mongo slave and talking to it directly?

I just want to say: "this page needs to hit these APIs in this order and then spit out a response that looks like ____."

joeyAghion commented 9 years ago

What are the specific bottleneck(s) these 2 experiments aim to address? Sounds like sitemaps were one driver. Were there others?

(Also, what's the "most time-consuming endpoint" you mention? I thought it was artwork-filtering or show-artworks, but this sounds different.)

Regarding databases, this is actually one case where I might (maybe) recommend MongoDB. It's actually a much better match for this than for our primary use of it for the API, since your source data is already JSON and you'll be serving it as recombined JSON. I imagine a really simple schema in which a URL serves as the primary key and the document is basically the original API response at that URL.

One concern I have about launching into this is that the orchestration layer may become the most expedient way to patch the API. In some cases, it would be better for the whole ecosystem if the API learned from those uses instead.

dzucconi commented 9 years ago

@joeyAghion

For me: It's not a question of there being a single bottleneck or a slow endpoint. The problem is that we need to make a ton of fetches to construct some pages and the latency alone is going to kill us.

The simple example I addressed in metaphysics: the artwork page needs the related shows/fairs/sales in order to construct the above the fold view: https://github.com/artsy/metaphysics/blob/d3876b2141770b9b11b788016d7d404f59a094ff/blobs/artwork.js#L11-L14

Currently in Force it's constructed in such a way where all the related bits are fetched client-side and it's a stupid mess because of that.

Another example, artist carousel: https://github.com/artsy/force/blob/master/apps/artist/carousel.coffee —Tons of stuff, gets cached in a single key, but the cache doesn't get purged until someone manually purges it or it expires.

Another example, artist navigation: https://github.com/artsy/force/blob/master/apps/artist/statuses.coffee —same deal.

As far as the DB goes—My perspective is that caching individual endpoints is just not necessary. If we're just caching those then we have to rebuild the response on every request from the cached data before sending it to the user. If we just cache the glued together blob then we can serve that first, and re-build it after the response. The only main annoyance I see with Redis is serialization.

joeyAghion commented 9 years ago

I understand the pattern but still want to know the "immediate" examples because they'll help us evaluate the success, I think. These are helpful. In fact, it would be great to launch something in production that handles just 1 of these to start.

I immediately assumed we'd want to cache individual ("raw") responses because:

More than 1 "orchestrated" response may need the same raw response. If an earlier orchestrated response cached a raw API request, other orchestrated responses can re-use it.
Different raw responses might be reasonably cached for different amounts of time (e.g., artworks change more often than partners).
API responses support caching headers like ETag or If-Modified-Since.

broskoski commented 9 years ago

To re-summarize the problems this approach needs to solve (in the context of the Artwork page):

The artwork page is among the most complex of all web (Force + MG) because of all the possible states it can be in (in a fair, in an auction, in an open auction + with e-commerce enable, inquireable, contactable, etc).
Right now we solve this with a hodge-podge of client-side conditionals that try to compute the state of the page asynchronously after the artworks related models are fetched (sale, fair, partner, etc)
We could get big wins just from orchestrating and caching these nested requests and returning the original responses + computed values (i.e. is_contactable). This is 90% of the issue we are trying to solve for
There is also a potential to solve the problem of "This artwork is in a Tier 1 gallery, the client should display it in a specific way". Solving this problem would complicate things significantly, we would have to make admin requests through the client, compute values and check if the user has permissions. If we can somehow manage to not have different caches of an object on a per-role basis, that would be ideal.

Regarding the specifics of implementation, I'm open. ES6 sounds great. I am also undecided on if duplicating artworks in another database helps or makes things more complicated.

Regarding name + open source, I'm open, as long as we aren't computing client-side values with sensitive data (i.e. carouselSize: -> artwork.partner.get('tier'))

craigspaeth commented 9 years ago

:+1: Thanks for the comments @dzucconi.

Promises vs. Callbacks

I could be down for generators, but haven't really worked with them to know the tradeoffs there. Do agree that with vanilla callbacks we'd want to make heavy use of async and build some toolkit for DRYing up aggregate response code (which we might still want to do with generators/promises), so dunno which one feels better to me yet—but open to all options.

Database

I guess I have an interest in persisting large collections of data b/c it would allow this layer to serve more use cases than, for lack of a better term, "just caching". Like it could serve other things where reusing stale data is okay—e.g. sitemaps and merging stale data with Google's search API to smooth over search result issues.

@joeyAghion: I definitely hear you that this could potentially be abused as a way to patch the API when changes downstream should ideally take place. I think we need to tread carefully in the beginning and default to core API changes first.

Also what I meant by most time consuming was in the NewRelic sense, e.g. most frequently hit vs. response time (or at-least I think that's what that means):

Gravity before deploying Fusion experiment:

after:

joeyAghion commented 9 years ago

I meant the same New Relic sense, but I must have been looking at the metrics after Fusion. Great!

BTW, there's no reason this layer can't both cache raw responses and save richer ones.

dzucconi commented 9 years ago

Yeah, I guess we're going to be writing a serializer/model thing for everything anyway then it does make sense to give the models their own caching adapters.

@joeyAghion As an immediate example; whatever service this evolves into would want to be able to replace the fetch/cache stuff we do to render the above the fold content on the artist page.

So just to render that bit on the server we need the following

# Get carousel figures:
# Gets wrapped up in the cache key: `artist:artist_id:carousel`
[
  # Get iconic works
  "GRAVITY_URL/api/v1/artist/:artist_id/artworks{ published: true }"

  # Get installation shots:
  "GRAVITY_URL/api/v1/related/shows?artist_id=:id{ size: 10, solo_show: true, top_tier: true, displayable: true, sort: '-end_at' }": [
    # Filter down to only the shows that have `images_count > 0`
    # Of those:
    [
      "GRAVITY_URL/api/v1/partner_show/:show_id/images{ default: false, size: 1 }"
      "GRAVITY_URL/api/v1/partner_show/:show_id/images{ default: false, size: 1 }"
      # ... up to 10 but not likely
    ]
  ]
]

# Get context of an artist (their 'statuses')
# Gets wrapped up in the cache key: `artist:artist_id:statuses`
[
  "GRAVITY_URL/api/v1/search/filtered/artist/:artist_id/suggest"
  "GRAVITY_URL/api/v1/related/shows{ size: 1, artist_id: artist_id[], sort: '-end_at', displayable: true }"
  "GRAVITY_URL/api/v1/related/layer/main/artists{ size: 1, artist_id[]: artist_id, exclude_artists_without_artworks: true }"
  "GRAVITY_URL/api/v1/related/layer/contemporary/artists{ size: 1, artist_id[]: artist_id, exclude_artists_without_artworks: true }"
  "FORCE_URL/artist/data/:artist_id/publications{ size: 1, artist_id[]: artist_id, merchandisable[]: false }"
  "FORCE_URL/artist/data/:artist_id/publications{ size: 1, artist_id[]: artist_id, merchandisable[]: true }"
  "FORCE_URL/artist/data/:artist_id/publications{ size: 1, artist_id[]: artist_id }"
  "FORCE_URL/artist/data/:artist_id/collections{ size: 1, artist_id[]: artist_id }"
  "FORCE_URL/artist/data/:artist_id/exhibitions{ size: 1, artist_id[]: artist_id }"
  "GRAVITY_URL/api/v1/artist/:artist_id/": [
    "POSITRON_URL/api/articles{ actual_artist_id: id, published: true, size: 1 }"
  ]
]

OK so let's say we have a magical thing that can parse the above.

It can expose those as endpoints OR we could then write a further thing for getting them at once:

# Get artist_page
[
  "GRAVITY_URL/api/v1/artist/:artist_id"
  "artist:artist_id:carousel"
  "artist:artist_id:statuses"
]

So: the single get would then serve the cached blob; falling thru to fetch where needed and then after serving the request; fan out all those individual fetches and rebuild the cache.

(We may need a real queue for managing that too?)

craigspaeth commented 9 years ago

I like your thinking Damon! Now that you paste in that pseudo-code it seems very likely we'll want to write some kind of DSL/tooling regardless of flow control approach (maybe even an opportunity for OSS library).

I also think we should aspire to be able to serve up the best response possible so going for the further getting at once rather than stopping at providing those each as endpoints (that said, 3 parallel fetches will probably take nearly the same time—however one fetch w/ all the data is still easier than dealing with parallel fetches).

Food for thought...if we end up storing & refreshing entire datasets of endpoints e.g. db.v1_artworks and db.v1_artists we could join those db/application-side instead of using "GRAVITY_URL/api/v1/artist/:artist_id/artworks{ published: true }". Of course that comes with it's own tradeoffs/complications—but maybe it's worth considering an approach where it's more like this layer crawls endpoints in the background, refreshes based on it's own internal querying logic, allows for it's own rich api queryability, and essentially hides the endpoints structure of Gravity.

I also wonder if we should allow this layer to take care of some of the logic in our backbone mixins (obviously the fetching stuff is natural, but maybe even some of the common data normalizing makes sense). Some of these ideas might be going too far—but just throwing these out for the sake of convo.

Finally to throw us off track with some buzz words here:

Would we consider not using Node—e.g. Clojure, Go, or Elixir?
Maybe it's worth experimenting with GraphQL + Relay and/or Netflix's Falcor, b/c the underlying philosophies cross over here

orta commented 9 years ago

to throw in a single naive buzzword, how does this mix with all the hypermedia v2 API stuff?

craigspaeth commented 9 years ago

So despite having my qualms with v2/hypermedia in the past, I think this would actually be a good candidate to take advantage of v2. e.g. If we decide to pursue something like download endpoints in the background and refresh individual resources, doing joining and such fusion-side, then v2 would lend itself well to this with it's cursor based pagination and highly normalized responses.

dzucconi commented 9 years ago

You're not talking about exposing whatever Fusion does to the client's hypermedia style right? (Which would be non-ideal to consume)

It's probably worth our time to just strap one of the graphql implementations to the mongodb and see what's up. Since that looks like it solves almost all of my issues.

craigspaeth commented 9 years ago

No no, I'm still talking about Fusion exposing data in a composed/graphql-ey manner (one big blob of json no _links) and composing it's blobs by crawling v2 + refreshing individual resources. The reason I say v2 lends itself well to this approach is b/c cursor based pagination should performantly allow crawling through an entire endpoint and the highly normalized responses would mean we wouldn't have to worry about cleaning out duplicated data as much.

dblock commented 7 years ago

We can close this as fusion is gone.

artsy / fusion-deprecated