Replace scraping mechanism with something faster

dblock commented 7 years ago

Maybe pull the database records directly from a mongodb gravity connection? The current scraper fetches page by page and requests a lot of empty pages towards the end.
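As a rough illustration of the empty-pages problem, here is a minimal sketch of a paging loop that stops at the first empty page rather than walking a fixed page count. It assumes an empty page means there is no more data. `fetch_page` is a hypothetical stand-in for whatever API call the scraper makes, and `fake_fetch_page` is an in-memory fake; neither is Fusion's or Gravity's actual code.

```python
from typing import Callable, Iterator, List


def scrape_all(
    fetch_page: Callable[[int], List[dict]],
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield records page by page, stopping at the first empty page.

    `fetch_page` is hypothetical: it takes a 1-based page number and
    returns a (possibly empty) list of records.
    """
    for page in range(1, max_pages + 1):
        records = fetch_page(page)
        if not records:
            # Assumption: an empty page means we are past the end,
            # so there is no point requesting the remaining pages.
            break
        yield from records


def fake_fetch_page(page: int, size: int = 2) -> List[dict]:
    """In-memory fake API holding five records, for demonstration only."""
    data = [{"id": i} for i in range(5)]
    start = (page - 1) * size
    return data[start:start + size]


if __name__ == "__main__":
    for record in scrape_all(fake_fetch_page):
        print(record)
```

With the fake above, the loop makes four requests (three pages with data, one empty page) and then stops.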

craigspaeth commented 7 years ago

👏 for the Cinder sitemap work! Please feel free to delete this repo/heroku app at your convenience.

I'll leave some quick historical context/clarity around Fusion here for those interested. Fusion was an experiment pre-Metaphysics to address two things:

1. Web Engineering had kicked around the idea of an orchestration API on top of Gravity to serve cached blobs of data, avoiding all the ugly crawling/caching code in Force & friends (GraphQL was barely a thing at the time).
2. We were implementing a bunch of new sitemaps and needed to get them from multiple services (namely, but not necessarily limited to, Positron and Gravity).

These two "efficiently serve aggregated data from multiple services" use cases seemed close enough to attack with the same layer, so I went off to try Fusion out on solving 2 with the intent to expand to 1. Ironically, Damon simultaneously felt it was about time to try out a solution for 1; we had a chuckle and discussed how to merge our efforts. Of course Metaphysics became a big success and we never got around to folding Fusion's use case back into it.

I explained some of the reasoning for choosing dynamic sitemaps powered by APIs here, but in summary: it was simpler to maintain, had fresher data, and didn't require direct database access (at the cost of running the occasional expensive query on said APIs). Fusion's scraping was necessary because we couldn't query artworks by date range in Gravity, and didn't want to expose the performance impact of doing so. Having a slowly updated, stale cache of API data also seemed like a useful feature for an orchestration layer, hence the reason for exploring that approach.

Hope that helps clear some things up about Fusion. Glad to see us take this service out of the mix and leverage Cinder for this work now 👌.

dblock commented 7 years ago

Thanks for the story @craigspaeth, that makes a lot of sense! I think in many ways Cinder/Spark/Hadoop are fundamentally the same thing when it comes to data: a copy of all Artsy data from all over the place. Spark adds a parallel processing layer that makes things extremely fast, i.e. generating sitemaps in a non-dynamic way with a turnaround of a few minutes, so we don't need to query live data anymore to generate them, except in rare small cases like news articles.

Another interesting aspect is that sitemaps are supposed to be large, with up to tens of thousands of URLs, which generally fits poorly with the performance constraints of a dynamic API of any kind and is much better suited to statically generated files.
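To make the static approach concrete, here is a minimal sketch of splitting a large URL list into sitemap files plus an index. It is not the actual Cinder/Spark job; the file names and the `build_static_sitemaps` helper are made up for illustration. The 50,000 URLs-per-file cap comes from the sitemaps.org protocol.

```python
from typing import Dict, Iterable, Iterator, List
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split a stream of URLs into lists of at most `size` items."""
    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def render_sitemap(urls: List[str]) -> str:
    """Render one <urlset> file for a batch of URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def render_index(sitemap_urls: List[str]) -> str:
    """Render the <sitemapindex> file pointing at each sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


def build_static_sitemaps(
    urls: Iterable[str],
    base_url: str,
    size: int = MAX_URLS_PER_FILE,
) -> Dict[str, str]:
    """Return {file name: XML content}; file names here are hypothetical."""
    files: Dict[str, str] = {}
    for n, batch in enumerate(chunk(urls, size), start=1):
        files[f"sitemap-{n}.xml"] = render_sitemap(batch)
    # The list of names is built before the index itself is added.
    index_entries = [f"{base_url}/{name}" for name in files]
    files["sitemap-index.xml"] = render_index(index_entries)
    return files
```

The resulting files can be written once to static storage and served as-is, so no API has to answer a tens-of-thousands-of-rows query at request time.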

artsy / fusion-deprecated

Replace scraping mechanism with something faster #9