Closed talos closed 7 years ago
There are important changes in this PR to take into account:
This new dependency on a Python library is going to make the blue/green deployment more difficult, because we don't have virtualenvs for different deployed versions. Is this Python library necessary, @talos? Are you planning to add more logic to it? Maybe we should move the library logic inside those functions; I know it's uglier, but for now it could be an option.
I'm changing the Makefile to mimic the crankshaft one, because the observatory extension has the same structure as crankshaft.
We should talk with @luisbosque to inform him and the systems team about these two new dependencies.
// @rafatower @rochoa
These are good points.
First off, the performance improvements are unrelated to PL/Python and the Python library. If necessary, I can chop this out of the observatory-extension release for now.
OTOH, if it's not too much work to include a special requirements.txt, I can definitely move the logic inside a PL/Python function. Unlike Crankshaft, there's really very little processing or reusable Python code -- these are just thin wrappers over HTTP requests, in this case to the OSM API.
If having special requirements is going to be a major deployment headache, I propose eliminating the Overpass code for now, and possibly reimplementing it directly in PL/Python using a builtin HTTP library in Python (or requests, if it's already available).
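To give a sense of how thin the wrapper really is, here's a minimal sketch of the builtin-HTTP route, using only the Python standard library (function names, the `amenity` tag, and the endpoint URL are illustrative, not taken from this PR):

```python
# Sketch: an Overpass API call with no third-party dependencies.
# The body of fetch_overpass() is roughly what would live inside a
# PL/Python function, so no extra package ships with the extension.
import json
import urllib.parse
import urllib.request

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public Overpass endpoint


def build_overpass_query(amenity, bbox):
    """Build a minimal Overpass QL query for nodes carrying a given
    amenity tag inside a (south, west, north, east) bounding box."""
    south, west, north, east = bbox
    return (
        '[out:json];'
        'node["amenity"="{a}"]({s},{w},{n},{e});'
        'out;'
    ).format(a=amenity, s=south, w=west, n=north, e=east)


def fetch_overpass(query, timeout=25):
    """POST the query to the Overpass endpoint and decode the JSON reply."""
    data = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(OVERPASS_URL, data=data, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

That's the entire surface area: one query builder and one HTTP call, which is why inlining it (rather than packaging a library) is plausible.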
In summary, three options:
In the end (3) would be the most luxe solution, as it could be very valuable to take advantage of 3rd party API wrappers. But if this is going to seriously delay stuff, I prefer (1) because the performance fixes and new API methods are the main value-add here.
Thoughts?
From my point of view, I'll chop the plpython/python code from this deployment and make a "special" deployment just for the new python code in order to think about it exclusively.
I wouldn't create a new library for that tiny amount of code; I'd include it inside the plpython functions. In the future, if necessary, we could extract it into a new library and think about how to make the blue/green deployment work with that dependency.
OK, that sounds good. Do you want me to do the chop?
Yes please; it's a huge PR and I don't have all the context.
👍 on it.
@ethervoid Take another look? Having an issue with the removed dependency preventing tests from running, but it should be good -- I'll ping again when I get tests running.
Ok, on it thx!
What's the 60_observatory_providers.sql file for?
It was a stub for adding external data providers; a sample OSM provider was included.
The refactor emerged in part from trying to make our functions flexible enough to return both data and geometries from a third-party provider. However, it's not a prerequisite for the improvements delivered to the rest of the Observatory.
@rafatower If you meant why there's an empty file sitting there now: if a build was done while that file existed, removing the file does not, for some reason, erase its code from the generated build file -- the 60_* file has to be replaced with a blank one instead.
@ethervoid Resolved, we need an empty file where the old one was to make sure dead code doesn't persist in the build artifact (!)
Tests are all passing locally.
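As an illustration of the workaround (the path below is hypothetical; the real repo layout may differ), the fix amounts to shipping a zero-byte placeholder where the removed file used to be:

```python
# Hypothetical sketch: keep an empty placeholder file in place of the
# removed 60_* file, so the concatenated build artifact no longer
# carries the old provider code.
from pathlib import Path

placeholder = Path("src/pg/sql/60_observatory_providers.sql")  # assumed path
placeholder.parent.mkdir(parents=True, exist_ok=True)
placeholder.write_bytes(b"")  # zero bytes: nothing to concatenate into the build
```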
I meant both, thanks for the explanation.
IMO the OBS should do what the OBS does well, and not rely on 3rd-party services to fulfill its API contract. I'd stick to some design principles, such as SRP (single responsibility) and ISP (interface segregation).
Beyond principles, consider that limiting the API to a set of cohesive functions is what lets us optimize and guarantee a certain quality of service. If you introduce an external dependency, you lose control over your own API. Plus, there's on-premises to consider.
We already use some (a lot of) plpython code for normal DS-API operation. So please, if you really need plpython or any extra dependencies, let's discuss it before proceeding.
@rafatower We should talk about it tomorrow.
What we're hitting up against is that there are datasources that make logical sense to integrate into the Observatory APIs, since they can play nicely with metadata (say, weather data, or soon, point-of-interest data that may be affordable or available only on a request-by-request basis), but which simply cannot be ETL'd or distributed using existing mechanisms.
Sure, let's talk about those.
@talos I've removed the 60_* file; it's not needed. We generate the release build anytime we want, so the dead code isn't a problem.
Sounds good. I've since rebuilt locally, so it's not necessary anymore. Thanks!
Request for a new Data Observatory extension deploy
I'd like to request a new data observatory extension deploy: dump + extension
Dump database id to be deployed
Please put here the dump id to be deployed: link removed
This includes new ACS data (2011-2015)
This is the same dump as from #240, #242, and #244, which were never deployed. There is no new data since then.
Performance tests
link removed#d745f07cac9ac00d25be1c6e7b5e750413b4bba9..0e4a51475361914279500ad285c9d3f611b3f31c
Takeaways:
- `OBS_GetMeasure` with no specified boundary (which is what we currently have in the UI) improves 60-70% for simple geometries, from between 15 and 20 QPS to over 25 QPS. For complex geometries, improvements are on the order of 20-40%, from 1.25-2 QPS to 1.5-2.5 QPS.
- `OBS_GetMeasureByID` performance regresses considerably, since we now do a more complicated metadata check per row. This function isn't exposed anywhere in the UI, and still performs at over 200 QPS.
- `OBS_GetCategory` performance for complex geometries improves dramatically, from a fraction of 1 QPS to between 6 and 10.
- `OBS_GetCategory` performance for simple geometries regresses considerably, from over 400 QPS to about 120. This is again because of slower metadata lookups.
- All of the above deliver acceptable row-by-row performance, considering the all-in-one `OBS_GetData` function delivers between 150 and 400 QPS, improving as more rows are run through it.
- A nice value-add is the improvement in simple and complex row-by-row `OBS_GetMeasure`, which will be enjoyed immediately in the existing UI.

Data Observatory extension PRs included
#239
#241
#243
#245
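For reference, the percentage figures in the takeaways follow the usual relative-change formula; a quick sketch (input numbers copied from the takeaways above):

```python
# Relative change between old and new throughput, as a percentage.
def pct_change(old_qps, new_qps):
    return 100.0 * (new_qps - old_qps) / old_qps

# OBS_GetMeasure, simple geometries: 15 QPS -> 25 QPS at the low end
print(round(pct_change(15, 25)))    # 67 (matches the top of the quoted 60-70%)

# OBS_GetCategory, simple geometries: ~400 QPS -> ~120 QPS
print(round(pct_change(400, 120)))  # -70 (the considerable regression noted)
```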
// @CartoDB/dataservices
Supersedes #240, #242, and #244.