Closed talos closed 7 years ago
There are important changes in this PR to take into account:
This new dependency on a Python library is going to make the blue/green deployment more difficult, because we don't have virtualenvs for different deployed versions. Is this Python library necessary, @talos? Are you planning to add more logic to it? Maybe we should move the library logic inside those functions; I know it's uglier, but for now it could be an option.
I'm changing the Makefile to mimic the crankshaft one, because the observatory extension has the same structure as crankshaft.
We should talk with @luisbosque to inform him and the systems team about these two new dependencies.
// @rafatower @rochoa
These are good points.
First off, the performance improvements are unrelated to PL/Python and the Python library. If necessary, I can chop this out of the observatory-extension release for now.
OTOH, if it's not too much work to include a special requirements.txt, I can definitely move the logic inside a PL/Python function. Unlike Crankshaft, there's really very little processing or reusable Python code -- these are just thin wrappers over HTTP requests, in this case to the OSM API.
If having special requirements is going to be a major deployment headache, I propose eliminating the Overpass code for now, and possibly reimplementing it directly in PL/Python using a builtin HTTP library in Python (or requests, if it's already available).
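To give a sense of how thin the wrapper really is, here's a minimal sketch of the builtin-HTTP route, using only the Python standard library (function names, the `amenity` tag, and the endpoint URL are illustrative, not taken from this PR):

```python
# Sketch: an Overpass API call with no third-party dependencies.
# The body of fetch_overpass() is roughly what would live inside a
# PL/Python function, so no extra package ships with the extension.
import json
import urllib.parse
import urllib.request

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public Overpass endpoint


def build_overpass_query(amenity, bbox):
    """Build a minimal Overpass QL query for nodes carrying a given
    amenity tag inside a (south, west, north, east) bounding box."""
    south, west, north, east = bbox
    return (
        '[out:json];'
        'node["amenity"="{a}"]({s},{w},{n},{e});'
        'out;'
    ).format(a=amenity, s=south, w=west, n=north, e=east)


def fetch_overpass(query, timeout=25):
    """POST the query to the Overpass endpoint and decode the JSON reply."""
    data = urllib.parse.urlencode({"data": query}).encode("utf-8")
    with urllib.request.urlopen(OVERPASS_URL, data=data, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

That's the entire surface area: one query builder and one HTTP call, which is why inlining it (rather than packaging a library) is plausible.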
In summary, three options:
In the end (3) would be the most luxe solution, as it could be very valuable to take advantage of 3rd party API wrappers. But if this is going to seriously delay stuff, I prefer (1) because the performance fixes and new API methods are the main value-add here.
Thoughts?
From my point of view, I'll chop the plpython/python code from this deployment and make a "special" deployment just for the new python code in order to think about it exclusively.
I wouldn't create a new library for that tiny amount of code; I'd include it inside the plpython functions. In the future, if necessary, we could extract it into a new library and think about how to make the blue/green deployment work with that dependency.
OK, that sounds good. Do you want me to do the chop?
Yes please; it's a huge PR and I don't have all the context.
👍 on it.
@ethervoid Take another look? Having an issue with the removed dependency preventing tests from running, but it should be good -- I'll ping again when I get tests running.
Ok, on it thx!
What's the 60_observatory_providers.sql file for?
It was a stub for adding external data providers; a sample OSM provider was included.
The refactor emerged in part from trying to make our functions flexible enough to return both data and geometries from a third-party provider. However, it's not a prerequisite for the improvements delivered to the rest of the Observatory.
@rafatower If you meant why there's an empty file sitting there now: if a build was done while that file existed, removing the file does not, for some reason, erase its code from the generated build file -- the 60_* file has to be replaced with a blank one instead.
@ethervoid Resolved, we need an empty file where the old one was to make sure dead code doesn't persist in the build artifact (!)
Tests are all passing locally.
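As an illustration of the workaround (the path below is hypothetical; the real repo layout may differ), the fix amounts to shipping a zero-byte placeholder where the removed file used to be:

```python
# Hypothetical sketch: keep an empty placeholder file in place of the
# removed 60_* file, so the concatenated build artifact no longer
# carries the old provider code.
from pathlib import Path

placeholder = Path("src/pg/sql/60_observatory_providers.sql")  # assumed path
placeholder.parent.mkdir(parents=True, exist_ok=True)
placeholder.write_bytes(b"")  # zero bytes: nothing to concatenate into the build
```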
I meant both, thanks for the explanation.
IMO the OBS should do what the OBS does well, and not rely on 3rd-party services to fulfill its API contract. I'd stick to some design principles, such as SRP (single responsibility) and ISP (interface segregation).
Beyond principles, consider that limiting the API to a set of cohesive functions is what lets us optimize and guarantee a certain quality of service. If you introduce an external dependency, you lose control over your own API. Plus, there's on-premises to consider.
We already use some (a lot of) plpython code for normal DS-API operation. So please, if you really need plpython or any extra dependencies, let's discuss it before proceeding.
@rafatower We should talk about it tomorrow.
What we're hitting up against is that there are datasources that make logical sense to integrate into the Observatory APIs, since they can play nicely with metadata (say, weather data, or soon, point-of-interest data that may be affordable or available only on a request-by-request basis), but which simply cannot be ETL'd or distributed using existing mechanisms.
Sure, let's talk about those.
@talos I've removed the 60_* file; it's not needed. We generate the release build anytime we want, so the dead code isn't a problem.
Sounds good. I've since rebuilt locally, so it's not necessary anymore. Thanks!
Request for a new Data Observatory extension deploy
I'd like to request a new data observatory extension deploy: dump + extension
Dump database id to be deployed
Please put here the dump id to be deployed: link removed
This includes new ACS data (2011-2015)
This is the same dump as from #240, #242, and #244, which were never deployed. There is no new data since then.
Performance tests
link removed#d745f07cac9ac00d25be1c6e7b5e750413b4bba9..0e4a51475361914279500ad285c9d3f611b3f31c
Takeaways:
- `OBS_GetMeasure` with no specified boundary (which is what we currently have in the UI) improves 60-70% for simple geometries, from between 15 and 20 QPS to over 25 QPS. For complex geometries, improvements are on the order of 20-40%, from 1.25-2 QPS to 1.5-2.5 QPS.
- `OBS_GetMeasureByID` performance regresses considerably, since we now do a more complicated metadata check per row. This function isn't exposed anywhere in the UI, and still performs at over 200 QPS.
- `OBS_GetCategory` performance for complex geometries improves dramatically, from a fraction of 1 QPS to between 6 and 10.
- `OBS_GetCategory` performance for simple geometries regresses considerably, from over 400 QPS to about 120. This is again because of slower metadata lookups.
- All of the above deliver acceptable row-by-row performance, considering the all-in-one `OBS_GetData` function delivers between 150 and 400 QPS, improving as more rows are run through it.
- A nice value-add is the improvement in simple and complex row-by-row `OBS_GetMeasure`, which will be enjoyed immediately in the existing UI.

Data Observatory extension PRs included
#239
#241
#243
#245
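For reference, the percentage figures in the takeaways follow the usual relative-change formula; a quick sketch (input numbers copied from the takeaways above):

```python
# Relative change between old and new throughput, as a percentage.
def pct_change(old_qps, new_qps):
    return 100.0 * (new_qps - old_qps) / old_qps

# OBS_GetMeasure, simple geometries: 15 QPS -> 25 QPS at the low end
print(round(pct_change(15, 25)))    # 67 (matches the top of the quoted 60-70%)

# OBS_GetCategory, simple geometries: ~400 QPS -> ~120 QPS
print(round(pct_change(400, 120)))  # -70 (the considerable regression noted)
```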
// @CartoDB/dataservices
Supersedes #240, #242, and #244.