CartoDB / cartodb

Location Intelligence & Data Visualization tool
http://carto.com
BSD 3-Clause "New" or "Revised" License

Data observatory frontend integration #11663

Closed nobuti closed 7 years ago

nobuti commented 7 years ago

In order to revamp the Data Observatory analysis form (design here https://invis.io/5G9YKGISB), the current SQL call is not enough, and we need to use all the helpers already available in the API.

@talos we need some guidance; let me ask some questions about the Data Observatory:

A measurement has a structure like this, according to the API response:

{
    "numer_id": "xxx",
    "numer_name": "...",
    "numer_description": null,
    "numer_weight": "3",
    "numer_license": null,
    "numer_source": null,
    "numer_type": "Numeric",
    "numer_aggregate": "sum",
    "numer_extra": null,
    "numer_tags": {
        "section/tags.ca": "Canada",
        "unit/tags.people": "People",
        "subsection/tags.language": "Language",
        "license/ca.statcan.license.statcan-license": "Statistics Canada Open Licence Agreement",
        "source/ca.statcan.license.statcan-census-2011": "Statistics Canada Census of Population 2011"
    },
    "valid_denom": false,
    "valid_geom": false,
    "valid_timespan": false
}

@rafatower do you see any trouble getting the quota info? Keep in mind we can have multiple measurements embedded in the same form.

urbanphes commented 7 years ago

@nobuti I'll try to answer some questions:

analysis - enrich from data observatory 05 https://projects.invisionapp.com/d/main#/console/10012782/220110025/preview

nobuti commented 7 years ago

@urbanphes thanks, but I meant from the API point of view: is that available in the response data returned by the API?

rafatower commented 7 years ago

@rafatower do you see any trouble getting the quota info? Keep in mind we can have multiple measurements embedded in the same form.

The only thing that you should take into account is that in order to calculate the quota to be used you have to multiply N rows x M measurements.
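A quick sketch of that calculation; the table name and the measurement count are placeholders, not part of any real API:

```sql
-- Estimated quota usage: N rows in the target table times M selected
-- measurements. Here M = 3 is an arbitrary illustrative value.
SELECT count(*) * 3 AS estimated_quota_usage
FROM my_dataset;
```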

talos commented 7 years ago

Let me take a stab at answering...

Could you point us to the proper documentation? I know we have all signatures listed here https://github.com/CartoDB/dataservices-api/blob/development/client/renderer/interface.yaml but it would be great to know which method to use when.

The best place to look for right now is in the OBS Explorer, which provides a way to navigate metadata visually and is powered by the dimensional metadata functions (OBS_GetAvailableNumerators, OBS_GetAvailableDenominators, OBS_GetAvailableGeometries, and OBS_GetAvailableTimespans). The relevant code for that is here.

I've also made a ticket that I will take care of today for full documentation for these functions.

How do we need to manage the geometry param? I can see some methods accept a bounds param of type geometry, but not sure how to deal with it. Can we pass a country name? Do you have an example of this param using the current bounding box?

The geometry param has to take a geometry, not a country name. The example above uses the current user's bounds. My thought would be we should use the greater of the current map view and the user's own bounds. Using either an envelope or a geometry is fine; if we use a proper geometry, then measures that might be inside an envelope but not that specific geometry will be excluded.
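One way to cover both the current map view and the user's own bounds is to pass the envelope of their union; this is only a sketch, and the table name and coordinates are placeholders:

```sql
SELECT * FROM OBS_GetAvailableNumerators(
  -- Envelope covering both the current map view and the user's data extent.
  ST_Envelope(ST_Union(
    ST_MakeEnvelope(-74.3, 40.5, -73.7, 40.9, 4326),                 -- map view
    ST_SetSRID((SELECT ST_Extent(the_geom) FROM my_dataset), 4326)   -- data bounds
  ))
);
```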

Related to the above, how do we get the regions available? If you see the design, there is a number next to the region's name; I guess it's the number of measurements available in that region. Does the API provide this number?

Regions available are provided via tags. I will provide SQL that will provide this rollup.

We can't see the new column name to augment the dataset, how does it work?

Once a user's selection is made, you should call the OBS_GetMeta function. This will provide all missing information, including a colname that you can work with. However, it still would be wise to append some additional metadata to construct your own colname. I will provide an example of a strategy to do this.

In order to implement the filters, should we use tags?

Yes. The filters as envisioned in the mockups are the subsection tag type.
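Assuming the tag-filter argument of OBS_GetAvailableNumerators accepts subsection tags (the tag id and table name here are illustrative), filtering measurements to one filter category might look like:

```sql
SELECT * FROM OBS_GetAvailableNumerators(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM my_dataset), 4326),
  '{subsection/tags.income}'  -- AND'ed array of tags to filter by
);
```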

I see the description is null in almost every result; how do we get the description of a measurement? And for filters?

Some measurements don't require an extended description, and for those we return NULL. Names should always be human readable, and in such cases where description is NULL the name should be sufficient.

I'm not quite sure what you mean by a description for filters.

In order to run the analysis, what parameters are mandatory? region and measurement? all but normalize?

In order to run the analysis, OBS_GetMeta needs to be run first to "fill out" all metadata fields that the user hasn't yet specified (including some internal ones the user could never specify). OBS_GetMeta can be operated with a very minimal set of options -- just a numer_id and a geometry is sufficient -- and it will return a rich JSON blob with all the other information filled in, which could be used to prepopulate dropdowns if necessary.
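Based on that description, a minimal OBS_GetMeta call might look like the following sketch (the table name and numer_id are illustrative):

```sql
SELECT OBS_GetMeta(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM my_dataset), 4326),
  '[{"numer_id": "us.census.acs.B01003001"}]'  -- minimal input: just a numer_id
);
-- Returns a JSON blob with the remaining fields (denominator, geometry,
-- timespan, etc.) filled in.
```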

talos commented 7 years ago

Documentation for OBS_GetMeta can be found here.

nobuti commented 7 years ago

Thanks @talos, what I meant by description for filters is this. In the second step, on the measurements dropdown, the user can access the filters:

(screenshot: filter-descriptions)

talos commented 7 years ago

Got it. I'll include examples of how we can get that, too.

talos commented 7 years ago

@nobuti I've put together an extensive gist that answers all your questions, as well as provides sample API functions for all elements of the interface where they're needed.

Unfortunately, I was not able to populate the responses yet because of the ongoing S3 outage. Once that's fixed, I can go back and populate the missing example responses.

talos commented 7 years ago

@nobuti Full docs for those functions are now available here and should hopefully be merged in ASAP.

xavijam commented 7 years ago

@talos, in this request:

SELECT * FROM OBS_GetAvailableNumerators(
  <View or data boundary>,
  <Postgres Array of tags to filter by, AND'ed>
) numbers

Could you explain an example using view and another using data boundary?

nobuti commented 7 years ago

Also, this parameter is coming from the regions field in the UI?

talos commented 7 years ago

@xavijam I'm not sure I understand -- it works the same way with view or data boundary, you're passing in a Geometry to the API. I don't know internally how you obtain either from the user.

@nobuti The tags to filter by section should come from the regions query, in particular the region_id. It could be NULL if we had the "data extents" option, but if we don't it will always be populated.

xavijam commented 7 years ago

What is "view"? Could you paste an example of that request? And for the data boundary, is it the bounds of the region using ST_MakeEnvelope?

talos commented 7 years ago

Example of making this request by view:

SELECT * FROM OBS_GetAvailableNumerators(
  ST_MakeEnvelope(<map left>, <map bottom>, <map right>, <map top>, 4326)
)

By data boundary:

SELECT * FROM OBS_GetAvailableNumerators(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM <usertable>), 4326)
)

Does that make sense @xavijam ? The same semantics would apply for all the other meta functions.

nobuti commented 7 years ago

@talos we are a bit concerned about the number of requests needed to run the analysis. I know pretty much nothing, so let me dream here. And forgive me for it.

Ideally, we would only need one request, similar to what we have now using obs_legacybuildermetadata. Is this possible?

If that's not possible, we should at least save some requests. For example, I think it's possible to avoid the filters request: a measurement already has a subsection/tag.whatever that is precisely the filter it "belongs to". It would be nice to have a unique key for all of them, to simplify the filtering in the frontend. In the same manner, we should try to add the timespan, normalize, and boundaries information to this.

On the other hand, the measurements API call has by far the heaviest response. We should try to get rid of any data not needed for the UI. For instance, a measurement has this structure:

{
  "numer_description": "",
  "numer_id": "ca.statcan.cols_nhs.t013c049_f",
  "numer_name": "Income taxes paid as a % of total income (female)",
  "numer_weight": "1",
  "numer_license": null,
  "numer_source": null,
  "numer_type": "Numeric",
  "numer_aggregate": "quantile",
  "numer_extra": null,
  "numer_tags": {
    "license/ca.statcan.license.statcan-license": "Statistics Canada Open Licence Agreement",
    "section/tags.ca": "Canada",
    "source/ca.statcan.license.statcan-nhs-2011": "Statistics Canada National Household Survey 2011",
    "subsection/tags.income": "Income",
    "unit/tags.ratio": "Ratio"
  },
  "valid_denom": true,
  "valid_geom": false,
  "valid_numer": true,
  "valid_timespan": false
}

I can see some keys that don't have a direct translation from the UI point of view, such as numer_weight, numer_license (I think the real license is part of numer_tags.license/whatever), numer_source, numer_type, numer_aggregate, numer_extra or unit/tags.ratio, valid_denom, valid_geom... Do we need them to run the analysis?

And finally, I find the API a bit overwhelming from the frontend point of view. Could we have some wrappers to encapsulate all the logic? Below I write down the signatures of these wrappers (keep in mind that their names are just suggestions, and maybe they are even wrong or incomplete due to my lack of understanding).

Like I said, in the best case scenario, we would need this request:

If that's not possible, it would be nice to have two requests max: one for regions and one for measurements:

As a last resort, if none of the above is possible, we will have another 3 requests to get the rest of the data:

What do you think? Is any of this possible?

talos commented 7 years ago

Thanks for all the comments! Very much appreciated.

There are a few columns in OBS_GetAvailableNumerators that, as you correctly note, are not especially needed. Some, like weight and license, could simply be eliminated.

While I generally prefer SELECT *, if you wanted to reduce the weight of the response the simplest approach would be to SELECT only the columns you needed -- excluding the license, weight, valid, etc.
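A sketch of that trimmed-down request, selecting only the columns the UI needs (the table name is a placeholder):

```sql
SELECT numer_id, numer_name, numer_description, numer_tags
FROM OBS_GetAvailableNumerators(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM my_dataset), 4326)
);
```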

@saleiva mentioned the responses are coming in heavier than you'd like. Do you have an example request so I could think of ways to trim it down?

I'm not sure I see how we could get rid of requests subsequent to the main measurements one (geometry, timespan, normalization). Geometries, timespans, and normalizations are all shared between measurements, so to include them nested would explode the size of the response. And each time a different measurement is selected, the exact set that works changes, so I don't see any way to avoid subsequent requests on selection change.

For example, if someone chooses "Total Population" in the USA, then "2011 - 2015" for timespan, they will have county, census tract, zip code, and many other geometry options.

However, if they switch their timespan to "2006 - 2010", they will no longer have zip code as an option, as that's not a geometry that was published by the Census in that release. I don't see any practical way to capture that information without either very large, complex nested objects (which will exacerbate existing response sizes) or making several requests (which is what we're doing now).

There may be some reasoning behind combining the initial region selector and measurements options into one request. Indeed, it should be possible to rewrite that SQL into a nested version that could achieve that. However, it would likely again somewhat bloat response sizes.

That would look something like this:

One could argue for having that first one be absorbed into the extension API, but to me it seems very specific to the Builder implementation and thus better left as custom SQL outside the extension.

nobuti commented 7 years ago

Perfect @talos, we will do that. I will certainly need your expertise for specifics along the way, so forgive me in advance for repeated pings in the future. If you don't mind, let's leave this issue open for those questions.

talos commented 7 years ago

Sounds good to me! @nobuti would you like me to draft that obs_measurements_by_region function?

nobuti commented 7 years ago

@talos what I meant is that we are going to use the API without any wrapper or extra function, don't worry. I'm going to need some help to fine-tune the queries, though.

nobuti commented 7 years ago

@talos I have this call for measurements by region according to the docs:

http://cdb.localhost.lan:8080/api/v2/sql?q=SELECT numer_id, numer_name, numer_description, numer_tags FROM OBS_GetAvailableNumerators(ST_SetSRID((SELECT ST_Extent(the_geom) FROM {dataset}), 4326), '{section/tags.united_states}') numers&api_key={api_key}

This yields:

{
  "rows": [
    {
      "numer_id": "us.zillow.AllHomes_Zhvi",
      "numer_name": "Zillow Home Value Index for All homes",
      "numer_description": null,
      "numer_tags": "{\"unit/tags.index\": \"Index\", \"subsection/tags.housing\": \"Housing\", \"section/tags.united_states\": \"United States\", \"subsection/us.zillow.indexes\": \"Zillow Home Value and Rental Indexes\", \"source/us.zillow.zillow-source\": \"Zillow Data\", \"license/us.zillow.zillow-license\": \"Zillow Terms of Use for \\\"Aggregate Data\\\"\"}"
    },
    ...
  ]
}

talos commented 7 years ago

Just to confirm, can we consider subsection/tags% the filter the measurement belongs to?

That is correct

I found some measurements don't have subsection/tags% in my local environment, is this possible? Does this mean this measurement doesn't have a filter to be categorised?

That is also correct. We have some measurements, for example the quantiles for the USA, that are not currently attached to a subsection. This means that they don't appear in the Catalog, and it would be best to exclude them entirely for the time being.
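Assuming numer_tags comes back as JSON (the cast and the key pattern here are assumptions, and the table name is a placeholder), one way to exclude those subsection-less measurements in SQL might be:

```sql
SELECT numer_id, numer_name, numer_tags
FROM OBS_GetAvailableNumerators(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM my_dataset), 4326)
) numers
-- Keep only measurements attached to at least one subsection tag.
WHERE EXISTS (
  SELECT 1 FROM jsonb_object_keys(numer_tags::jsonb) AS k
  WHERE k LIKE 'subsection/%'
);
```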

nobuti commented 7 years ago

@talos one question related to the data boundary. This query to get measurements:

SELECT * FROM OBS_GetAvailableNumerators(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM <usertable>), 4326)
)

if there is already an analysis applied to that node, should it remain as it is? I mean, in order to get the data boundary, should we use ST_SetSRID((SELECT ST_Extent(the_geom) FROM <usertable>), 4326) regardless of any custom queries or analyses already applied?

talos commented 7 years ago

Good question @nobuti . I defer to @rochoa's knowledge of camshaft internals on that one -- we should be using whatever query would get us the data extents from the last analysis in the pipeline.

rochoa commented 7 years ago

@talos, sorry, but how is this related to camshaft?

@nobuti, I guess we can use the same approach we use for determining the geometry type of an analysis. However, this might be a bit more problematic, as doing an ST_Extent means we need to full-scan the analysis query/table (given that no index is created and we cannot use ST_EstimatedExtent for the queries).
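For reference, the two PostGIS options being contrasted; ST_EstimatedExtent reads planner statistics instead of scanning rows, but only works on real tables that have been ANALYZEd, which is why it's unavailable for arbitrary analysis queries (the table name is a placeholder):

```sql
-- Fast: uses planner statistics; requires a real table and a prior ANALYZE.
SELECT ST_EstimatedExtent('public', 'my_dataset', 'the_geom');

-- Accurate but slow on large inputs: full scan, works on any query.
SELECT ST_Extent(the_geom) FROM my_dataset;
```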

talos commented 7 years ago

@rochoa It may not be, but I thought it may be and that you would know. It sounds like the answer is "no," then. I asked because I have not written these UI components before, and am not a good source of truth for how to introspect user data. I thought that since the question was related to analysis chains, it may involve Camshaft.

It sounds like you answered the question. I don't have any insight as to how to provide a more performant spatial extent in the absence of indexes. I suppose we could use the extent from the original table and see if any funky edge cases crop up. For the most part, the data should still bear a strong relation to the original user data even after several chained analyses.

rochoa commented 7 years ago

@talos Thanks :). I was wondering if we could provide anything from camshaft to facilitate the solution, but as you might have realised, it's not that easy.

About using the original table extent, that could work in some scenarios, but once we apply some filters (via widgets) to source nodes, the extent will not be very representative, I guess.

talos commented 7 years ago

The more I think about it, the more I think it's OK to use the original table extents. Widget filters are normally applied after the analysis pipeline is built instead of before, and besides people should be able to pick the correct country if we default to the wrong one.

xavijam commented 7 years ago

Hey guys, we have a question about the column that is created for the DO analysis. It seems @talos mentioned that we need to make a request in order to obtain the name of the column. But on the other side, camshaft is expecting the name of the column: https://github.com/CartoDB/camshaft/blob/master/reference/versions/0.47.0/reference.json#L283.

From our point of view, we have to change this in one place or the other. It doesn't make sense to make a request only to later provide the result to camshaft. Could we just set the column the user wants, like before? Or not provide it at all and just populate the one that the DO provides?

talos commented 7 years ago

That camshaft node doesn't support multiple column outputs.

@rochoa we'll need a new Camshaft node to handle the new DO interface, no?

I'd be in favor of not letting the user pick, as opposed to forcing the user to pick each time -- remember, we can come up with high-quality names, and they might select a lot of different columns.

rochoa commented 7 years ago

@talos, yes, we need a new node:

talos commented 7 years ago

@rochoa Unless we can make multiple SQL calls in a single Camshaft node, this will be problematic -- basic issue is that neither pure SQL nor PL/pgSQL can handle returning arbitrary columns. That's why obs_getdata was implemented to return JSON.

However, if we can make two separate calls, and do the custom query writing in JS, then we're OK.

Happy to hop on a quick call tomorrow to try to figure out an implementation strategy.

rochoa commented 7 years ago

@talos, let's talk today. However, Camshaft's current design doesn't allow doing more than one query per node.

cc @jorgesancha

talos commented 7 years ago

Yes, let's talk today then.

It's worth pointing out that we still did more than one query with the old design too. I think we could reuse that model -- make the metadata calls in the UI, then pass the results of those to the Camshaft node. Those metadata UI calls already include column naming information, so we should be OK.


rochoa commented 7 years ago

After discussing the problem with @xavijam, @nobuti, and @jorgesancha, and given the already mentioned limitation in Camshaft, the frontend will take care of sending both the different measurements and the column names, as proposed in https://gist.github.com/talos/554c7759bafe53faafd23a9711305220#answering-questions-from-issue-11663.

However, the column name generation should happen and be returned from the call to OBS_GetMeta, so instead of delegating the column name composition to the client the API will be responsible for that. @talos, apart from the current fields the response should contain a suggested_column_name, or similar, field. Do I need to create an issue for you to handle that new feature?

On Wednesday, I will work with @nobuti to see if we can improve the situation with the number of requests. Meanwhile, I will create a new Camshaft analysis to be able to use multiple measurements/columns.

talos commented 7 years ago

Sounds good. Yes, please make a ticket to add a suggested_column_name from OBS_GetMeta.

My gentle suggestion for simplest implementation of camshaft node would be that it takes a JSON argument, meta, which is directly from the OBS_GetMeta called from the UI.


rochoa commented 7 years ago

@talos, can you explain a bit more about the meta argument? An example would be awesome.

talos commented 7 years ago

Yes! Will do that this morning.

talos commented 7 years ago

@rochoa OK, I think this explains it pretty well:

https://gist.github.com/talos/97a0f1ac1bb101cd4533dd5d892bc4a1

I haven't tested the above as some key components like suggested_colname are missing, and I'm not sure if the Node.PARAM.ARRAY() type handles arbitrary JSON elements so easily. But the theory is all there.

When the user presses "apply" in the UI, a call to OBS_GetMeta with all their selected inputs should be run in the browser. The result of that is a JSON array, which can be passed to the camshaft node above, which handles the OBS_GetData call, column naming, and joining back to the source dataset.
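A sketch of that two-step flow. The geomval-array form of OBS_GetData shown here is an assumption about the extension's signature, the ids and table name are illustrative, and the meta placeholder stands for the JSON returned by step 1:

```sql
-- Step 1 (run from the UI on "apply"): fetch completed metadata.
SELECT OBS_GetMeta(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM my_dataset), 4326),
  '[{"numer_id": "us.census.acs.B01003001"}]'
) AS meta;

-- Step 2 (inside the camshaft node): pass the geometries plus that meta blob.
SELECT id, data
FROM OBS_GetData(
  (SELECT array_agg((the_geom, cartodb_id)::geomval) FROM my_dataset),
  '<meta JSON from step 1>'::json
);
```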

rochoa commented 7 years ago

The problem I see with that approach is that it's not clear to the end user what they need to send, that is, which part is relevant. Furthermore, Camshaft discourages the usage of complex params because that would expose a complex API to the end user.

However, I like your idea of having pairs like `[['us.census.acs.B01003001', 'colname1'], ['us.census.acs.B01003001', 'colname2']]`.

I'll CC you in the CR, ok?

talos commented 7 years ago

OK -- because GetData requires the output of GetMeta to work, and there's no way to dynamically construct a query with the results of GetMeta without making two queries, I see no way to avoid this. But I'm all ears if you figure something else out.

nobuti commented 7 years ago

Hi @talos, I'm working with OBS_GetMeta, and I have a few questions:

talos commented 7 years ago

From the docs:

normalization: The desired normalization. One of ‘area’, ‘prenormalized’, or ‘denominated’. ‘Area’ will normalize the measure per square kilometer, ‘prenormalized’ will return the original value, and ‘denominated’ will normalize by a denominator. Ignored if this metadata object specifies a geometry.

Unfortunately, the docs are currently missing an example of numer_timespan in action with OBS_GetMeta. Since the purpose of the argument is to specify a timespan, you should pass one in. From the above, it looks like you passed in another measure ID. Try this:

SELECT OBS_GetMeta(
  (SELECT ST_SetSRID(ST_Extent(the_geom), 4326) FROM (SELECT * FROM cdb.walmart_copy) q),
  '[{"numer_id": "us.census.acs.B03002002", "numer_timespan": "2011 - 2015"}]'
)

Let me know how that works, @nobuti .

talos commented 7 years ago

Also, RE denominated -- if no denom_id is specified, OBS_GetMeta will choose (and return that choice) essentially at random.
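To pin the denominator down instead of letting OBS_GetMeta pick one, the denom_id can presumably be passed explicitly in the same params array; the ids and table name below are illustrative:

```sql
SELECT OBS_GetMeta(
  ST_SetSRID((SELECT ST_Extent(the_geom) FROM my_dataset), 4326),
  '[{"numer_id": "us.census.acs.B03002002",
     "denom_id": "us.census.acs.B01003001",
     "normalization": "denominated"}]'
);
```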

xavijam commented 7 years ago

Closing because it is already working, though we are still polishing it 🎉.