UrbanCCD-UChicago / plenario2

The Premier Open Data Discovery Platform
http://plenar.io/

Create a new DisplayField Schema #264

Closed vforgione closed 6 years ago

vforgione commented 6 years ago

We should add another schema that holds information about which fields should be displayed in search results.

In most cases, this is similar to the virtual fields and constraints: give the owner a list of the fields attached to the meta. Where this gets tricky is with jsonb fields. For these, we need to inspect the unique paths within the JSON and provide them as well.

For example, if the data set has fields id, lat, lon, timestamp and observations, where observations is a JSON field with the following structure:

{
  "sensor 1": {
    "temperature": "whatever"
  },
  "sensor 2": {
    "temperature": "whatever",
    "humidity": "whatever
  }
}

we should be able to display the plain fields (id, lat, lon, timestamp) as well as the unique paths within observations:

observations."sensor 1".temperature
observations."sensor 2".temperature
observations."sensor 2".humidity
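Just to make the shape concrete, here is a rough sketch of what the backing table for the new schema could look like (the table and column names are illustrative, not a final design):

create table display_fields (
  id        bigserial primary key,
  -- each meta can have many display fields
  meta_id   bigint not null references metas (id),
  -- the regular data set field this display field points at
  field_id  bigint references data_set_fields (id),
  -- for jsonb fields, the dotted path into the document,
  -- e.g. 'observations."sensor 2".humidity'; null for plain fields
  json_path text
);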

HeyZoos commented 6 years ago

When you say display, do you mean API results or front end results? Would this schema be used to return JSON objects that look like:

{
    "observations.sensor1.temperature": 1,
    "observations.sensor2.temperature": 2,
    "observations.sensor2.humidity": 3
}
vforgione commented 6 years ago

Front end results: search, specifically. The way graphs are drawn right now in the general explorer app is that I grab the first few numeric fields of a data set and graph them out, regardless of whether they are significant to the actual data set. It's a very naive approach.

In the case of AoT, or any other data set that has embedded JSON, I use a dummy set of fields and values to draw the graphs.

What I want to be able to do is give data set owners the ability to highlight significant fields in their data sets to be used in displaying search results.

HeyZoos commented 6 years ago

So a DisplayField should give me the information I need to dive down into a json field and extract a value?

HeyZoos commented 6 years ago

I think it's a little overkill, especially if it's just for displaying search results. I could see providing those values through a simple query with json_object_keys and the json operators postgres provides.
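Something like this is what I have in mind (the table name and paths here are hypothetical):

-- list the top-level keys of the documents
select distinct key
from "ds_example", jsonb_object_keys(observations) as key;

-- and pull values out with the jsonb operators
select
  observations -> 'sensor 2' ->> 'temperature' as sensor_2_temperature,
  observations -> 'sensor 2' ->> 'humidity'    as sensor_2_humidity
from "ds_example";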

vforgione commented 6 years ago

Kind of. I'm thinking of it like this:

fields = DisplayFieldActions.list(for_meta: meta)
query = """
select
  <%= for field <- fields do %>
  avg(<%= field %>),
  <% end %>
  ...
"""

or something less crappy. You get the idea though.

Edit: ideally, we would use an ecto query and the model for the data set from the registry.

vforgione commented 6 years ago

The thing is, I have no way of limiting the fields displayed in search results without some arbitrary cutoff. Especially when it comes to jsonb fields, doing a full introspection and then graphing those values not only looks like garbage on the front end, but can also be dangerously expensive for the DB and the web server.

This is a simple solution to that.

HeyZoos commented 6 years ago

Do you mean limiting the number of fields to display? I feel like the cutoff is still arbitrary with the new schema. I also don't see how it's expensive for the database. We assume that the json values all have the same "shape", so it's a matter of grabbing the first row, deriving the structure from that, and using that information in your subsequent queries.

Regarding the snippet you posted, you'd grab the fields value with a postgres query that maps the key structure of the first row. It's the same number of trips to the database, but without a new schema that introduces 1:M (Meta to DisplayField) and 1:1 (DisplayField to DataSetField) relationships.
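A sketch of what I mean by deriving the structure from the first row (table name is hypothetical):

-- walk the top-level keys of the first row's observations document
select t.key, jsonb_typeof(t.value) as type
from (select observations from "ds_example" limit 1) as first_row,
     jsonb_each(first_row.observations) as t(key, value);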

vforgione commented 6 years ago

Here are some counterexamples. In the case of Beach Lab Culture Tests we have the following fields:

 Name                       | Type
----------------------------+-------------
 Beach                      | text
 Culture Note               | text
 Culture Reading Mean       | float
 Culture Sample 1 Reading   | float
 Culture Sample 1 Timestamp | timestamptz
 Culture Sample 2 Reading   | float
 Culture Sample 2 Timestamp | timestamptz
 Culture Sample Interval    | integer
 Culture Test ID            | integer
 DNA Reading Mean           | float
 DNA Sample 1 Reading       | float
 DNA Sample 2 Reading       | float
 DNA Sample Timestamp       | timestamptz
 DNA Test ID                | integer
 Latitude                   | float
 Location                   | text
 Longitude                  | float

These fields include DNA testing data because the source of the data didn't exclude them from the results, for whatever reason. If we were to display every numeric field in here, it would 1) be crowded given the number of fields and 2) include fields that have little to no relevance to the core of the data set.

In the case of Array of Things data, not every observation sub-document contains identical keys. It's true that most nodes have a relatively stable set of sensors, but as the builds continue, boards will be added and removed. Additionally, on the back end of AoT they scrub certain malformed data (such as temperatures that read 98 bajillion degrees). If we were to naively grab the first row, we could not guarantee that it is an accurate representation of the data set. On top of that, jsonb_object_keys only gets the top-level keys:

 node_id |       array_agg        
---------+------------------------
 034     | {BMP180,HTU21D,TSYS01}
 056     | {BMP180,HTU21D,TSYS01}
 032     | {BMP180,HTU21D,TSYS01}
 048     | {BMP180,HTU21D,TSYS01}
 08D     | {BMP180,HTU21D,TSYS01}
 091     | {BMP180,HTU21D,TSYS01}
 03D     | {HTU21D,TSYS01}
 05D     | {BMP180,HTU21D,TSYS01}
 053     | {BMP180,HTU21D,TSYS01}
 086     | {BMP180,HTU21D,TSYS01}
 06B     | {BMP180,HTU21D,TSYS01}
 037     | {BMP180,HTU21D,TSYS01}
 00D     | {BMP180,HTU21D,TSYS01}
...

Note that this is an aggregate of all the records. For most of these nodes, only two sensors were reporting in for quite a while; only recently did a third board start reporting, and there are about a dozen others still to come.
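For reference, a query along these lines produces that kind of aggregate:

select node_id, array_agg(distinct key)
from "ds_no3g4DGcITlkTTYu",
     jsonb_object_keys(observations) as key
group by node_id;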

So, to me, having a predetermined, limited set of fields that we pull from to draw graphs for search results greatly simplifies avoiding these pitfalls.

vforgione commented 6 years ago

Idea to surface key paths from embedded JSONB fields:

create or replace function jsonb_key_paths(_value jsonb)
  returns text[]
  language sql as
$$
-- recursively walk the document, accumulating dotted key paths as we descend
with recursive _struct (key, value) as (

  -- base case: the top-level keys and values
  select
    t.key,
    t.value

  from
    jsonb_each(_value) as t

  union all

  -- recursive case: descend into object values, appending the child key to
  -- the parent path; non-object values produce no further rows
  select
    concat(_struct.key, '.', t.key),
    t.value

  from
    _struct,
    jsonb_each(
      case
        when jsonb_typeof(_struct.value) <> 'object'
        then '{}'::jsonb
        else _struct.value
      end
    ) as t

)

-- keep only the paths that terminate at scalar values
select array(
  select distinct key
  from _struct
  where jsonb_typeof(_struct.value) not in ('array', 'object')
  order by key
)
$$;

In the case of AoT Chicago, we have a data set table with an observations field containing JSON documents of non-deterministic depth and breadth. We want to be able to surface those embedded keys so that we can highlight them in graphs. Using a function like this one, we can show the data set owner a list of keys that exist in the data set, which they can use to build graphs.

For example, running the function over the data set table:

select node_id, timestamp, observations, jsonb_key_paths(observations) from "ds_no3g4DGcITlkTTYu" limit 10;

 node_id |       timestamp        |                                                              observations                                                              |                              jsonb_key_paths
---------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------
 02F     | 2018-03-16 09:19:17-05 | {"BMP180": {"temperature": -1.65}, "HTU21D": {"humidity": 71.94, "temperature": -0.78}, "TSYS01": {"temperature": -1.1}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:19:40-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.52, "temperature": -0.72}, "TSYS01": {"temperature": -1.04}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:20:03-05 | {"BMP180": {"temperature": -1.7000000000000002}, "HTU21D": {"humidity": 71.33, "temperature": -0.79}, "TSYS01": {"temperature": -1.1}} | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:20:29-05 | {"BMP180": {"temperature": -1.7000000000000002}, "HTU21D": {"humidity": 71.29, "temperature": -0.8}, "TSYS01": {"temperature": -1.15}} | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:20:52-05 | {"BMP180": {"temperature": -1.65}, "HTU21D": {"humidity": 72.51, "temperature": -0.72}, "TSYS01": {"temperature": -1.07}}              | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:21:15-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.33, "temperature": -0.75}, "TSYS01": {"temperature": -1.09}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:21:39-05 | {"BMP180": {"temperature": -1.65}, "HTU21D": {"humidity": 72.3, "temperature": -0.74}, "TSYS01": {"temperature": -1.1}}                | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:22:02-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.8, "temperature": -0.72}, "TSYS01": {"temperature": -1.08}}                | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:22:26-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 73.07, "temperature": -0.67}, "TSYS01": {"temperature": -1.02}}               | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
 02F     | 2018-03-16 09:22:50-05 | {"BMP180": {"temperature": -1.6}, "HTU21D": {"humidity": 72.77, "temperature": -0.6900000000000001}, "TSYS01": {"temperature": -1.05}} | {BMP180.temperature,HTU21D.humidity,HTU21D.temperature,TSYS01.temperature}
(10 rows)

which we can further aggregate to get a master list of keys for a given data set.
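For example, rolling the per-row paths up into a single master list could look something like this:

select array_agg(distinct key order by key) as all_keys
from "ds_no3g4DGcITlkTTYu",
     unnest(jsonb_key_paths(observations)) as key;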

Or, a simpler example:

plenario_dev=# SELECT jsonb_key_paths('{"foo":{"bar":"baz","boom":"bang","bleh":{"barf":"yack"}}, "top": "level"}');
           jsonb_key_paths            
--------------------------------------
 {foo.bar,foo.bleh.barf,foo.boom,top}
(1 row)