brianfoshee / aquaponics-data


Update Database Schema #9

Closed · nathanprayzo closed 9 years ago

nathanprayzo commented 9 years ago

I think we should move forward with the schema you outlined previously:

-- Add a table for devices. Would make it easier in the long term to add other attributes
-- to a 'device' such as user info, rpi info, firmware info etc.
CREATE TABLE device (
  id uuid, -- This is for internal use as a primary key
  identifier character varying, -- no performance hit for unbounded character type; see https://github.com/rails/rails/pull/9153
  updated_at timestamp,
  created_at timestamp
);

CREATE TABLE reading (
  device_id uuid,
  readings json,
  first_reading timestamp, -- This would be for grouping readings into daily/weekly/monthly
  last_reading timestamp,  -- depending on how many readings we want to keep per json document
  created_at timestamp
);
brianfoshee commented 9 years ago

I think having a column for each device would add a lot of overhead when adding devices, and it wouldn't scale well. I think sticking with something similar to the following would be alright, coupled with a postgres function similar to this one to append readings to the json field.

{
    "123": {
        "ph": 3,
        "tds": 120
    }
}

{
    "reading": {
        "created_at": 123,
        "sensor_data": {
            "ph": 4,
            "tds": 120,
            "temp": 23,
        }
    }
}

Update: Final schema

CREATE EXTENSION "uuid-ossp";
CREATE TABLE device (
  -- id is for internal use as a primary key
  id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
  -- Device generated identifier
  identifier character varying NOT NULL,
  -- timestamps that default to the current time in UTC on record creation
  updated_at timestamp without time zone NOT NULL DEFAULT (now() AT TIME ZONE 'UTC'),
  created_at timestamp without time zone NOT NULL DEFAULT (now() AT TIME ZONE 'UTC')
);

CREATE TABLE reading (
  device_id uuid NOT NULL REFERENCES device (id) ON DELETE CASCADE,
  readings json NOT NULL DEFAULT '{}'::json
);
CREATE INDEX reading_device_id ON reading (device_id);

-- SQL function to add a key to json field
CREATE OR REPLACE FUNCTION "json_object_set_key"(
  "json"          json,
  "key_to_set"    TEXT,
  "value_to_set"  anyelement
)
  RETURNS json
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT COALESCE(
  (SELECT ('{' || string_agg(to_json("key") || ':' || "value", ',') || '}')
     FROM (SELECT *
             FROM json_each("json")
            WHERE "key" <> "key_to_set"
            UNION ALL
           SELECT "key_to_set", to_json("value_to_set")) AS "fields"),
  '{}'
)::json
$function$;
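
A quick sanity check of what that function does (made-up values):

SELECT json_object_set_key('{"ph":3}'::json, 'tds', 120);
-- should return {"ph":3,"tds":120}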
brianfoshee commented 9 years ago

I think another option is postgres' HSTORE column type, which is a hash/key-value type. Most of the work on jsonb came out of improvements made to hstore for 9.4. The only limitation is that nested hashes aren't available on hstore until 9.4. It may be worth trying both types out and seeing what sort of flexibility each gives us, and there are always functions in postgres to convert between the two as well. Since postgres 9.4 isn't in RC yet (but I imagine it will be within the next month) we may need to run it on a test server. Heroku is usually good about adding support for the latest versions once they're out of beta though.

Here's the blog of the people who do most of the hstore and json work for postgres: http://obartunov.livejournal.com/175235.html

Edit: My point in suggesting hstore is that I believe it natively supports updating a column with new key/value pairs.
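
For example, something like this is what I have in mind (rough sketch, untested; assumes CREATE EXTENSION hstore and a readings hstore column, with the sensor json stored as a flat string since we'd be pre-9.4):

UPDATE reading
SET readings = readings || hstore('2014-11-05T21:00:00Z', '{"ph":4,"tds":121}')
WHERE device_id = (
  SELECT id FROM device WHERE identifier = 'ABC123'
);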

brianfoshee commented 9 years ago

@letsgitgrowing check out this issue on the postgres driver we're using; it has an example of a custom JSON encoder/decoder: https://github.com/lib/pq/issues/230

I think we'll need to add a MarshalJSON method to the MyTime type to go from a database value to serialized JSON.
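
Something along these lines is what I'm picturing (a sketch; it assumes MyTime wraps time.Time, which may not match what we actually have):

package models

import "time"

// MyTime is assumed here to embed time.Time.
type MyTime struct {
    time.Time
}

// MarshalJSON implements json.Marshaler so a MyTime serializes as an
// RFC 3339 UTC string, matching the timestamp keys in the readings json.
func (t MyTime) MarshalJSON() ([]byte, error) {
    return []byte(`"` + t.UTC().Format(time.RFC3339) + `"`), nil
}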

brianfoshee commented 9 years ago

@letsgitgrowing this is an interesting project for managing database migrations and schema. Rails and other frameworks typically have a mechanism for managing these kinds of things, and this seems to be a similar approach.

I haven't looked into it much yet, but if it has hooks that we can call from our code to set up and tear down the database, then we could replace our create table and drop table calls in the database tests.

https://bitbucket.org/liamstask/goose
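
If I'm reading the goose README right, migrations are timestamped SQL files annotated with Up/Down markers, so our test setup and teardown SQL could live in something like this (file name and statements illustrative):

-- db/migrations/20141101000000_create_device_and_reading.sql
-- +goose Up
CREATE TABLE device ( ... );
CREATE TABLE reading ( ... );

-- +goose Down
DROP TABLE reading;
DROP TABLE device;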

brianfoshee commented 9 years ago

@letsgitgrowing goose is ready to go, at least in development. I'm ready to push it to production to try it out; is the data in production necessary, or can I drop the tables? I'd like to test it out on heroku before I merge #21.

nathanprayzo commented 9 years ago

@crakalakin Regarding the schema for table 'reading', shouldn't the device_id column be a foreign key to device.identifier instead of to device.id?

CREATE TABLE reading (
  device_id uuid NOT NULL REFERENCES device (identifier) ON DELETE CASCADE,
  readings json NOT NULL DEFAULT '{}'::json
);

nathanprayzo commented 9 years ago

^ Disregard this comment

nathanprayzo commented 9 years ago

@crakalakin I have almost finished all of the sub-tasks for this issue but ran into a little snag modifying the handler functions.

The monitoring system is only going to be aware of its device.identifier, but in order for its SensorData to be added to the database, the application needs to know the monitoring device's id. As such, wouldn't the application need to perform a device.identifier -> device.id lookup each time a new reading is added to the database? That seems really inefficient.

Am I missing something?

brianfoshee commented 9 years ago

@letsgitgrowing the way I envisioned it was passing the device identifier through to the model, and then the database can look up the id on update:

UPDATE reading
SET readings = json_object_set_key(readings, '2014-11-05T21:00:00Z', '{"ph":4, "tds":121}'::json)
WHERE device_id = ( -- get device_id using the device.identifier
  SELECT id
  FROM device
  WHERE identifier = 'ABC123'
);

I thought about using the actual device-generated identifier as the primary key for that table, but I'd imagine the index on it would be pretty large since it's a long string.

See this comment on the PR for more.

nathanprayzo commented 9 years ago

@crakalakin Alright, I'll implement it that way; we can work on optimizing after the system is functional and we have reliable benchmarks.

Thanks!

brianfoshee commented 9 years ago

Sounds good!

brianfoshee commented 9 years ago

@letsgitgrowing try this. The tests pass for me when I use it:

    // Marshal the sensor data into JSON bytes to pass to json_object_set_key.
    b, err := json.Marshal(r.SensorData)
    if err != nil {
        return err
    }

    // t is the reading's timestamp key; the subselect resolves the
    // device-generated identifier to the internal device id.
    _, err = m.db.Exec(`
    UPDATE reading
    SET readings = json_object_set_key(readings, $1, $2::json)
    WHERE device_id = (
        SELECT id
        FROM device
        WHERE identifier = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a22'
    )`, t, b)
brianfoshee commented 9 years ago

@letsgitgrowing here's my first stab at getting json from the database into a struct:


func TestPostgresAddReading(t *testing.T) {
 //...
    var s []byte
    err = manager.db.QueryRow(`
    select to_json(readings->>$1)
    from reading
    where device_id = (
        select id
        from device
        where identifier = 'ABC123'
    )`, now).Scan(&s)

    switch {
    case err == sql.ErrNoRows:
        t.Error("Error no rows in table reading")
    case err != nil:
        t.Error("Error", err)
    }

    var data interface{} // still working on making this a models.SensorData type
    decoder := json.NewDecoder(bytes.NewBuffer(s))
    if err = decoder.Decode(&data); err != nil {
        t.Error("Error decoding JSON", err)
    }
    fmt.Println(data) // note: readings->>$1 returns text, so data decodes to a string here rather than a map
//...
}

This prints out:

{"ph":6.8,"tds":120,"water_temperature":78}

Edit: removed uint conversion function in favor of []byte

brianfoshee commented 9 years ago

@letsgitgrowing also, this might come in handy:

select json_object_keys(readings)
from reading
where device_id = (
  select id
  from device
  where identifier = 'ABC123'
);

Returns:

   json_object_keys   
----------------------
 2014-11-05T21:00:00Z
 2014-12-01T23:00:00Z
(2 rows)
brianfoshee commented 9 years ago

@letsgitgrowing I found a solution to go from JSON in postgres to a models.SensorData struct. I'm sure that at some point we can figure out a way to reduce the two different json.Unmarshal calls into one, but for now it works.

    var s []byte
    err = manager.db.QueryRow(`
    select to_json(readings->>$1)
    from reading
    where device_id = (
        select id
        from device
        where identifier = 'ABC123'
    )`, now).Scan(&s)

    switch {
    case err == sql.ErrNoRows:
        t.Error("Error no rows in table reading")
    case err != nil:
        t.Error("Error", err)
    }

    // First unmarshal: unmarshal []byte into a string
    var d string
    err = json.Unmarshal(s, &d)
    if err != nil {
        fmt.Println(err)
        fmt.Println(string(s))
    }
    // Second unmarshal: unmarshal the inner JSON string into SensorData
    var bd models.SensorData
    if err := json.Unmarshal([]byte(d), &bd); err != nil {
        fmt.Println(err)
    }
    fmt.Println(bd)

Prints out:

{6.8 120 78}
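
The double unmarshal seems to come from ->> returning text, which to_json then re-wraps as a JSON string. If that's right, selecting with -> instead should hand back the object directly, and one pass would do it (untested sketch):

    var s []byte
    err = manager.db.QueryRow(`
    select readings->$1
    from reading
    where device_id = (
        select id
        from device
        where identifier = 'ABC123'
    )`, now).Scan(&s)
    if err != nil {
        t.Error("Error", err)
    }

    // Single pass: s is the json object itself, not a quoted string.
    var bd models.SensorData
    if err := json.Unmarshal(s, &bd); err != nil {
        t.Error("Error decoding JSON", err)
    }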
brianfoshee commented 9 years ago

@letsgitgrowing we now have some really great operators on jsonb. For instance, the test we have that checks whether we inserted readings by selecting the count of that key can be reduced to select readings ? '2014-11-05T21:00:00Z', where the ? operator returns true or false depending on whether the key is in the json. It also runs in 0.338ms.
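
From Go, that check could look something like this (sketch; assumes the column has already been switched to jsonb):

    // The ? operator returns a boolean: is this a top-level key in readings?
    var exists bool
    err := manager.db.QueryRow(`
    SELECT readings ? $1
    FROM reading
    WHERE device_id = (
        SELECT id FROM device WHERE identifier = 'ABC123'
    )`, "2014-11-05T21:00:00Z").Scan(&exists)
    if err != nil {
        t.Error("Error", err)
    }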

nathanprayzo commented 9 years ago

@crakalakin after reading more into jsonb, it appears the main advantages of storing json in its binary form are faster data manipulation and indexing by the database engine. As our current strategy is to do all of the data manipulation client-side, I'm not convinced refactoring to utilize jsonb is a high priority.

What are your thoughts?

Sources:

brianfoshee commented 9 years ago

@letsgitgrowing I'd agree to hold off on switching the column type until this branch is merged into master, but I think it'd still be a good idea to switch eventually (I believe this is all that's required). INSERTs may be a bit slower, but because the data is stored in a binary format it'd be faster to do something like pull all readings between two timestamps. I was also seeing much faster SELECT statements when using jsonb, so pulling the data out may be faster in general.
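
For reference, the switch itself is presumably just this (I haven't run it against our data yet):

ALTER TABLE reading
  ALTER COLUMN readings DROP DEFAULT,
  ALTER COLUMN readings TYPE jsonb USING readings::jsonb,
  ALTER COLUMN readings SET DEFAULT '{}'::jsonb;

The DROP/SET DEFAULT steps are there because postgres won't always cast a column default automatically.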