paul121 opened 4 years ago
Something else I realized regarding "cross referencing records": this could be accomplished with a custom query to the DB. Right now we are slightly limited by the RESTws API, but GraphQL might be an alternative solution down the road. Each "tracker" could provide a GraphQL query, run by farmOS.py, that returns the relational & computed fields from the DB.
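For illustration, the query a tracker ships could be as simple as a GraphQL document string that the aggregator hands to the client for each farm. (farmOS doesn't expose GraphQL today, and the schema fields here are invented, just a sketch of the idea:)

```python
# Hypothetical tracker-supplied GraphQL query. The field names are invented;
# the point is that the tracker file, not the aggregator, defines the shape
# of the relational/computed data it needs.
FIRST_PLANTING_QUERY = """
{
  logs(type: "seeding", status: "done") {
    id
    timestamp
    asset { id name }
    area { id name }
  }
}
"""
```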
Very interesting idea! I like the re-usability aspect of it.
Worth noting (@paul121 and I just discussed this yesterday, so this is more for other readers): farmOS has a concept of "metrics" that modules can provide, which include things like total field acreage, animal head counts, etc. I think those will be a good test case to try in the community aggregator.
I'm still a bit on the fence over whether this information should actually be cached in the aggregator database itself, though, or whether that decision should really be left up to the downstream system that's using the aggregator. Caching is an added layer of complexity that needs to be understood and supported with more helper code, and is guaranteed to cause frustration and confusion to someone (just the nature of caching haha).
If the reason for doing this is for performance, then I'd say it would be better to wait until performance is actually an issue before we add complexity. Or take the stance that it's a downstream decision. I think my preference would be to keep the Aggregator itself as thin as possible, but that might change/evolve as we have more real-world deployments and use-cases. Another possibility would be to maintain some standalone plugin libraries that downstream aggregator users could use to build these kinds of decisions into their own apps in standardized/reusable ways. So for example: the caching layer could be built in the community aggregator app, and perhaps done in a standalone library (python or node.js) that could be reused, if that makes sense.
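If a downstream caching layer did get built as a standalone library, it could start as small as a TTL wrapper around whatever fetch call the app already makes. A minimal sketch, assuming nothing about the Aggregator's actual code (the class and its API are hypothetical, not an existing library):

```python
import time

class TTLCache:
    """Hypothetical helper: cache the result of a fetch for ttl_seconds."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (fetched_at, value)

    def get_or_fetch(self, key, fetch):
        entry = self._store.get(key)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]  # still fresh, skip hitting the farmOS server
        value = fetch()
        self._store[key] = (time.time(), value)
        return value

# Usage in a downstream app (pull_metrics is a placeholder for whatever
# request the app already makes through the Aggregator):
# cache = TTLCache(ttl_seconds=3600)
# metrics = cache.get_or_fetch(farm_id, lambda: pull_metrics(farm_id))
```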
I've been wanting to brainstorm the "aggregating of records" piece for a while now. Discussing the crowdsource aggregator in farmOS/farmOS#206 got me thinking, so I took the chance and kept going...
Right now the aggregator allows us to push and pull records via the Aggregator API to multiple farmOS servers. This is great, but there seems to be a need to "track" certain types of records across multiple farms. There is also a need to "cache" a subset of the data from farmOS servers so that servers are not constantly supplying data. It would be great if we could do this in a reusable way that doesn't require custom modifications to the farmOS-Aggregator instance.
I'm proposing a way of creating reusable "trackers" for the aggregator. (Note: "Tracker" is the best term I could think of - I think it could be improved!)
As an example: if we want to aggregate "First Planting Dates", we could configure the aggregator backend to cache all seeding logs in the DB. This could be saved in a `tracking` table, with the name "first planting tracker". Then, in the UI, there could be a view generated for each `tracker` in the `tracking` table. The views could visualize this data in different ways (list the records, map geographically, graphs, averages, etc...) depending on the type of record. A different community might aggregate potato harvests in a similar way: they might configure the aggregator to cache all `harvest logs` for `potato crops`, then visualize harvest quantities, harvest dates, harvest photos, etc... A simple tracker could even just save the "Number of Compost Piles" or "Number of Animals" without any additional data.
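To make the table idea concrete, here's a minimal sketch of what a `tracking` table could look like (the SQLAlchemy model and all table/column names are assumptions for illustration, not the Aggregator's actual schema):

```python
# Hypothetical "tracking" table: one row per tracker per farm, with the
# cached records stored as JSONB. Names are illustrative only.
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Tracking(Base):
    __tablename__ = "tracking"

    id = Column(Integer, primary_key=True)
    tracker_name = Column(String, index=True)  # e.g. "first planting tracker"
    farm_id = Column(Integer, index=True)      # which farmOS server the data came from
    updated = Column(DateTime)                 # when this cache entry was last refreshed
    data = Column(JSONB)                       # the cached/computed records themselves
```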
I think `areas` and `assets` could be aggregated similarly. A rule could be added to cache all `field areas` to get total acreage, or even `greenhouse areas` to get the square footage of greenhouse space. Caching farm `assets` might provide animal head counts, number of crops grown, type/quantity of farm equipment, etc...

Another thought: I think a "tracker" could be configured to cross reference records. An example (similar to the produce quality study): track spinach crops & field history / growing practices. The tracker might cross reference a `planting asset` and the `field area` it was grown in: the rule would cache all `seeding logs` and `harvest logs` created for a `planting asset` of `crop == spinach`, and cache all `activity logs` and `input logs` for the field it was grown in. This could be saved in one custom object as a row in the DB table. (Good use case for JSONB? :D)
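For illustration, the JSONB blob for one spinach planting might look something like this (all keys and IDs here are invented):

```python
# Hypothetical shape of one cached cross-reference row for the spinach study.
spinach_tracker_row = {
    "planting_asset": {"id": 42, "crop": "spinach"},
    "field_area": {"id": 7, "name": "North Field"},
    "seeding_logs": [{"id": 101, "timestamp": 1589218000}],
    "harvest_logs": [{"id": 230, "timestamp": 1594218000}],
    "field_history": {
        "activity_logs": [{"id": 310, "name": "Cultivation"}],
        "input_logs": [{"id": 305, "name": "Compost application"}],
    },
}
```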
A cool thing about this approach: each "tracker" could be defined in its own Python file and imported into the aggregator on startup. This means a "tracker" could be shared with other aggregators just by sharing the Python file. The file would basically just define the "rules" that cache records from the farmOS server. Here, the aggregator cron job might call the `get_farm_data` method of the Tracker on a regular schedule, and save the returned data in the Tracker DB table with the `farm id`. I drafted a simple one for "Number of Compost Piles" and a more complicated one for the "Spinach Study" (only proof of concept!): https://gist.github.com/paul121/463fdd02deec767ce5a1374c3e17c303. A simple file for Planting Dates might look like the following:
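A rough sketch, assuming a `Tracker` base class and a `farm_client` wrapper that don't exist in the actual codebase (the farmOS.py call shown is simplified too, so treat this as illustration, not the gist contents):

```python
# trackers/first_planting.py
#
# Hypothetical sketch of a shareable tracker file. The Tracker base class,
# the keep_history flag, and the farm_client argument are assumptions.

class Tracker:
    """Minimal stand-in base class, just so this sketch is self-contained."""
    name = None
    keep_history = False  # keep only the most recent data set per farm


class FirstPlantingTracker(Tracker):
    name = "first planting tracker"

    def get_farm_data(self, farm_client):
        """Called by the aggregator cron job for each farmOS server.

        Returns {planting asset id: earliest seeding timestamp}, which the
        aggregator would save in the tracker DB table with the farm id.
        """
        logs = farm_client.log.get(filters={"type": "farm_seeding"})
        first_planting = {}
        for log in logs:
            timestamp = int(log["timestamp"])
            for ref in log.get("asset") or []:
                asset_id = ref.get("id")
                if asset_id and timestamp < first_planting.get(asset_id, float("inf")):
                    first_planting[asset_id] = timestamp
        return first_planting
```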
Each tracker could be configured to keep the most recent data set from a farmOS server, or track data over time.
Alternatively, the Aggregator could keep more general "cache" tables for `logs`, `areas` and `assets`. Records that are used in a "Tracker" could be saved to the Aggregator's general cache of all farm records, eliminating duplication of cached records. For this, instead of saving all the record data (like above), a "Tracker" would only save an object with IDs linking to the cached records.
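Under that scheme, a tracker row could shrink to just references, something like this (shapes and IDs invented for illustration):

```python
# Hypothetical: instead of embedding full records, the tracker row only
# stores IDs pointing at rows in the general logs/areas/assets cache tables.
spinach_tracker_row = {
    "farm_id": 3,
    "asset_ids": [42],                # cached `assets` rows
    "area_ids": [7],                  # cached `areas` rows
    "log_ids": [101, 230, 305, 310],  # cached `logs` rows
}
```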