Should resources in a document be "by-reference" or "by-value"?

strhea commented 1 year ago

I posit that the operations data would be easier to consume if each record were self-contained. Consider how this data is likely to be moved around.

The easy case: I want to pull a single operation from a partner system. What does that look like? Currently, it is an ApplicationDataModel object containing a populated Catalog object AND a populated Documents object. For a JSON response that's probably reasonable, though I wonder how clean and user-friendly it is to have to dig down through container objects. Still, I can take that apart with zero ambiguity in one bite (plus fetching the referenced geoparquet file). If I've chosen to serialize the ADM object to a file, then maybe I have a zip that contains the adm.json and the geoparquet. That is also pretty clean to handle, and the processing is easy to scale by handling multiple file pairs in parallel, right?
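To make the easy case concrete, here is a minimal sketch of taking one such zip apart. The JSON key names (`catalog`, `documents`, `workRecords`, `fieldId`, `operationFile`) and the member file names are hypothetical, chosen only for illustration; they are not taken from the ADAPT Standard schema, and the payload is placeholder bytes rather than real geoparquet.

```python
import io
import json
import zipfile


def build_example_zip() -> bytes:
    """Build an in-memory zip holding a hypothetical adm.json plus one payload file."""
    adm = {
        "catalog": {
            "growers": [{"id": "grower-1", "name": "Example Grower"}],
            "farms": [{"id": "farm-1", "growerId": "grower-1", "name": "Home Farm"}],
            "fields": [{"id": "field-1", "farmId": "farm-1", "name": "North 40"}],
        },
        "documents": {
            "workRecords": [
                {"id": "wr-1", "fieldId": "field-1", "operationFile": "wr-1.parquet"}
            ]
        },
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("adm.json", json.dumps(adm))
        # Placeholder bytes stand in for the real geoparquet content.
        archive.writestr("wr-1.parquet", b"placeholder bytes, not real parquet")
    return buffer.getvalue()


def read_operation(zip_bytes: bytes) -> dict:
    """Resolve the single work record against the catalog and locate its payload."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        adm = json.loads(archive.read("adm.json"))
        # Index catalog fields by id so the record's reference can be resolved.
        fields = {field["id"]: field for field in adm["catalog"]["fields"]}
        record = adm["documents"]["workRecords"][0]
        return {
            "record_id": record["id"],
            "field_name": fields[record["fieldId"]]["name"],
            # getinfo() reads only the zip directory entry, not the payload itself.
            "payload_size": archive.getinfo(record["operationFile"]).file_size,
        }


result = read_operation(build_example_zip())
print(result)
```

Because each zip is independent, a caller could hand many of them to a worker pool without any shared state, which is the scaling property described above.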

The complicated case: I want to pull 1..n seasons of operation data from a partner system in a single object. What does that look like? More or less the same: an ApplicationDataModel object containing a populated Catalog object AND a populated Documents object. But there are some potential concerns:

- A single JSON representation of this data getting too big
- Having to manage multiple "versions" of the same resource that changed over seasons (fields moved from one farm to another, names changed, fields split, etc.)

If we're talking about an API JSON response, there is going to be some practical limit to the size of the returned ADM object. Yes, there are ways around it. Yes, you could enforce pagination or return the data in fixed segments. But isn't that added complexity on both sides? If I've chosen to serialize the ADM object to a file, then maybe I have a zip that contains a BIG adm.json and all the geoparquet files. Now I have to load the whole ADM object into memory, right? But that was one of the issues with the ADAPT Toolkit: we had to play games with lazy loading to get around having a large memory footprint. This is another reason I really like pointing to external files for the heavy vector and raster data.
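As a rough illustration of why external files help with the memory footprint, the sketch below walks a multi-record zip and yields one payload at a time, so only the (comparatively small) adm.json and a single payload are held at once. The key names and file names are again hypothetical, and the payloads are dummy bytes. The example keeps the whole zip in memory only so that it is self-contained; with a file on disk, `zipfile.ZipFile` reads each member on demand.

```python
import io
import json
import zipfile
from typing import Iterator, Tuple


def make_season_zip(record_count: int) -> bytes:
    """Build an in-memory zip with a hypothetical adm.json and one payload per record."""
    adm = {
        "documents": {
            "workRecords": [
                {"id": f"wr-{i}", "operationFile": f"wr-{i}.parquet"}
                for i in range(record_count)
            ]
        }
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("adm.json", json.dumps(adm))
        for i in range(record_count):
            # Dummy bytes stand in for heavy vector or raster data.
            archive.writestr(f"wr-{i}.parquet", b"x" * 1000)
    return buffer.getvalue()


def iter_payloads(zip_bytes: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (record id, payload bytes) pairs, reading one payload per iteration."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        adm = json.loads(archive.read("adm.json"))
        for record in adm["documents"]["workRecords"]:
            # Each payload is decompressed only when the consumer asks for it.
            yield record["id"], archive.read(record["operationFile"])


sizes = {record_id: len(payload) for record_id, payload in iter_payloads(make_season_zip(3))}
print(sizes)
```

This does nothing about the size of adm.json itself, which is the first concern in the list above; it only shows that the heavy data need not be loaded alongside it.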

Maybe I'm off-base here and this is again dragging too many serialization concerns into the model. I do think it's a good idea for us to question the "data card" perspective of the ADAPT Toolkit, especially since the OEMs don't seem to be focused on sharing data in that way anymore.

Thoughts? @knelson-farmbeltnorth @zwing99 @crutt

knelson-farmbeltnorth commented 1 year ago

We discussed this item at length in today's serialization call as well as in last week's standard call. I like the theory that ADAPT should be flexible in a way that doesn't require every dataset to contain an ADM root, catalog, documents, etc., but I don't think we've yet found a way to model things consistently without doing it that way. If I want to send one WorkRecord and its associated Grower, Farm and Field, it would be easy enough to embed those resources in the record (by value). If I want to send a dozen, however, now I need to send multiple copies of the same resources in the transfer. If I include things such as field boundaries and guidance patterns, the duplication is non-trivial.
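The duplication cost can be sketched with a toy comparison of the two layouts. Everything here is an assumption made for illustration: the key names, the 200-point stand-in for a field boundary, and the use of serialized JSON length as the size measure. None of it is taken from the ADAPT Standard schema.

```python
import json

# Hypothetical shared resources; the boundary is a synthetic stand-in for real geometry.
grower = {"id": "grower-1", "name": "Example Grower"}
farm = {"id": "farm-1", "growerId": "grower-1", "name": "Home Farm"}
field = {
    "id": "field-1",
    "farmId": "farm-1",
    "name": "North 40",
    "boundary": [[x, x + 1] for x in range(200)],
}


def by_value(record_count: int) -> list:
    """Every work record embeds its own copy of grower, farm and field."""
    return [
        {"id": f"wr-{i}", "grower": grower, "farm": farm, "field": field}
        for i in range(record_count)
    ]


def by_reference(record_count: int) -> dict:
    """Shared resources appear once in a catalog; records point at them by id."""
    return {
        "catalog": {"growers": [grower], "farms": [farm], "fields": [field]},
        "documents": {
            "workRecords": [
                {"id": f"wr-{i}", "fieldId": field["id"]} for i in range(record_count)
            ]
        },
    }


one_value = len(json.dumps(by_value(1)))
one_ref = len(json.dumps(by_reference(1)))
dozen_value = len(json.dumps(by_value(12)))
dozen_ref = len(json.dumps(by_reference(12)))

print(f"1 record:   by-value={one_value} chars, by-reference={one_ref} chars")
print(f"12 records: by-value={dozen_value} chars, by-reference={dozen_ref} chars")
```

With one record the two layouts are about the same size. With a dozen, the by-value form grows roughly linearly with the boundary size, while the by-reference form adds only a short record per entry.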

knelson-farmbeltnorth commented 11 months ago

Per discussion with @strhea, there is general agreement to continue down the path of using references in the Catalog for the foreseeable future.

ADAPT / Standard

Should resources in a document be "by-reference" or "by-value"? #121