LLNL / Sina

Store and query simulation (meta)data to/from various backends using friendly Python
MIT License

Nested/Dictionary data values #6

Open justinlaughlin opened 7 months ago

justinlaughlin commented 7 months ago

Hello,

It seems like it is possible to create a Record with a dictionary as its data value, but not possible to insert that Record into a DataStore. Here is a minimal example:

from sina.model import Record
from sina.datastore import create_datastore

rec = Record('a', 'A')
rec.data_values['field'] = {'b': 5, 'c': 6}

ds = create_datastore('test.sql')
ds.records.insert(rec)

returns

ValueError: ['At least one data entry belonging to Record a has a dictionary for a value.Value: field']

I figured this would not be allowed as each data can also have a units field, but I have heard that there is some sort of option to use nested data. If there is a way to do so I would love to know more. Thanks.

justinlaughlin commented 7 months ago

I am composing DataStore into my own class to add in nested data functionality atm, but if there is a more natural way to do this that'd be preferred.

justinlaughlin commented 7 months ago

Pinging @HaluskaR since this is related to your L2 code.

HaluskaR commented 7 months ago

Hey there @justinlaughlin! I can/should add some validation for the data adders, I'll get a ticket up on my end.

On the topic of nested data: the limitations are due to the nature of the field. data holds values that are queryable, so they need to follow a structure the datastore knows how to query. Hierarchical is fine, arbitrarily hierarchical is where we run into issues (what does each layer of the hierarchy represent, how does the user "intend" to query across children and children's children, etc), so what you'll want to do depends on your needs:

1) we can make it non-arbitrary by formalizing the structure so the datastore can be smart about it. That's things like curve_sets and library_data, where there was a recurring form of data between codes that people would want to query in a "hierarchical" way. Of course that one usually takes the longest since it's an addition to the underlying schema, and we'd want to make sure it's workable for a lot of codes/uses (currently undergoing this process for materials), but that's only if you want it on Sina's end--if you're expanding on Sina for a specific use, I'm happy to run you through how to set something like that up.

2) make it "non-hierarchical" (ex: coerce my { cool { data: 12 to be my/cool/data: 12). This loses some implied flexibility (and makes keys more annoying to type) but makes the data immediately available. In my experience, this one is handy when the hierarchy is more a side effect of the structure/code output than representative of how you'd want to query things, i.e. where you're always going to the leaf of the tree.

3) make it non-queryable. Records have a user_defined section for storing any legal JSON (or anything you're willing to stringblob into JSON) that we won't try to index/validate/etc, simply storing and returning it exactly as given. This is probably the one you heard of, it's the most commonly useful, since a lot of nested data we've encountered are things users only want in context of the run itself.
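For option 2, the coercion can be sketched in plain Python. This is a minimal illustration, not part of Sina's API: the name `flatten` and its `delimiter` and `prefix` parameters are hypothetical, and the "/" delimiter simply mirrors the my/cool/data example above.

```python
def flatten(nested, delimiter="/", prefix=""):
    """Collapse nested dicts into one level with delimited keys (hypothetical helper)."""
    flat = {}
    for key, value in nested.items():
        # Build the full path to this key, e.g. "my/cool/data".
        path = f"{prefix}{delimiter}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Keep descending until a non-dict leaf is reached.
            flat.update(flatten(value, delimiter, path))
        else:
            # Leaves (scalars, lists, strings) are stored unchanged.
            flat[path] = value
    return flat


nested = {"my": {"cool": {"data": 12}}, "other": 3}
flat = flatten(nested)
print(flat)  # {'my/cool/data': 12, 'other': 3}
```

Note that in this sketch an empty nested dict contributes no keys, and keys that already contain the delimiter become ambiguous, so the delimiter has to be one the key names never use.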

Thanks for the catch on adding data!

justinlaughlin commented 7 months ago

Hi, thanks for the super detailed reply! The way I've structured the data, your second suggestion makes the most sense, as it is only the "leaf nodes" that are being accessed. The hierarchies serve no purpose other than to organize data. It seems like user_defined or library_data could also have filled the same role, but in this case the data is not actually user-defined/library data, so I decided to use a "delimiter convention" and just store it in data. E.g.

registration.user = ...
registration.date = ...
registration.time = ...

I have a few helper functions that lift/flatten between the nested dictionary and the "delimited key" dictionary. Of course this option requires a common agreement on what is being used as a delimiter, but since this is at the implementation level and not really exposed to the user, it's not a huge worry. I think Alex had mentioned that a /-delimited convention may already be in use for data, but I couldn't get it to work. It would be nice to be "officially" conforming, but this seems like a good enough solution.
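As a rough idea of the "lift" direction, here is a minimal sketch in plain Python. It is not taken from Sina and is not the actual helper; the name `lift`, the `delimiter` parameter, and the example values are all hypothetical and only illustrate the registration.* convention above.

```python
def lift(flat, delimiter="."):
    """Rebuild a nested dict from delimited keys (hypothetical helper)."""
    nested = {}
    for key, value in flat.items():
        # "registration.user" -> parents ["registration"], leaf "user"
        *parents, leaf = key.split(delimiter)
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                # e.g. both "a" and "a.b" are present: "a" cannot be a leaf and a branch
                raise ValueError(f"key {key!r} collides with an existing leaf")
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"key {key!r} collides with an existing branch")
        node[leaf] = value
    return nested


# Placeholder values, for illustration only.
flat = {
    "registration.user": "someuser",
    "registration.date": "2024-01-01",
    "registration.time": "12:00",
}
nested = lift(flat)
print(nested["registration"]["user"])  # someuser
```

The collision checks matter because a delimited-key dictionary can encode things a nested dictionary cannot, such as a key that is both a value and a parent.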

If this is a common scenario then maybe something like these lift and flatten functions could be added to sina.util? Or if data is entered as nested dictionaries then maybe Sina could keep descending until it reaches the "leaf"? I don't think you could mix different types of hierarchies with this solution (e.g. list[dict, dict]) but you could at least have a nested dictionary (e.g. dict[dict[list], int]) where the lowest-level values are non-dict values.

justinlaughlin commented 7 months ago

After browsing the source code a bit I think this flattening already exists and is applied automatically to library data.

sina/model/flatten_library_content