diffpy / sampledb

A DB for sample and synthesis metadata information
http://www.diffpy.org/sampledb/

How to handle data with samples #19

Open CJ-Wright opened 6 years ago

CJ-Wright commented 6 years ago

Usually when we are taking data it gets put into a databroker. The databroker takes care of storing large data sets. However, we may get data with the samples (e.g. XRD patterns taken on a lab source) which is not in the databroker. How should we handle this?

The way I see it, there are two options (although there may be more):

  1. Side-load the data into the databroker (although we are missing tons of metadata, some of it critical (x-ray wavelength?))
  2. Put the data into filestore and hand the sample database the tokens. On retrieval we can open up the data (rough sketch below).
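
Roughly what I have in mind for option 2, as a sketch only; `save_array`/`load_array` and the document layout are made-up placeholders, not the actual filestore API:

```python
import uuid

import numpy as np

# Hypothetical stand-in for a filestore-style backend: it hands back an
# opaque token and can resolve that token back to the array later.
_STORE = {}

def save_array(arr):
    """Store a numerical array and return an opaque token for it."""
    token = str(uuid.uuid4())
    _STORE[token] = np.asarray(arr)
    return token

def load_array(token):
    """Resolve a token back to the stored array."""
    return _STORE[token]

# Stand-in for a parsed lab-source XRD pattern (two columns: q, I(q)).
xrd_pattern = np.random.random((1000, 2))

# The sample database only ever sees small, searchable metadata plus the
# token, never the array itself.
sample_doc = {
    "sample_name": "Ni_standard",
    "instrument": "lab diffractometer",
    "data": {"xrd_pattern": save_array(xrd_pattern)},
}

# On retrieval we open the data back up from the token.
pattern = load_array(sample_doc["data"]["xrd_pattern"])
```
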
sbillinge commented 6 years ago

This probably speaks to a more general discussion about what should be in a filestore and what should be in databrokers.

For my money this is less a philosophical issue and more a file-size issue. If the file size is greater than XXXX, it should go into filestore?
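
Something like this is what I mean by a size cutoff (a rough sketch; the threshold value and the `put_in_filestore`/`put_in_sampledb` hooks are hypothetical):

```python
SIZE_LIMIT_BYTES = 10 * 1024 ** 2  # placeholder for "XXXX"; tune as needed

def route_payload(payload_bytes, metadata, put_in_sampledb, put_in_filestore):
    """Keep small payloads inline with the metadata; push large ones to
    filestore and keep only a token in the sample document."""
    if len(payload_bytes) > SIZE_LIMIT_BYTES:
        metadata["data_token"] = put_in_filestore(payload_bytes)
    else:
        metadata["data"] = payload_bytes
    put_in_sampledb(metadata)
```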

It may also be a searchability issue though. We don't want to search through large files of data for metadata, but we don't want to lose large amounts of metadata into large datafiles that are not in databroker.

What are your thoughts? The file-size limit may be the simplest thing.

sbillinge commented 6 years ago

On a similar topic: wherever the x-ray data go, when we find out later what the x-ray wavelength was, how can we then associate it? A mutable databroker? Non-mutable, but with some kind of event stream that zips the info together?

CJ-Wright commented 6 years ago

I am talking strictly about the numerical array data. I presume that we'd parse any metadata in those files out into a dict somewhere?

To your second post: I'm not certain; there were discussions a long time ago about making databroker documents amendable (such that the history of the amendments is kept so you could go back to the original data), but I don't know where that discussion stands currently. We could have two streams, one for the data and one for the energy, and then change which energy we point to using some searching capabilities (this is what is planned at XPD, I think).
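
A rough sketch of the two-stream idea, with plain dicts standing in for documents (none of this is real databroker/event-model API, and the field names are made up):

```python
# Stream 1: the immutable measurement documents, keyed by a run/sample id.
data_stream = [
    {"uid": "run-001", "sample": "Ni_standard", "data_token": "abc123"},
]

# Stream 2: later corrections/additions, e.g. the wavelength once we learn it.
# Nothing in stream 1 is mutated; we just append and keep the history.
energy_stream = [
    {"uid": "cal-001", "refers_to": "run-001", "wavelength_angstrom": 0.1839,
     "timestamp": 1534000000.0},
    {"uid": "cal-002", "refers_to": "run-001", "wavelength_angstrom": 0.1846,
     "timestamp": 1534100000.0},
]

def current_wavelength(run_uid):
    """'Change which energy we point to' by searching for the newest
    correction that refers to this run."""
    matches = [d for d in energy_stream if d["refers_to"] == run_uid]
    return max(matches, key=lambda d: d["timestamp"])["wavelength_angstrom"]

print(current_wavelength("run-001"))  # -> 0.1846, the latest correction
```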

sbillinge commented 6 years ago

I am not sure of the best way forward. I thought about parsing out the numerical array data, but then we need tools to do that reliably, which is OK when people are using known file formats but could be a big overhead to maintain.

The reason it is an important question is that if we are generating thousands of processed PDFs, F(Q)s, etc., when we decide to store rather than recompute them, do we parse out those arrays to a filestore and propagate a token, or do we just store the arrays in databroker... I don't know the answer.

The current issue is just forcing our hand to make this decision, I guess.

Pro parsing out arrays to filestore:

  1. elegance
  2. doesn't slow down searches in the DB
  3. it's just the right thing to do
  4. ????

Con:

  5. oof, in the future we may be keeping track of every file format and its family, turning this into a full-time job.
  6. overkill?
  7. creating a complex solution where a simple one works just fine?

The answers to 2 and 3 will depend on performance, I guess.

CJ-Wright commented 6 years ago

In the Pro category we should add:

  1. Space friendly (the data could live separately from the metadata database, e.g. on tape)

The file format issue is a problem no matter which way we turn. If we are going to store data we will either:

a) need to parse it on the way in to some uniform storage method (filestore, HDF5, filestore + HDF5, raw JSON, etc.), or
b) need to parse it every time we want to look at the data, if we leave the data on disk in its current format.
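
Either way we end up maintaining some kind of reader registry; a rough sketch (the formats and reader functions here are just illustrative, not a concrete proposal):

```python
from pathlib import Path

import numpy as np

# Map file suffixes to reader functions returning (array, metadata).
# Every new format somebody brings in means another entry here, which is
# exactly the maintenance overhead we are worried about.
READERS = {
    ".xy": lambda path: (np.loadtxt(path), {}),
    ".chi": lambda path: (np.loadtxt(path, skiprows=4), {}),  # assumes a 4-line header
}

def read_pattern(path):
    """Dispatch to the registered reader, whether we parse on ingest (a) or
    on every read (b)."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"no reader registered for {suffix!r} files")
    return reader(path)
```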

I would say the definition of overkill is "creating a complex solution where a simple one works just fine?" :smile: