AI-multimodal / aimmdb

BSD 3-Clause "New" or "Revised" License

A FEFF schema #37

Open matthewcarbone opened 1 year ago

matthewcarbone commented 1 year ago

A FEFF schema

In this issue, I'll outline the plan for constructing a schema for FEFF data. We wish to store FEFF data for two purposes:

  1. Medium-term storage
  2. As an intermediary for storing jobs that have not yet been run

Point 2 is the more interesting one here. I would like the FEFF schema to allow for two "states" of completeness.

Pending calculation: the data would consist of an empty data frame with just the column names. The metadata would contain only the information required for submitting a job.

Complete calculation: the data will contain the actual spectral data/FEFF output. The metadata will contain output logs in addition to everything contained in the pending calculation.
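The two states above can be sketched as plain dictionaries. Every field name here is a hypothetical placeholder, not part of any finalized aimmdb schema; the only structural points being illustrated are that the completed record is a strict superset of the pending one and that both carry the same linking identifier.

```python
# Hypothetical sketch of the two metadata "states"; field names are
# placeholders, not a finalized aimmdb schema.

pending_metadata = {
    "sample_id": "mol-000123-site-0",  # links pending and completed entries
    "state": "pending",
    "feff_input": {                    # just the info required to submit a job
        "edge": "K",
        "absorbing_site": 0,
        "spectrum_type": "XANES",
    },
}

completed_metadata = {
    **pending_metadata,                # everything from the pending record...
    "state": "complete",
    "feff_output_log": "contents of the FEFF output log",  # ...plus logs
}
```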

Schema plan

Instead of one schema for both incomplete and complete jobs, let's have two schemas, one for completed FEFF jobs and one for incomplete jobs. I will detail below (lots of edits).

The data

Completed FEFF jobs

FEFF9 spectral output is quite simple. It consists of columnar data with the following columns:

Each column simply contains floats. This should be quite straightforward to implement.

Incomplete FEFF jobs

The DataFrame will have the same columns but will be trivially empty.
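A minimal pandas sketch of the two data payloads. The column names below follow FEFF9's xmu.dat convention but should be treated as placeholders until the schema is pinned down; the point is only that completed and incomplete jobs share the same float-typed columns, with the incomplete frame carrying zero rows.

```python
import pandas as pd

# Placeholder column names (xmu.dat-style); not the finalized schema.
COLUMNS = ["omega", "e", "k", "mu", "mu0", "chi"]

# Completed job: columnar float data, one row per energy point.
completed = pd.DataFrame(
    [[7112.0, 0.0, 0.0, 0.021, 0.019, 0.1]],  # one illustrative row
    columns=COLUMNS,
)

# Incomplete job: same columns, trivially empty.
pending = pd.DataFrame(columns=COLUMNS, dtype=float)
```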

The metadata

Note that complete and incomplete FEFF jobs will be linked by a metadata field analogous to sample_id; I think we can actually just call it sample_id. For example, a molecule-site pair will have one entry in the incomplete database and one in the complete database (once the job is done); these two data points will be linked by this sample_id. This field is always required.
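The linking behavior can be shown with a toy lookup over two in-memory collections; the record shapes are illustrative only, and any real query would go through whatever search interface aimmdb exposes.

```python
# Illustrative only: pair up pending/complete records sharing a sample_id.
pending_records = [{"sample_id": "mol-0001-site-0", "state": "pending"}]
complete_records = [{"sample_id": "mol-0001-site-0", "state": "complete"}]

def linked_pairs(pending, complete):
    """Return (pending, complete) record pairs that share a sample_id."""
    by_id = {r["sample_id"]: r for r in complete}
    return [
        (p, by_id[p["sample_id"]])
        for p in pending
        if p["sample_id"] in by_id
    ]

pairs = linked_pairs(pending_records, complete_records)
```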

Common metadata that will be searchable:

Completed FEFF jobs

Incomplete FEFF jobs

Comments

@danielballan I know this might not be exactly what you had in mind as far as aimmdb's use cases are concerned, but I would love your feedback on this. We'll be using it for dynamic querying of completed FEFF spectra for inverse design of molecules, and for Mike's really cool frontend GUI for visualizing XAS.

If this idea works we can duplicate the principle for e.g. Gaussian and do geometry optimization.

Finally, this does have a multi-modal aspect, since for a given molecule we'll compute e.g. the C, N and O XANES and use them all for multi-modal structure refinement.

danielballan commented 1 year ago

The schema validation requirements are all supported. Tiled core supports updating metadata, such as to move a dataset between “states” as you describe, but we have not worked that feature into aimmdb (which we should think of as a Tiled plugin). With Joe’s departure and a high priority on ingesting new datasets, this may be more than a month away, but it will certainly happen.

Would you be polling for the set of empty datasets and using Tiled as a kind of work queue? It lacks the synchronization primitives you would get from Redis or Kafka or Celery. For a single worker this may be fine. If you may grow multiple concurrent workers it may be better to move the work queue into a real queue and only store the finished results in Tiled.

matthewcarbone commented 1 year ago

Tiled core supports updating metadata, such as to move a dataset between “states” as you describe, but we have not worked that feature into aimmdb (which we should think of as a Tiled plugin).

I see. In principle, then, we could just have two schemas for now: one for completed jobs and one for incomplete jobs. It's certainly a hack, but it will let us use the aimmdb framework as-is, at least for initial testing. Once this feature is merged in, we can adopt it.

Would you be polling for the set of empty datasets and using Tiled as a kind of work queue? It lacks the synchronization primitives you would get from Redis or Kafka or Celery. For a single worker this may be fine.

Yes, and it would be only a single worker/machine. Basically, on HPC I will have a cronjob (or something similar) that, every minute, pings aimmdb for incomplete jobs and pulls them down (then pushes the results back after they complete). Similarly, every few minutes on my local machine, I will ping aimmdb for completed jobs.
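The single-worker polling step described above can be sketched as follows. The client interface (a `.search()` that takes a query and returns a key-to-dataset mapping) and the `state` metadata field are assumptions, not confirmed aimmdb/Tiled API; a real worker would connect via the Tiled client rather than the in-memory stand-in used here.

```python
# Sketch of one polling cycle for a single worker. The .search() signature
# and the "state" field are assumptions, not confirmed aimmdb/Tiled API.

def poll_incomplete(client, run_job):
    """Find pending jobs, run each one, and return {key: result}."""
    results = {}
    for key, dataset in client.search({"state": "pending"}).items():
        results[key] = run_job(dataset)  # e.g. write FEFF inputs and submit
    return results

# In-memory stand-in for illustration; a real worker would use a Tiled
# client connection and an aimmdb-specific query object instead.
class InMemoryClient:
    def __init__(self, records):
        self._records = records

    def search(self, query):
        return {
            key: rec
            for key, rec in self._records.items()
            if rec["metadata"]["state"] == query["state"]
        }

client = InMemoryClient({
    "job-1": {"metadata": {"state": "pending"}},
    "job-2": {"metadata": {"state": "complete"}},
})
finished = poll_incomplete(client, lambda ds: "ran")
```

On HPC this function would be invoked from the cron-driven script; pushing completed results back would use whatever write path aimmdb exposes.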

If you may grow multiple concurrent workers it may be better to move the work queue into a real queue and only store the finished results in Tiled.

No doubt, and I'm exploring those options too, but I don't have a better database solution than this one right now, and doing this has other indirect benefits, like letting you guys stress test the database a bit more. It's also the path of least resistance for me and Mike, and it will get all of us a nice scientific paper (hopefully!) 😊

matthewcarbone commented 1 year ago

Ok after chatting with Mike and seeing Dan's thumbs up, it seems to make the most sense to have two separate schemas. I'm going to update the main post here with the details.