UW-Macrostrat / macrostrat

A geological data platform for exploration, integration, and analysis
Apache License 2.0
4 stars 1 forks source link

Schema integration for xDD connections #93

Closed davenquinn closed 1 month ago

davenquinn commented 2 months ago

Starting point to address the schema and management elements of #90, to support

davenquinn commented 2 months ago

This includes migration to the macrostrat_xdd schema, and an overall simpler table design. Major changes include:

There are a few inconsistencies remaining to be addressed

sarda-devesh commented 2 months ago

A couple of thoughts and questions about the updated schema:

davenquinn commented 2 months ago

@sarda-devesh to respond to these points in order:

  1. I am fine with either using an integer or the string representation of the type field — whatever is easier to set. In fact I almost made this change but held back from it. Happy to go this direction if you want.
  2. The citation field is just a JSONB extracted directly from the xDD articles API as such: https://github.com/UW-Macrostrat/macrostrat/blob/8788509f7acea2a07905e918213b29a2a8014edd/cli/macrostrat/cli/subsystems/knowledge_graph/__init__.py#L23. It would be ideal if this citation caching happened up front (in your API) rather than as a separate step.
  3. Yes, this is one of the major remaining issues with the current design. ideally the source_text model will be independent of an individual model run. I think this key is superfluous now, but we should probably delete it outright.
  4. That could work, but let me think a bit more about this as I do some UI prototyping
sarda-devesh commented 2 months ago
  1. I think this might be helpful if we want to add more types or potentially support if we want to have more fine grained definitions for these types.
  2. I can update my server to effectively perform the cache_instruction() for each new article that it sees
  3. I also think doing so might reveal if we are missing any additional metadata to represent a "unique piece of text"
  4. Finally, I don't have read/write permissions to the entity_type, relationship_type, and publication tables
davenquinn commented 2 months ago

@sarda-devesh some additional complexities to think about with user feedback:

  1. we want users to be able to provide feedback on individual entities
  2. we want the "validated entities" to be linked back to the non-validated model output that they are based on.

I think this could be accomplished by each entity and relationship having a superseded_by field that references itself. We would need ways to mark an entity/relationship as deleted without replacing it. And of course that action (deletion or updating) would need to be tied to a changeset_id (maybe the extended model_run field you propose?) with a timestamp and username.

davenquinn commented 2 months ago

maybe the model_run field should be renamed entity_set as such:

CREATE TABLE entity_set (
   id,
  user_id,
  model_version,
  timestamp,
  CHECK user_id IS NOT NULL OR model_version IS NOT NULL
);

I think this is similar to what you were proposing/

sarda-devesh commented 2 months ago

In that case, I think that the superseded_by field makes sense as it allows us to build a chain of updates for a relationship/entitity, which we can easily traverse to build a training dataset to fine-tune the models.

Yeah - I was thinking that changset_id can just be represented by a "user run"

sarda-devesh commented 2 months ago

For the entity_set table I think we should have a entity_type field which represents if this is a user run or a model run as we could still like to store the model_version for a user run? Finally, we need to have a extraction_pipeline_id version somewhere which is used to represent which version of the Job Manager was used to produce this result

sarda-devesh commented 2 months ago

@davenquinn I added a field called run_id which of type text into the model_run table to capture the run_id outputted by the models:

{
    "run_id": "run_2024-04-29_18:56:40.697006",
    "extraction_pipeline_id": "0",
    "model_name": "example_model",
    "model_version" : "example_version", 
}

Is that fine? I still use the id primary key to reference the run in the rest of the tables

davenquinn commented 2 months ago

Hey @sarda-devesh – the run_id thing works. I didn't realize that field came from the pipeline, so sorry to have deleted. Is that reference stored anywhere else, e.g., weaviate?

The extraction_pipeline_id is fine too.

The thing that worries me about merging the model_run and entity_set tables is that model outputs will require certain metadata to be set (e.g., the extraction_pipeline_id and the run_id) while user-supplied feedback will require a different set of metadata (user_id, mostly). So we'll need a fancy check constraint or something if we want to catch bad data. But this isn't too worrisome.

sarda-devesh commented 2 months ago
  1. The run_id is stored by the job manager to track jobs
  2. In that case, I think having two separate tables - one for model runs and one for "user runs" makes the most sense?
davenquinn commented 2 months ago

The nice thing about a superseded_by field is that we can get the most up-to-date graph by selecting everything where superseded_by IS NULL. For deletions, I guess we can just have a "deleted" boolean flag for both relationships and entities

sarda-devesh commented 2 months ago

Just bumping that I don't have permissions for the tables entity_type, relationship_type, and publication