Schema integration for xDD connections

davenquinn commented 2 months ago

Starting point to address the schema and management elements of #90, to support

Vector embeddings for map legend data
Knowledge graph/structured data extraction from papers

davenquinn commented 2 months ago

This includes migration to the macrostrat_xdd schema, and an overall simpler table design. Major changes include:

Using integer ids instead of UUIDs
Using an extensible, foreign-keyed table for entity and relationship types (instead of a custom enum)
Terser column/table names
Model for citations
Explicit text start and end indices for entities

There are a few inconsistencies remaining to be addressed

A triangular dependency graph between model_run, source_text, and entity/relationship tables.
The source_text table contains nearly-identical source text windows that differ only by post-processing, which makes it hard to represent different relationships together. this may be a problem with Weaviate
We haven't yet figured out how to integrate feedback and Macrostrat's larger user model

sarda-devesh commented 2 months ago

A couple of thoughts and questions about the updated schema:

Currently, the entity_type and relationship_type tables have two columns: name and description. Should we add in an integer column like id as the primary key rather than using the name as the primary key/identifier.
We have a publication to represent the source of an article and we can get the necessary information using https://xdd.wisc.edu/api/articles but what does the citation represent? You mentioned that you have a script to populate this field and I think it makes sense to run when we are inserting in a record into the publication. How can my server trigger/run that script?
We are still storing a model_run_id alongside each source_text but that does make sense? This means that we have a copy of the text in the database each them we have a run that uses that piece of text. Additionally, if we have user feedback will be create a separate copy for that
Finally, I was thinking we should rename the model_run table to all_runs and in fields run_type (user or model) and feedback_run_id which is used by user runs to store the model_run that a user provided feedback on

davenquinn commented 2 months ago

@sarda-devesh to respond to these points in order:

I am fine with either using an integer or the string representation of the type field — whatever is easier to set. In fact I almost made this change but held back from it. Happy to go this direction if you want.
The citation field is just a JSONB extracted directly from the xDD articles API as such: https://github.com/UW-Macrostrat/macrostrat/blob/8788509f7acea2a07905e918213b29a2a8014edd/cli/macrostrat/cli/subsystems/knowledge_graph/__init__.py#L23. It would be ideal if this citation caching happened up front (in your API) rather than as a separate step.
Yes, this is one of the major remaining issues with the current design. ideally the source_text model will be independent of an individual model run. I think this key is superfluous now, but we should probably delete it outright.
That could work, but let me think a bit more about this as I do some UI prototyping

sarda-devesh commented 2 months ago

I think this might be helpful if we want to add more types or potentially support if we want to have more fine grained definitions for these types.
I can update my server to effectively perform the cache_instruction() for each new article that it sees
I also think doing so might reveal if we are missing any additional metadata to represent a "unique piece of text"
Finally, I don't have read/write permissions to the entity_type, relationship_type, and publication tables

davenquinn commented 2 months ago

@sarda-devesh some additional complexities to think about with user feedback:

we want users to be able to provide feedback on individual entities
we want the "validated entities" to be linked back to the non-validated model output that they are based on.

I think this could be accomplished by each entity and relationship having a superseded_by field that references itself. We would need ways to mark an entity/relationship as deleted without replacing it. And of course that action (deletion or updating) would need to be tied to a changeset_id (maybe the extended model_run field you propose?) with a timestamp and username.

davenquinn commented 2 months ago

maybe the model_run field should be renamed entity_set as such:

CREATE TABLE entity_set (
   id,
  user_id,
  model_version,
  timestamp,
  CHECK user_id IS NOT NULL OR model_version IS NOT NULL
);

I think this is similar to what you were proposing/

sarda-devesh commented 2 months ago

In that case, I think that the superseded_by field makes sense as it allows us to build a chain of updates for a relationship/entitity, which we can easily traverse to build a training dataset to fine-tune the models.

Yeah - I was thinking that changset_id can just be represented by a "user run"

sarda-devesh commented 2 months ago

For the entity_set table I think we should have a entity_type field which represents if this is a user run or a model run as we could still like to store the model_version for a user run? Finally, we need to have a extraction_pipeline_id version somewhere which is used to represent which version of the Job Manager was used to produce this result

sarda-devesh commented 2 months ago

@davenquinn I added a field called run_id which of type text into the model_run table to capture the run_id outputted by the models:

{
    "run_id": "run_2024-04-29_18:56:40.697006",
    "extraction_pipeline_id": "0",
    "model_name": "example_model",
    "model_version" : "example_version", 
}

Is that fine? I still use the id primary key to reference the run in the rest of the tables

davenquinn commented 2 months ago

Hey @sarda-devesh – the run_id thing works. I didn't realize that field came from the pipeline, so sorry to have deleted. Is that reference stored anywhere else, e.g., weaviate?

The extraction_pipeline_id is fine too.

The thing that worries me about merging the model_run and entity_set tables is that model outputs will require certain metadata to be set (e.g., the extraction_pipeline_id and the run_id) while user-supplied feedback will require a different set of metadata (user_id, mostly). So we'll need a fancy check constraint or something if we want to catch bad data. But this isn't too worrisome.

sarda-devesh commented 2 months ago

The run_id is stored by the job manager to track jobs
In that case, I think having two separate tables - one for model runs and one for "user runs" makes the most sense?

davenquinn commented 2 months ago

The nice thing about a superseded_by field is that we can get the most up-to-date graph by selecting everything where superseded_by IS NULL. For deletions, I guess we can just have a "deleted" boolean flag for both relationships and entities

sarda-devesh commented 2 months ago

Just bumping that I don't have permissions for the tables entity_type, relationship_type, and publication

UW-Macrostrat / macrostrat

Schema integration for xDD connections #93