Closed davenquinn closed 1 month ago
This includes migration to the `macrostrat_xdd` schema, and an overall simpler table design. Major changes include:
There are a few inconsistencies remaining to be addressed
A couple of thoughts and questions about the updated schema:
- The `entity_type` and `relationship_type` tables have two columns: `name` and `description`. Should we add an integer column like `id` as the primary key rather than using the `name` as the primary key/identifier?
- [...] `publication` to represent the source of an article and we can get the necessary information using https://xdd.wisc.edu/api/articles but what does the `citation` represent? You mentioned that you have a script to populate this field, and I think it makes sense to run it when we insert a record into `publication`. How can my server trigger/run that script?
- [...] `model_run_id` alongside each `source_text` but does that make sense? This means that we have a copy of the text in the database each time we have a run that uses that piece of text. Additionally, if we have user feedback, will we create a separate copy for that?
- [...] `model_run` table to `all_runs` and in fields `run_type` (user or model) and `feedback_run_id`, which is used by user runs to store the `model_run` that a user provided feedback on
@sarda-devesh to respond to these points in order:
- The `citation` field is just a JSONB extracted directly from the xDD articles API as such: https://github.com/UW-Macrostrat/macrostrat/blob/8788509f7acea2a07905e918213b29a2a8014edd/cli/macrostrat/cli/subsystems/knowledge_graph/__init__.py#L23. It would be ideal if this citation caching happened up front (in your API) rather than as a separate step.
- The `source_text` model will be independent of an individual model run. I think this key is superfluous now, but we should probably delete it outright.
- [...] `cache_instruction()` for each new article that it sees
- [...] `entity_type`, `relationship_type`, and `publication` tables
@sarda-devesh some additional complexities to think about with user feedback:
I think this could be accomplished by each entity and relationship having a `superseded_by` field that references another row in the same table. We would need ways to mark an entity/relationship as deleted without replacing it. And of course that action (deletion or updating) would need to be tied to a `changeset_id` (maybe the extended `model_run` table you propose?) with a timestamp and username.
Maybe the `model_run` table should be renamed `entity_set`, as such:
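-- column types are placeholders; the original sketch did not specify them
CREATE TABLE entity_set (
  id serial PRIMARY KEY,
  user_id integer,
  model_version text,
  timestamp timestamptz NOT NULL DEFAULT now(),
  -- every set must be attributable to a user, a model, or both
  CHECK (user_id IS NOT NULL OR model_version IS NOT NULL)
);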
I think this is similar to what you were proposing.
In that case, I think that the `superseded_by` field makes sense, as it allows us to build a chain of updates for a relationship/entity, which we can easily traverse to build a training dataset to fine-tune the models.
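To make the chain idea concrete, here is a minimal sketch of how a `superseded_by` chain could be walked with a recursive query. It uses Python's built-in `sqlite3` as a stand-in for the real Postgres database, and the `entity` table layout, the `revision_chain` helper, and the sample rows are all assumptions for illustration, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    superseded_by INTEGER REFERENCES entity(id)  -- self-reference
);
-- Made-up rows: a model output (1) corrected twice by users (2, then 3).
INSERT INTO entity (id, name, superseded_by) VALUES
    (3, 'second correction', NULL),
    (2, 'first correction', 3),
    (1, 'model output', 2);
""")

def revision_chain(conn, entity_id):
    """Return names from entity_id through its latest revision, oldest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, superseded_by, depth) AS (
            SELECT id, name, superseded_by, 0 FROM entity WHERE id = ?
            UNION ALL
            SELECT e.id, e.name, e.superseded_by, c.depth + 1
            FROM entity e JOIN chain c ON e.id = c.superseded_by
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (entity_id,),
    ).fetchall()
    return [name for (name,) in rows]

chain = revision_chain(conn, 1)
# One possible training example: the model's output paired with the final correction.
training_pair = (chain[0], chain[-1])
```

Postgres supports the same `WITH RECURSIVE` form, so the query should carry over with only the placeholder style changed.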
Yeah - I was thinking that `changeset_id` can just be represented by a "user run"
For the `entity_set` table, I think we should have an `entity_type` field that represents whether this is a user run or a model run, as we might still like to store the `model_version` for a user run. Finally, we need to have an `extraction_pipeline_id` somewhere to record which version of the Job Manager was used to produce this result
@davenquinn I added a field called `run_id`, of type text, to the `model_run` table to capture the `run_id` output by the models:
{
  "run_id": "run_2024-04-29_18:56:40.697006",
  "extraction_pipeline_id": "0",
  "model_name": "example_model",
  "model_version": "example_version"
}
Is that fine? I still use the `id` primary key to reference the run in the rest of the tables
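As a rough sketch of that arrangement, the snippet below stores the pipeline's `run_id` as text while other tables would keep pointing at the integer `id`. It uses `sqlite3` in place of Postgres; the column list, the `UNIQUE` constraint on `run_id`, and the `record_run` helper are assumptions, not the actual table definition.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE model_run (
    id INTEGER PRIMARY KEY,             -- internal key referenced by other tables
    run_id TEXT NOT NULL UNIQUE,        -- identifier emitted by the pipeline (UNIQUE is an assumption)
    extraction_pipeline_id TEXT NOT NULL,
    model_name TEXT NOT NULL,
    model_version TEXT NOT NULL
)
""")

# The example payload from this thread.
payload = json.loads("""
{
  "run_id": "run_2024-04-29_18:56:40.697006",
  "extraction_pipeline_id": "0",
  "model_name": "example_model",
  "model_version": "example_version"
}
""")

def record_run(conn, payload):
    """Insert one model run and return its internal integer id."""
    cur = conn.execute(
        "INSERT INTO model_run"
        " (run_id, extraction_pipeline_id, model_name, model_version)"
        " VALUES (:run_id, :extraction_pipeline_id, :model_name, :model_version)",
        payload,
    )
    return cur.lastrowid

internal_id = record_run(conn, payload)
```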
Hey @sarda-devesh – the `run_id` thing works. I didn't realize that field came from the pipeline, so sorry to have deleted it. Is that reference stored anywhere else, e.g., weaviate?
The `extraction_pipeline_id` is fine too.
The thing that worries me about merging the `model_run` and `entity_set` tables is that model outputs will require certain metadata to be set (e.g., the `extraction_pipeline_id` and the `run_id`) while user-supplied feedback will require a different set of metadata (`user_id`, mostly). So we'll need a fancy check constraint or something if we want to catch bad data. But this isn't too worrisome.
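One way such a check constraint might look is sketched below, with each run type requiring its own metadata. This is only an illustration: it runs on `sqlite3` rather than Postgres, the `run_type` discriminator follows the "user or model" idea raised earlier in the thread, and the exact rules (for example, that model runs must not carry a `user_id`) and the `try_insert` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entity_set (
    id INTEGER PRIMARY KEY,
    run_type TEXT NOT NULL CHECK (run_type IN ('model', 'user')),
    run_id TEXT,
    extraction_pipeline_id TEXT,
    model_version TEXT,          -- optional for user runs
    user_id INTEGER,
    CHECK (
        (run_type = 'model' AND run_id IS NOT NULL
            AND extraction_pipeline_id IS NOT NULL AND user_id IS NULL)
        OR
        (run_type = 'user' AND user_id IS NOT NULL)
    )
)
""")

def try_insert(**cols):
    """Attempt an insert; return False if a constraint rejects the row."""
    names = ", ".join(cols)
    marks = ", ".join("?" for _ in cols)
    try:
        conn.execute(
            f"INSERT INTO entity_set ({names}) VALUES ({marks})",
            tuple(cols.values()),
        )
        return True
    except sqlite3.IntegrityError:
        return False

ok_model = try_insert(run_type="model", run_id="run_x",
                      extraction_pipeline_id="0", model_version="v1")
ok_user = try_insert(run_type="user", user_id=7, model_version="v1")
bad_model = try_insert(run_type="model", user_id=7)   # missing pipeline metadata
bad_user = try_insert(run_type="user")                # missing user_id
```

The same `CHECK` expression is valid Postgres syntax, so the constraint itself should transfer directly.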
`run_id` is stored by the job manager to track jobs
The nice thing about a `superseded_by` field is that we can get the most up-to-date graph by selecting everything where `superseded_by IS NULL`. For deletions, I guess we can just have a "deleted" boolean flag for both relationships and entities
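A small sketch of that query, assuming a `deleted` flag sits next to `superseded_by` on the entity table. The table layout, the `current_entities` helper, and the sample rows are hypothetical, and `sqlite3` stands in for Postgres (where `deleted` would be a real boolean).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    superseded_by INTEGER REFERENCES entity(id),
    deleted INTEGER NOT NULL DEFAULT 0   -- 0/1 flag; boolean in Postgres
);
INSERT INTO entity (id, name, superseded_by, deleted) VALUES
    (1, 'original',  2,    0),   -- replaced by row 2
    (2, 'corrected', NULL, 0),
    (3, 'spurious',  NULL, 1),   -- removed by a user, no replacement
    (4, 'untouched', NULL, 0);
""")

def current_entities(conn):
    """Names of entities in the most up-to-date graph."""
    rows = conn.execute(
        "SELECT name FROM entity"
        " WHERE superseded_by IS NULL AND deleted = 0"
        " ORDER BY id"
    ).fetchall()
    return [name for (name,) in rows]

latest = current_entities(conn)
```

The relationship table would presumably get the same two columns and the same filter.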
Just bumping that I don't have permissions for the tables `entity_type`, `relationship_type`, and `publication`
Starting point to address the schema and management elements of #90, to support