Sage-Bionetworks / schematic_db

A python package that manages data backends for schematic
0 stars 1 forks source link

In JIRA: schematic_db: test PostgreSQL logic with iAtlas schema #123

Closed milen-sage closed 1 year ago

milen-sage commented 1 year ago

Reproduce a version of the iAtlas DB schema and DB instance using schematic DB.

AC:

milen-sage commented 1 year ago

Data model documentation: https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/2473623559/The+Data+Model+Schema

GiaJordan commented 1 year ago

Data Modeling Workshop slides

milen-sage commented 1 year ago

@andrewelamb schematic's example data model csv may be handy as well (it's in the schematic repo).

andrewelamb commented 1 year ago

Schema repo

andrewelamb commented 1 year ago

@GiaJordan @mialy-defelice Me and @milen-sage just had a meeting and we agreed on these chnages:

  1. When schematic creates the uuid's during manifest validation we want this column to be named "id" instead of "uuid" going forward.
  2. When a a validation rule is created like: "matchAtLeastOne patient.id" the attribute with the rule will be matched with the uuid's in the "id" column of the patient component even if patient doesn't have an attribute named "id"
  3. schematic_db will assume the "id" column exists in all manifests and this will serve as the primary key in the database and will also be called "id".
  4. schematic_db will assume that if a component depends on other components then it will have a foreign key to that component and will be named "<table>_id" where table is the name of the component the foreign key exists in, and this will have the validation rule "matchAtLeastOne <table>.id"

Gianna and Mialy, could you comment on if doing no.1 would be acceptable, and if no.2 is true?

Milen is there anything I missed?

mialy-defelice commented 1 year ago
  1. When schematic creates the uuid's during manifest validation we want this column to be named "id" instead of "uuid" going forward.

This seems acceptable to me.

  1. When a a validation rule is created like: "matchAtLeastOne patient.id" the attribute with the rule will be matched with the uuid's in the "id" column of the patient component even if patient doesn't have an attribute named "id"

This part "even if patient doesn't have an attribute named "id"" is throwing me off. Can you elaborate more? I thought you're forcing all tables to have an id column...

  1. schematic_db will assume the "id" column exists in all manifests and this will serve as the primary key in the database and will also be called "id".
  2. schematic_db will assume that if a component depends on other components then it will have a foreign key to that component and will be named "<table>_id" where table is the name of the component the foreign key exists in, and this will have the validation rule "matchAtLeastOne <table>.id"

These parts make me nervous wrt the GFF project since they are making their own PKs and end up referencing them in other tables... will these changes be backwards compatible?

andrewelamb commented 1 year ago
  1. When a a validation rule is created like: "matchAtLeastOne patient.id" the attribute with the rule will be matched with the uuid's in the "id" column of the patient component even if patient doesn't have an attribute named "id"

This part "even if patient doesn't have an attribute named "id"" is throwing me off. Can you elaborate more? I thought you're forcing all tables to have an id column...

Yep. Say we have two components, patients and samples where samples depends on patients. The patients component only has one attribute, "patient_name", so a column of uuids is generated by schematic with the column name "id". The samples component only has two attributes "sample_name" and "patient_id". What I want to know is if we give the "patient_id" attribute the validation rule "matchAtLeastOne patient.id" will schematic ensure that every value in the "patient_id" column will be matched to the "id" column in patients that schematic generated even if it's not in the schema.

  1. schematic_db will assume the "id" column exists in all manifests and this will serve as the primary key in the database and will also be called "id".
  2. schematic_db will assume that if a component depends on other components then it will have a foreign key to that component and will be named "<table>_id" where table is the name of the component the foreign key exists in, and this will have the validation rule "matchAtLeastOne <table>.id"

These parts make me nervous wrt the GFF project since they are making their own PKs and end up referencing them in other tables... will these changes be backwards compatible?

@milen-sage I think you should chime in on this. I think he mentioned that we are going to have them go through DCA in the future so full backwards compatibility might be possible anyway. But, in this system they could still make their own keys. They would just have to name them in the same way:

andrewelamb commented 1 year ago

@mialy-defelice To make schematic_db more backwards compatible I can keep in some code I was going to remove that allowed users to create a function that would generate the names of primary keys based on the table. I was going to remove it, but I cna keep it in for now. It was designed as GFF keys as an example.

GiaJordan commented 1 year ago
  1. When schematic creates the uuid's during manifest validation we want this column to be named "id" instead of "uuid" going forward.

Also fine with me

  1. When a a validation rule is created like: "matchAtLeastOne patient.id" the attribute with the rule will be matched with the uuid's in the "id" column of the patient component even if patient doesn't have an attribute named "id"

This is possible, but will require me to tweak parts of how the validation rule warnings are generated. It shouldn't be a big fix however. I may be able to roll it into the fix for #1074. This could be an issue with existing manifests, as they would need to be re-uploaded or otherwise updated to change the Uuid column to id as the cross manifest rule looks at already uploaded manifests.

@andrewelamb

GiaJordan commented 1 year ago

Yep. Say we have two components, patients and samples where samples depends on patients. The patients component only has one attribute, "patient_name", so a column of uuids is generated by schematic with the column name "id". The samples component only has two attributes "sample_name" and "patient_id". What I want to know is if we give the "patient_id" attribute the validation rule "matchAtLeastOne patient.id" will schematic ensure that every value in the "patient_id" column will be matched to the "id" column in patients that schematic generated even if it's not in the schema.

@andrewelamb I can update the error/warning generation so that this is supported. The values in the patient_id column would have to be the same uuids as the id column

milen-sage commented 1 year ago

@andrewelamb the way you describe how primary keys supplied by users can be handled is accurate. This approach seems backward compatible with the GFF project too, given the bigger changes in their workflow that would be required.

milen-sage commented 1 year ago

@GiaJordan point taken on scenarios where users might want to cross validate ids in existing manifest containing uuid. In the case of GFF, manifests will be reuploaded anyway. In the case of HTAN, curators wouldn't make use of this id cross validating functionality since their schema is not setup with RDB conventions.

MiekoHash commented 1 year ago

Thanks, @andrewelamb , for bringing up this ticket to our attention. I feel like this is a good example ticket that would require a discussion in preparation for the refinement meeting in order to identify what's left to do, for whom, alignment with story map, etc.. I will defer to the dev team to decide whether to handle it via one-off vs recurring meeting. cc: @mialy-defelice , @GiaJordan , @milen-sage , @AmyHeiser

MiekoHash commented 1 year ago

Move to JIRA: https://sagebionetworks.jira.com/browse/FDS-100