Sage-Bionetworks / schematic

Package for biomedical data model and metadata ingress management
https://schematicpy.readthedocs.io/en/latest/cli_reference.html
MIT License
21 stars 24 forks source link

schematic RDB: adding new columns to DB tables (example NF GFF) #757

Closed milen-sage closed 2 years ago

milen-sage commented 2 years ago

NF GFF DB requires adding a few more fields to their DB results:

This should be a good intro issue for working with scheamtic's RDB layer; compare and contrast with the setup of iAtlas.

AC: DB query results and corresponding Synapse tables for the GFF project have been successfully updated with the three fields above.

allaway commented 2 years ago

Related to https://github.com/nf-osi/nf-research-tools-schema/pull/13

allaway commented 2 years ago

I've "staged" the data temporarily here: https://www.synapse.org/#!Synapse:syn31002734/tables/ (in the database, howToAcquire will probably be an attribute of the Resource table and not it's own separate table).

We can either ingress the data properly into the database, or if need be and time is running short, can create a MaterializedView that includes a JOIN to this table to make the data available to platform devs.

allaway commented 2 years ago

I'd like to slight modify the AC for this issue @milen-sage / @andrewelamb if that is OK. We have another issue https://sagebionetworks.jira.com/browse/NFINT-664 that we can resolve at the same time. In addition to the tables mentioned in the initial ticket, can we also please add the Funder and Donor tables in this join? These should all be 1-1 relationships so they should not cause any unexpected replication of tools.

Thank you!

milen-sage commented 2 years ago

@andrewelamb what's the state of this issue?

ychae commented 2 years ago

Jay's web dev team needs these features by 6/16

andrewelamb commented 2 years ago

@allaway I'm still getting up to speed on how this whole workflow happens so hopefully I don't butcher this: @mialy-defelice and me looked through this and the tables we pull the data from need to have the howToAcquire column added.

andrewelamb commented 2 years ago

@milen-sage @allaway Do the Funder and Donor tables need to be added by 6/16, or is the original ask good enough for that purpose?

allaway commented 2 years ago

@allaway I'm still getting up to speed on how this whole workflow happens so hopefully I don't butcher this: @mialy-defelice and me looked through this and the tables we pull the data from need to have the howToAcquire column added.

Done for the Cell Line , Animal Model, Antibody, Biobank, and Genetic Reagent "Resource" manifests here: https://drive.google.com/drive/u/0/folders/15LKpNqrWMvTrbJFI-eFNCm4T4U6VJa6_ !

@milen-sage @allaway Do the Funder and Donor tables need to be added by 6/16, or is the original ask good enough for that purpose?

It would be helpful to have these tables added to the join by 6/16.

andrewelamb commented 2 years ago

@allaway could you give me access to that drive link?

allaway commented 2 years ago

Done @andrewelamb !

andrewelamb commented 2 years ago

@allaway I'm getting a 404 error when trying that link, maybe it's not a permissions issue.

allaway commented 2 years ago

For some reason github doesn't include the underscore as part of the URL.

Try this

andrewelamb commented 2 years ago

@allaway The Investigator -> Development -> Resource join is a many to many relationship (with Development making it 2 one to many relationships). I'm doing inner joins on these assuming you don't want Investigators with no Resources, or Resources with no Investigators, is that correct?

allaway commented 2 years ago

@andrewelamb - thanks for checking! I would like left joins (In Resource x Development / Development x Investigator if it is doable. We do want to keep all of the resources in the final table (i.e. we still want Resources with no Investigators). We don't need or want Investigators with no Resources, though.

allaway commented 2 years ago

Oops pressed the wrong button.

andrewelamb commented 2 years ago

@allaway Same question for funders :)

allaway commented 2 years ago

Same thing! We don't want to eliminate any Resources based on missing info in other tables. So Resources without Funders should be included, but we don't care about Funders without Resources.

andrewelamb commented 2 years ago

So to get to donor we have to go through animal model and/or cell line. How should we do that? I assume the joins are the same as above in regards to joins?

milen-sage commented 2 years ago

@andrewelamb I believe there are some example queries that do this kind of multi table join in the query config of the jupyter notebook @mialy-defelice had put together? Did you take a look at those?

andrewelamb commented 2 years ago

@allaway Besides the "How To Acquire" column is there anything missing from this result: Any columns you don't want?

database_result

allaway commented 2 years ago

Hey Andrew,

Sorry for not laying this issue out more clearly. We'd like these tables to be added to our "mega-join" table: https://www.synapse.org/#!Synapse:syn26438037/tables/

It's currently a join across: Resource_CellLine_AnimalModel_GeneticReagent_Antibody and we'd like to add in Investigator and Donor and Funder. I'm guessing @mialy-defelice has the rest of that join defined somewhere in the notebook (I haven't had a chance to look myself).

andrewelamb commented 2 years ago

@allaway Do you want all the columns from investigators, funders and donors?

allaway commented 2 years ago

Yes, thanks!

allaway commented 2 years ago

@ychae , I think there may have been some miscommunication from our end - Jay's team will need more time to complete their work (probably mainly as it comes to https:/sagebionetworks.jira.com/browse/PORTALS-2178/ which requires implementing some new designs).

I am planning on creating a mock data table for them to use (just using joins in R), and then we can replace with the table that @andrewelamb is creating.

Thanks all!

andrewelamb commented 2 years ago

@allaway The join is ready to go, I'm just dealing with permission issues getting the updated tables uploaded to synapse.

allaway commented 2 years ago

Oh, that's great! Thanks @andrewelamb !

I probably just need to add you to the project?

allaway commented 2 years ago

Oh nevermind you should already have admin access, so I'm guessing you are wrestling with a different permissions issue than I thought.

andrewelamb commented 2 years ago

Yep, working on it :)

andrewelamb commented 2 years ago

@allaway I've got the permission resolved, but I'm getting an error on the new "howtoAcquire" column.

SynapseHTTPError: 400 Client Error: 
The first line is expected to be a header but the values do not match the names of of the columns of the table (howtoAcquire is not a valid column name or id). Header row: resourceId,geneticReagentId,antibodyId,cellLineId,animalModelId,rrid,resourceName,dateModified,mTARequired,howtoAcquire,description,usageRequirements,dateAdded,resourceType,synonyms,md5_id

@mialy-defelice cna confirm, but it looks like it's spelled one way in the jsonld:

        {
            "@id": "bts:howtoAcquire",
            "@type": "rdf:Property",
            "rdfs:comment": "How to acquire a particular resource. ",
            "rdfs:label": "howtoAcquire",
            "schema:domainIncludes": {
                "@id": "bts:Resource"
            },
            "schema:isPartOf": {
                "@id": "http://schema.biothings.io"
            },
            "sms:displayName": "How to Acquire",
            "sms:required": "sms:true",
            "sms:validationRules": []
        },

but another way in the manifests:

howToAcquire

The difference being in if "to/To" is capitalized or not.

milen-sage commented 2 years ago

@mialy-defelice @andrewelamb this [updated link] may be related? This issue can be tackled in the next sprint, but there is a quick manual fix?

allaway commented 2 years ago

I just updated the schema to How To Acquire: https://github.com/nf-osi/nf-research-tools-schema/blob/main/nf-research-tools.jsonld

Let me know if this does not work!

mialy-defelice commented 2 years ago

Hey @allaway , you don't need to change the model. This is a camelcase mismatch on our end (how camelcase gets generated for creating the RDB vs creating the manifests) we currently have a workaround to fix it in the short term. The fix will make sure the RDB branch matches the main schematic branch that creates the manifests. So your original version was 'correct'.

mialy-defelice commented 2 years ago

Hey @milen-sage this is an issue with camelcase mismatch. We have a global variable defined to deal with the few times we have encountered this. I think there is an issue out there (your link isn't pointing to a specific one when I click) about dealing with camel case mismatches between the main Schematic branch and the RDB branch.

milen-sage commented 2 years ago

@mialy-defelice thanks for noting - I updated the link. GH project board (beta) changed how issues are opened and the address bar link no longer contains the url of the issue... it just contains the url of the project board and any associated filters: very annoying when linking in a hurry :)

mialy-defelice commented 2 years ago

@milen-sage I think they are two separate issues. I can't find another issue so I can create one.

The linked issue has to do with how the Portal decodes camelcase (for this one I needed to capitalize a letter manually to fix quick for Jay -- might only happen with the word Of).

The issue here has to do with how schematic creates camelcase vs how the RDB creates camelcase. This means we need to make things consistent so the RDB method matches the schematic method.

allaway commented 2 years ago

Thanks all. I am happy to revert the change of to->To - let me know what is preferred! :)

allaway commented 2 years ago

Ah I see this now: So your original version was 'correct'.

I will revert the change. But since both versions will be available thru git history, let me know which one you end up using!

allaway commented 2 years ago

Hi all: will the final table on Synapse have "howtoAcquire" or "howToAcquire" as the column header?

ychae commented 2 years ago

@allaway do you need it to be something specific?

allaway commented 2 years ago

It doesn't really matter, I can just change the column name on Synapse after it's uploaded to "howToAcquire" which is what Jay's team is expecting. :)

mialy-defelice commented 2 years ago

@allaway did you manually convert that column name to camelcase? When making the manifest it actually shows up as the display name. I was under the impression the naming was coming directly from Schematic itself...

allaway commented 2 years ago

Sorry for the confusion. The manifests are the same manifests you generated, I just added this column. That error is on me. It should be changed to How to Acquire in the spreadsheet.

allaway commented 2 years ago

Apologies, I did not understand that the spreadsheets are what you all were talking about, I thought you were talking about the schema.

mialy-defelice commented 2 years ago

It's no trouble, there are many places where camelcase discrepancies come into play...running into another one right now ( :( ). Hopefully we will get this up soon.

mialy-defelice commented 2 years ago

@ychae @allaway @andrewelamb The table is loaded!!! https://www.synapse.org/#!Synapse:syn31956063/tables/ Robert let me know if there is something we need to update in the query.

mialy-defelice commented 2 years ago

Also I enabled full text search and a bunch of facets but they may need to be added to or trimmed...

allaway commented 2 years ago

Thank you both! This is great. I am about to step out for a few hours but I will check this evening to let you know.

We may need to transfer the data to syn26438037, since that is the equivalent prod table and is the one flagged OPEN_ACCESS. I'm not sure we can get approval to flag a new entity open access before Friday :)

mialy-defelice commented 2 years ago

Do you know how to do this easily? The ids for all the entries are not the same anymore so a simple update is not possible. The md5_id are generated from all the ID columns so since we have more Ids present the hashes will all be new.

allaway commented 2 years ago

I would do this manually by versioning the table (to save the old version), deleting all of the table rows without deleting the table itself, modify the schema as necessary (for new cols) and upload new rows. :) Does that make sense/am I missing something?

mialy-defelice commented 2 years ago

That makes sense! That was my fear lol. I am always terrified of messing with the actual tables... I'll practice in my test project :) I'll update when the table is up.