cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License
91 stars 25 forks source link

Use SchemaView method to get list of slots for a class #274

Closed sujaypatil96 closed 2 years ago

sujaypatil96 commented 2 years ago

On https://github.com/turbomam/DataHarmonizer/blob/issue-267-linkml-collab/script/linkml.py#L95, we should use a SchemaView method, class_slots() instead of looking for slots as a key in the dictionary.

Happy to implement this change if the above sounds good.

CC: @ddooley

ddooley commented 2 years ago

Sounds good if there's no side effects on the data being generated!

ddooley commented 2 years ago

Note that approving Mark's pull request has moved linkml.py up to /script/ folder in the cidgo repo.

ddooley commented 2 years ago

So I think I finally understand this. A number of the classes have entirely inherited slots, they don't define any themselves, and the previous line was failing to anticipate that. So your change now yeilds about 289 specifications rather than the old code that generated 35. And schema.js for MIxS is 32mb instead of 3.4mb . I can see my browser doesn't mind but any concerns there?!

ddooley commented 2 years ago

So if above classes are inheriting all slots, then it raises the question of whether we just want to have that inheritance encoded in some way that it is done browser side? Then maybe schema.js doesn't have to be nearly 32mb for MIxS for example.

sujaypatil96 commented 2 years ago

Thank you for the analysis @ddooley. Oh, a 32 MB JSON object is pretty big, but I haven't had my browser complain or crash either, so we should be good?

So if above classes are inheriting all slots, then it raises the question of whether we just want to have that inheritance encoded in some way that it is done browser side?

This might be a good use case for the new gen-typescript utility that's bundled with the latest 1.2.x linkml releases. @cmungall will be able to give better advice on this.

cmungall commented 2 years ago

I am not sure gen-typescript helps, it doesn't generate logic, only type hints/checks

but maybe you mean the full linkml.js framework? It's currently alpha so I wouldn't rely on it

but yes if I understand then broadly this is a classic tradeoff

  1. implement logic centrally and distribute materialized entailments
  2. implement logic downstream and distribute only source

For 1 you of course trade simplicity for atrtefact size. But I agree 32m is not so bad (and could be compressed)

For 2 you need to expend more resources writing possibly duplicative code, especially for a new language (this is where linkml.js would help - you could just use its SchemaView class and work directly with linkml schemas, no pre-processing required). See https://github.com/cmungall/linkml-runtime.js/blob/main/src/SchemaView.ts

you could also do a hybrid model

this may help: https://linkml.io/linkml/developers/tool-developer-guide.html

ddooley commented 2 years ago

It does sound like a full linkml.js implementation would handle this. All I recall is that for example a single "temperature" slot in a slot_definition is inherited 129 times in various classes, so if we just have one full specification for the temperature slot say, then in browser that could be either looked up dynamically, or instantiated across all the schema classes that it finds itself in.

cmungall commented 2 years ago

temperature may be required for one package, recommended for another, and neither for a third

ddooley commented 2 years ago

Understood about the extra parameter info that a particular package has about a generic temperature slot definition (required, recommended, etc). I recall the temperature slot definition is defined on its own once, and then other packages reference it, and this gets instantiated as a slot definition with each package's extra parameters added on. So I was probing whether we can just do that same thing on browser side rather than on [edit] server side. How complex is the code right now in LinkML for generating a slot view that implements this inheritance? The line that I think is capable of doing this is in the linkml-datastructure version circa line 971 of /script/data-harmonizer/index.js :

    let field =  Object.assign({}, self.schema.slots[name], specification_slots[name], specification_slot_usage[name]);

which basically starts with the generic self.schema.slots[name], then adds on / overrides parapeters with specification_slots[name], and finally specification_slot_usage[name]);

ddooley commented 2 years ago

DH is now using SchemaView() entirely, so above comment is unnecessary. Can revisit this if/when SchemaView has a javascript version.