atorus-research / metacore

https://atorus-research.github.io/metacore/

Any thoughts on how to expand the metacore data model to other dimensions #51

Open feigs opened 2 years ago

feigs commented 2 years ago

The idea of centralizing metadata into a relational data structure is a great one, but the current model only accounts for a "slice" of the metadata that could be stored within an R session. Think of one metacore object per dataset type (e.g. SDTM vs. ADaM), task, study, etc. One could of course just loop over the different dimensions and store individual metacore objects in a list, but some of this information could also be natively incorporated into the model. Are you currently discussing how to expand the structure to account for these other dimensions, or is this outside the scope of the project?

statasaurus commented 2 years ago

I think for now we want to keep metacore specific to datasets. But we would like to increase the number of metadata packages so more automation can be driven by metadata. I am currently working on a separate package called tfrmt, which creates an object to store display metadata and applies that metadata when a dataset is available.

Additionally, I would love to build out the derivations table of metacore to make it easier to apply simple derivations like BMI or other straightforward calculations.

Out of curiosity, what tasks are you thinking about?

feigs commented 2 years ago

Hi Christina, thanks for the quick reply. I was also thinking of metadata for TLF / output generation (which I understand you are currently developing with tfrmt). But in general other possibilities come to mind, like specifying metadata for recurring tasks (oversight, safety reports, DSUR, integrating data from multiple studies, etc). In the end, you could also store this information in a separate table, which would be linked to the existing metacore structure (this task depends on datasets abc, produces outputs xyz, the output contains variables efg, has formatting..., etc). That's more or less the direction I was thinking. I think "metadata" is a broad term and I agree with you, maybe it makes more sense to split different metadata types into separate packages. But the scaling question cannot be completely dismissed. What do you do if you have 30 different studies for one asset? Do you create 30 independent metacore objects?
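
The linked-table idea described above (a task depends on datasets and produces outputs) could look roughly like the following. This is only a minimal sketch in Python, not metacore code: `TaskSpec`, `tasks_using`, and the task, dataset, and output names are all hypothetical and chosen purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """One row of a hypothetical task table linked to dataset specs by name."""
    task: str
    depends_on: tuple[str, ...]  # dataset names the task reads
    produces: tuple[str, ...]    # identifiers of the outputs it creates


def tasks_using(tasks: list[TaskSpec], dataset: str) -> list[str]:
    """Return the names of all tasks that depend on the given dataset."""
    return [t.task for t in tasks if dataset in t.depends_on]


# Illustrative rows only; these names are assumptions, not real specifications.
tasks = [
    TaskSpec("DSUR", ("ADSL", "ADAE"), ("t_ae_summary",)),
    TaskSpec("safety_report", ("ADAE",), ("l_ae_listing",)),
]

print(tasks_using(tasks, "ADAE"))  # ['DSUR', 'safety_report']
```

Keeping the task table separate and joining on dataset names leaves the dataset specification itself untouched, which matches the "separate table, linked to the existing structure" framing.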

statasaurus commented 2 years ago

That would be my gut reaction.

In order to build out some of the other things, like safety reports etc., I wonder if we could build a different object that would interact with metacore objects. But I think I need to understand a bit more about the particular changes before making anything.

mstackhouse commented 2 years ago

This is an interesting problem - but @feigs, a metacore object itself is basically a single slice of the specifications for a particular deliverable. This is driven by existing data that a company would have on hand to support that CDISC deliverable.

You're essentially asking for a versioning mechanism, which would kind of act like a layer on top of a metacore object. For a DMC, DSUR, etc., you'd have a metacore object for each of those that contains the SDTM or ADaM metadata for each deliverable. This kind of scales into a larger database structure of protocol -> deliverable -> type (i.e. SDTM, ADaM, and then there's realistically more TFL metadata) -> which leads down to the metacore object.
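
The protocol -> deliverable -> type -> object layering described above can be sketched as a small registry keyed by those three levels. This is a hedged illustration in Python rather than anything that exists in metacore: `SpecRegistry` and its methods are hypothetical, and the stored `spec` is treated as an opaque object standing in for one metacore object.

```python
class SpecRegistry:
    """Hypothetical layer on top of per-deliverable specification objects."""

    def __init__(self):
        # Maps (protocol, deliverable, data_type) -> an opaque spec object.
        self._specs = {}

    def register(self, protocol, deliverable, data_type, spec):
        """Store one spec object under its three-level key."""
        self._specs[(protocol, deliverable, data_type)] = spec

    def get(self, protocol, deliverable, data_type):
        """Return the spec for one slice; raises KeyError if it is absent."""
        return self._specs[(protocol, deliverable, data_type)]

    def deliverables(self, protocol):
        """List the distinct deliverables registered for a protocol, sorted."""
        return sorted({d for (p, d, _) in self._specs if p == protocol})


# Placeholder strings stand in for real specification objects.
registry = SpecRegistry()
registry.register("STUDY01", "DMC", "ADaM", "adam-spec-for-dmc")
registry.register("STUDY01", "DSUR", "ADaM", "adam-spec-for-dsur")
registry.register("STUDY01", "DSUR", "SDTM", "sdtm-spec-for-dsur")

print(registry.deliverables("STUDY01"))  # ['DMC', 'DSUR']
```

Each leaf stays a single slice, as described, and the registry only answers the question of which slice to load for a given program.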

At a larger company scale, this is essentially importing data from an MDR into an R session. So we'd definitely need input on how to scale this. That said, I could see value in a higher-level object that talks with different metacore-type objects to query out the metadata that you need in a program.

feigs commented 1 year ago

Hi @mstackhouse and @statasaurus, since this issue is still open, I would like to give my thoughts on this subject after using metacore for some months. I think, as @mstackhouse has pointed out, metacore cannot be used as a substitute for a full-fledged MDR and thus cannot account for every possible type of metadata. Maybe a higher-level object could handle metacore and other types of metadata objects. And I know you have been working on other exciting features (table metadata visualization, etc).

That being said, we have faced some challenges in using metacore objects across projects (ADaM, SDTM, quasi-CDISC datasets), due to the somewhat narrow definition of the metacore object. Whenever we add more variables to the preset tables (e.g. var_spec), we need to perform our own tests, since the metacore tests are no longer performed. There might be a way of testing the integrity of the preset variables (say variable, label, length, type, common and format) and not testing the newly added variables, but so far so good.

The main issue for us is that you cannot create a metacore object (data model) that contains additional tables, which are usually necessary to produce the datasets themselves (e.g. global information like cut-off dates, paths, project-related definitions, etc). Basically, one cannot extend the metacore data model. I thought this would be possible with metatools, but that is currently not the case. I understand this is a design decision. The current workaround would be to create another R6 object that contains a metacore object and additional metadata.

Is it just me trying to extend the tool beyond the use case it was created for, or do you think it makes sense to cover this kind of use case? Do you get this kind of feedback from other users?
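
To make the two ideas above concrete (checking only the preset columns, and wrapping the core tables together with additional ones), here is a minimal sketch. It is written in Python for brevity and is not the R6 workaround itself or any metacore API: `check_preset_columns`, `ExtendedSpec`, and the example table contents are hypothetical. Only the preset column names are taken from the comment above.

```python
# Preset var_spec columns as listed in the comment above.
PRESET_VAR_SPEC_COLUMNS = ("variable", "label", "length", "type", "common", "format")


def check_preset_columns(rows, required=PRESET_VAR_SPEC_COLUMNS):
    """Raise ValueError if any row lacks a preset column; extra columns are ignored."""
    for i, row in enumerate(rows):
        missing = [c for c in required if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing preset columns: {missing}")


class ExtendedSpec:
    """Hypothetical container: a validated core table plus free-form extra tables."""

    def __init__(self, var_spec, extra_tables=None):
        check_preset_columns(var_spec)  # user-added columns pass through untested
        self.var_spec = var_spec
        self.extra_tables = dict(extra_tables or {})

    def table(self, name):
        """Return an additional table by name; raises KeyError if it is absent."""
        return self.extra_tables[name]


# Illustrative data: "source" is a user-added column, "globals" an extra table.
var_spec = [{"variable": "AGE", "label": "Age", "length": 8, "type": "integer",
             "common": True, "format": None, "source": "DM.AGE"}]
spec = ExtendedSpec(var_spec, {"globals": {"cutoff_date": "2023-01-31"}})

print(spec.table("globals")["cutoff_date"])  # 2023-01-31
```

The design choice here is composition: the core table keeps its integrity check, while anything project-specific lives beside it rather than inside it.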

mstackhouse commented 1 year ago

Hi @feigs - I think it would help for you to maybe mock up some code examples to show us how you'd like metacore to be extended. We've discussed extensibility in metacore before, but we need to understand more about how the users would like it to be extensible, so any feedback that you have here would definitely help.

Furthermore, like Christina mentioned, there are some specific metadata use cases we've seen which drove the development of the package tfrmt. But we also note that there are viable use cases for things like titles and footnotes that need to be driven from external data like databases or spreadsheets. The lane that we're trying to stick to for metacore is that it is a container for external metadata, which can then be consumed by other packages like metatools or xportr to carry out actions on those metadata.

Ultimately, I'm not opposed to creating separate R6 objects based on different use cases, or opening up extensibility of existing objects to add in additional non-standard tables. So let us know what / how you might like to see that implemented.