Add `includedInDataset` property to all profiles

ivanmicetic commented 2 years ago

Markup harvested from data records (such as Gene, Protein, or ChemicalSubstance, whether from dumps or individual pages) loses its connection to the Dataset profile (and its parent DataCatalog) unless the includedInDataset property is present. However, none of the data record profiles mention includedInDataset.

The suggestion is to add the includedInDataset property to all data record profiles and to expand the documentation on how to use it.

As for the marginality, I propose making it minimum unless there are use cases where data records are not associated with a Dataset/DataCatalog.
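
To make the proposal concrete, here is a minimal Python sketch of a harvested record that carries the link, plus a helper that reads it back. The record, its IRIs, and the `dataset_id` helper are all hypothetical; the sketch assumes the markup has already been parsed into plain dicts and says nothing about which JSON-LD context the term would resolve in.

```python
from typing import Optional

# Hypothetical, already-parsed markup for a single Protein record.
# The example.org IRIs are placeholders, not real identifiers.
protein = {
    "@type": "Protein",
    "@id": "https://example.org/protein/P12345",
    "name": "Example protein",
    "includedInDataset": {
        "@type": "Dataset",
        "@id": "https://example.org/dataset/proteins",
    },
}


def dataset_id(record: dict) -> Optional[str]:
    """Return the @id of the Dataset a record points to, or None if absent."""
    value = record.get("includedInDataset")
    if isinstance(value, list):
        # Several datasets may be listed; this sketch only looks at the first.
        value = value[0] if value else None
    if isinstance(value, dict):
        return value.get("@id")
    return value  # a bare IRI string, or None when the property is missing
```

Without the property, a consumer holding only the record has nothing to follow back to the Dataset, which is the gap described above.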

AlasdairGray commented 1 year ago

Completely agree. First step would be to draw up a concrete list of the profiles (and types) that need this property added.

We should also look at the citation property which will need its domain extended to include several Bioschemas types.

marco-brandizi commented 1 year ago

From a practical point of view, it's best to make this recommended rather than mandatory, and also to establish that a link in one direction only is fine. That is, a property going from the dataset to its entities, such as schema:about, would be enough even if the other direction (e.g., schema:isSubjectOf) isn't available (schema:citation seems to be designed for a different purpose).

I'm mainly thinking of cases where the dataset/source is implicitly always the same in a given context, e.g., an API serving a single dataset, or all the pages in a website. Adding a dynamically generated provenance annotation to resources like these is easy, but it might create overhead for data consumers, just as it would for a static data dump (where everything has the same provenance).
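
A rough Python sketch of how a consumer might cope with links in either direction, plus an implicit single-dataset context. The function names, the `default` parameter, and the sample data are assumptions made for illustration; records and datasets are taken to be already-parsed dicts.

```python
from typing import Iterable, List, Optional


def _ids(value) -> List[str]:
    """Normalise a value (IRI string, node dict, or list of either) to @id strings."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    out = []
    for item in items:
        if isinstance(item, dict):
            if "@id" in item:
                out.append(item["@id"])
        else:
            out.append(item)
    return out


def resolve_dataset(record: dict, datasets: Iterable[dict],
                    default: Optional[str] = None) -> Optional[str]:
    """Find a record's dataset: record-to-dataset link first, then
    dataset-to-record (`about`), then the implicit context default."""
    direct = _ids(record.get("includedInDataset"))
    if direct:
        return direct[0]
    for ds in datasets:
        if record.get("@id") in _ids(ds.get("about")):
            return ds.get("@id")
    return default
```

For a static dump or a single-dataset API, passing the dataset IRI once as `default` avoids repeating the same annotation on every record.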

ivanmicetic commented 1 year ago

List of profiles that should have `includedInDataset` property:

| name | latest_release | latest_publication | |
|---|---|---|---|
| `BioSample` | | 0.1-DRAFT-2019_11_12 | should have `includedInDataset` |
| `ChemicalSubstance` | 0.4-RELEASE | | should have `includedInDataset` |
| `ComputationalTool` | 1.0-RELEASE | 1.1-DRAFT | should have `includedInDataset` |
| `ComputationalWorkflow` | 1.0-RELEASE | | should have `includedInDataset` |
| `Course` | 1.0-RELEASE | | should have `includedInDataset` |
| `CourseInstance` | 1.0-RELEASE | | |
| `DataCatalog` | 0.3-RELEASE-2019_07_01 | 0.4-DRAFT | |
| `Dataset` | 1.0-RELEASE | 1.1-DRAFT | |
| `Disease` | | 0.2-DRAFT | should have `includedInDataset` |
| `Event` | | 0.3-DRAFT | should have `includedInDataset` |
| `FormalParameter` | 1.0-RELEASE | 1.1-DRAFT | |
| `Gene` | 1.0-RELEASE | 1.2-DRAFT | should have `includedInDataset` |
| `Journal` | | 0.3-DRAFT | should have `includedInDataset` |
| `LabProtocol` | | 0.7-DRAFT | should have `includedInDataset` |
| `MolecularEntity` | 0.5-RELEASE | 0.6-DRAFT | should have `includedInDataset` |
| `Organization` | | 0.3-DRAFT | should have `includedInDataset` |
| `Person` | | 0.3-DRAFT | should have `includedInDataset` |
| `Phenotype` | | 0.2-DRAFT | should have `includedInDataset` |
| `Protein` | 0.11-RELEASE | 0.12-DRAFT | should have `includedInDataset` |
| `ProteinStructure` | | 0.6-DRAFT | should have `includedInDataset` |
| `PublicationIssue` | | 0.3-DRAFT | |
| `PublicationVolume` | | 0.3-DRAFT | |
| `RNA` | | 0.2-DRAFT | should have `includedInDataset` |
| `Sample` | 0.2-RELEASE-2018_11_10 | | should have `includedInDataset` |
| `ScholarlyArticle` | | 0.3-DRAFT | |
| `SemanticTextAnnotation` | | 0.3-DRAFT | |
| `SequenceAnnotation` | | 0.7-DRAFT | |
| `SequenceRange` | | 0.2-DRAFT | |
| `Study` | | 0.3-DRAFT | should have `includedInDataset` |
| `Taxon` | 0.6-RELEASE | 0.8-DRAFT | |
| `TaxonName` | | 0.2-DRAFT | |
| `TrainingMaterial` | 1.0-RELEASE | 1.1-DRAFT | should have `includedInDataset` |

List of types that should have `includedInDataset` property:

| name | latest_release | latest_publication | |
|---|---|---|---|
| `BioChemEntity` | 0.7-RELEASE-2019_06_19 | | should have `includedInDataset` |
| `BioChemStructure` | | 0.1-DRAFT-2019_06_20 | should have `includedInDataset` |
| `BioSample` | 0.1-RELEASE-2019_06_19 | | should have `includedInDataset` |
| `ChemicalSubstance` | 0.3-RELEASE-2019_09_02 | | should have `includedInDataset` |
| `ComputationalWorkflow` | 1.0-RELEASE | | should have `includedInDataset` |
| `DNA` | | 0.2-DRAFT-2019_06_20 | should have `includedInDataset` |
| `Enzyme` | | 0.1-DRAFT-2019_06_20 | should have `includedInDataset` |
| `FormalParameter` | 1.0-RELEASE | | should have `includedInDataset` |
| `Gene` | 0.3-RELEASE-2019_09_02 | | should have `includedInDataset` |
| `LabProtocol` | | 0.3-DRAFT-2019_06_20 | should have `includedInDataset` |
| `MolecularEntity` | 0.3-RELEASE-2019_09_02 | | should have `includedInDataset` |
| `Phenotype` | | 0.3-DRAFT-2020_06_07 | should have `includedInDataset` |
| `Protein` | 0.3-RELEASE-2019_09_02 | | should have `includedInDataset` |
| `RNA` | | 0.1-DRAFT-2019_06_21 | should have `includedInDataset` |
| `Sample` | | 0.2-DRAFT-2018_11_09 | should have `includedInDataset` |
| `SequenceAnnotation` | | 0.1-DRAFT-2019_06_21 | |
| `SequenceMatchingModel` | | 0.1-DRAFT-2019_06_21 | |
| `SequenceRange` | | 0.1-DRAFT-2019_06_21 | |
| `Study` | | 0.3-DRAFT | should have `includedInDataset` |
| `Taxon` | 0.3-RELEASE-2019_11_18 | 0.4-DRAFT | |
| `TaxonName` | | 0.1-DRAFT | |

Basically, all "top level" classes (like Protein) should have the includedInDataset property, while classes which exist only as nested properties (like SequenceAnnotation or SequenceRange) should not. I believe that we don't have an automatic way of distinguishing top level classes from nested ones...
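
To illustrate why this is hard to automate, here is a naive Python heuristic that treats any type used as the value of another type's property as a nested candidate. The `ranges` mapping is a small hand-written illustration, not extracted from the actual specifications, and `nested_candidates` is a hypothetical helper.

```python
from typing import Dict, List, Set


def nested_candidates(ranges: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    """Return the types that occur as a property value of some other type."""
    found = set()
    for owner, props in ranges.items():
        for targets in props.values():
            # Ignore self-references; they say nothing about nesting.
            found.update(t for t in targets if t != owner)
    return found


# Illustrative property ranges only (assumed for this sketch).
ranges = {
    "Protein": {"hasSequenceAnnotation": ["SequenceAnnotation"]},
    "SequenceAnnotation": {"sequenceLocation": ["SequenceRange"]},
    "Gene": {"encodesBioChemEntity": ["Protein"]},
}

candidates = nested_candidates(ranges)
```

With this input the heuristic flags SequenceAnnotation and SequenceRange, but also Protein, because a Gene can point to it. A type can be both a top level record and a nested value, so the output would still need manual review.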

gtsueng commented 1 year ago

For profiles based on Schema.org types like ComputationalTool (based on schema:SoftwareApplication) -- this would mean adding a new property to a profile that is not available in the type, correct? At what point do we consider creating a new type? ComputationalTool already has input and output which are not in schema:SoftwareApplication. Should we be pushing these new properties to schema.org?

ljgarcia commented 1 year ago

@gtsueng for anything new in schema.org we will have to prove that there are consumers and not only providers. For now, new properties will have to live in the Bioschemas context. Likewise, schema.org properties with new usage will not be parsed (by search engines) but will not cause trouble (it will be a warning rather than an error). @albangaignard they still should be considered for Bioschemas validation.

@ivanmicetic will you take care of updating the profiles? If there are no objections, I would delay this and combine the updates with DefinedTerm. We worked on a list with @nsjuty. I will update that list in the corresponding issue https://github.com/BioSchemas/specifications/discussions/618#discussioncomment-5947631

ljgarcia commented 1 year ago

@ivanmicetic will you take care of updating the profiles together with DefinedTerm? The list of profiles and properties that could benefit from DefinedTerm is at https://github.com/BioSchemas/specifications/discussions/618#discussioncomment-5947631. I would suggest first updating the properties from schema.org, as some properties now have DefinedTerm and we have not updated them. Please double check which profiles modify the range of any property (we need an automatic way to do this, updating from schema.org while keeping the customizations put in place in Bioschemas, but I'm not sure we want to delay this issue for that). Thanks.

Add `includedInDataset` property to all profiles #617