BioSchemas / specifications

Issue tracker, technical wiki, and example markup
https://bioschemas.org

Add `includedInDataset` property to all profiles #617

Open ivanmicetic opened 1 year ago

ivanmicetic commented 1 year ago

Getting markup from data records (like Gene, Protein, ChemicalSubstance..., from dumps or individual pages) loses the connection to the Dataset profile (and its parent DataCatalog) unless the includedInDataset property is present. However, the data record profiles make no mention of the includedInDataset property.

The suggestion is to add the includedInDataset property to all data record profiles and to expand the documentation on how to use it.

As for the marginality, I am proposing to make it minimum unless there are use cases of data records not associated with a Dataset/DataCatalog.
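
To make the proposal concrete, here is a minimal Python sketch of what a data record carrying the back-link might look like when serialised as JSON-LD. The `example.org` identifiers and the helper name `protein_record` are hypothetical, and the plain schema.org `@context` is only an assumption, since the thread has not settled where `includedInDataset` would be defined.

```python
import json

# Hypothetical identifiers: the example.org URLs are placeholders, not real records.
DATASET_ID = "https://example.org/dataset/proteins"
CATALOG_ID = "https://example.org/catalog"


def protein_record(record_id, name):
    """Build JSON-LD for one Protein record that points back to its Dataset."""
    return {
        # Assumption: a plain schema.org context; the final context is undecided.
        "@context": "https://schema.org/",
        "@type": "Protein",
        "@id": record_id,
        "name": name,
        # Property proposed in this issue, linking the record to its Dataset.
        "includedInDataset": {
            "@type": "Dataset",
            "@id": DATASET_ID,
            # Existing schema.org property linking a Dataset to its DataCatalog.
            "includedInDataCatalog": {"@type": "DataCatalog", "@id": CATALOG_ID},
        },
    }


record = protein_record("https://example.org/protein/P12345", "Example protein")
print(json.dumps(record, indent=2))
```

With this in place, a consumer that harvests only the record page can still reach the Dataset and, through it, the DataCatalog.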

AlasdairGray commented 1 year ago

Completely agree. First step would be to draw up a concrete list of the profiles (and types) that need this property added.

We should also look at the citation property which will need its domain extended to include several Bioschemas types.

marco-brandizi commented 1 year ago

From a practical point of view, it's best to have this as recommended rather than mandatory, and also to establish that one direction alone is fine, i.e., a property going from the dataset to its entities, such as schema:about, would be enough even if the other direction (e.g., schema:isSubjectOf) isn't available (schema:citation seems to be designed for a different purpose).

I'm mainly thinking of cases where the dataset/source is implicitly always the same in a given context, e.g., an API serving a single dataset, or all the pages in a web site. Adding a dynamically generated provenance annotation to resources like these is easy, but it might create overhead for data consumers, just as it would for a static data dump (where everything has the same provenance).
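
As a rough illustration of why one direction can be enough, the sketch below shows how a consumer could recover the record-to-dataset link from the `about` entries of a Dataset. The function name `dataset_index` and the sample identifiers are hypothetical; this is not existing Bioschemas tooling.

```python
def dataset_index(datasets):
    """Map each record @id to the @id of the Dataset that lists it under `about`."""
    index = {}
    for ds in datasets:
        about = ds.get("about", [])
        # JSON-LD allows a single value where a list is also permitted.
        if isinstance(about, (dict, str)):
            about = [about]
        for entity in about:
            # An entity may be an embedded node or a bare identifier string.
            entity_id = entity.get("@id") if isinstance(entity, dict) else entity
            if entity_id:
                index[entity_id] = ds["@id"]
    return index


# Hypothetical dataset description with two records listed under `about`.
datasets = [
    {
        "@type": "Dataset",
        "@id": "https://example.org/dataset/proteins",
        "about": [
            {"@id": "https://example.org/protein/P1"},
            "https://example.org/protein/P2",
        ],
    }
]
index = dataset_index(datasets)
```

The trade-off raised above still applies: the consumer has to fetch and index the Dataset description first, which is the overhead an explicit back-link would avoid.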

ivanmicetic commented 1 year ago

List of profiles that should have includedInDataset property:

| name | latest_release | latest_publication | notes |
| --- | --- | --- | --- |
| BioSample | | 0.1-DRAFT-2019_11_12 | should have includedInDataset |
| ChemicalSubstance | 0.4-RELEASE | | should have includedInDataset |
| ComputationalTool | 1.0-RELEASE | 1.1-DRAFT | should have includedInDataset |
| ComputationalWorkflow | 1.0-RELEASE | | should have includedInDataset |
| Course | 1.0-RELEASE | | should have includedInDataset |
| CourseInstance | 1.0-RELEASE | | |
| DataCatalog | 0.3-RELEASE-2019_07_01 | 0.4-DRAFT | |
| Dataset | 1.0-RELEASE | 1.1-DRAFT | |
| Disease | | 0.2-DRAFT | should have includedInDataset |
| Event | | 0.3-DRAFT | should have includedInDataset |
| FormalParameter | 1.0-RELEASE | 1.1-DRAFT | |
| Gene | 1.0-RELEASE | 1.2-DRAFT | should have includedInDataset |
| Journal | | 0.3-DRAFT | should have includedInDataset |
| LabProtocol | | 0.7-DRAFT | should have includedInDataset |
| MolecularEntity | 0.5-RELEASE | 0.6-DRAFT | should have includedInDataset |
| Organization | | 0.3-DRAFT | should have includedInDataset |
| Person | | 0.3-DRAFT | should have includedInDataset |
| Phenotype | | 0.2-DRAFT | should have includedInDataset |
| Protein | 0.11-RELEASE | 0.12-DRAFT | should have includedInDataset |
| ProteinStructure | | 0.6-DRAFT | should have includedInDataset |
| PublicationIssue | | 0.3-DRAFT | |
| PublicationVolume | | 0.3-DRAFT | |
| RNA | | 0.2-DRAFT | should have includedInDataset |
| Sample | 0.2-RELEASE-2018_11_10 | | should have includedInDataset |
| ScholarlyArticle | | 0.3-DRAFT | |
| SemanticTextAnnotation | | 0.3-DRAFT | |
| SequenceAnnotation | | 0.7-DRAFT | |
| SequenceRange | | 0.2-DRAFT | |
| Study | | 0.3-DRAFT | should have includedInDataset |
| Taxon | 0.6-RELEASE | 0.8-DRAFT | |
| TaxonName | | 0.2-DRAFT | |
| TrainingMaterial | 1.0-RELEASE | 1.1-DRAFT | should have includedInDataset |


List of types that should have includedInDataset property:

| name | latest_release | latest_publication | notes |
| --- | --- | --- | --- |
| BioChemEntity | 0.7-RELEASE-2019_06_19 | | should have includedInDataset |
| BioChemStructure | | 0.1-DRAFT-2019_06_20 | should have includedInDataset |
| BioSample | 0.1-RELEASE-2019_06_19 | | should have includedInDataset |
| ChemicalSubstance | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| ComputationalWorkflow | 1.0-RELEASE | | should have includedInDataset |
| DNA | | 0.2-DRAFT-2019_06_20 | should have includedInDataset |
| Enzyme | | 0.1-DRAFT-2019_06_20 | should have includedInDataset |
| FormalParameter | 1.0-RELEASE | | should have includedInDataset |
| Gene | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| LabProtocol | | 0.3-DRAFT-2019_06_20 | should have includedInDataset |
| MolecularEntity | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| Phenotype | | 0.3-DRAFT-2020_06_07 | should have includedInDataset |
| Protein | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| RNA | | 0.1-DRAFT-2019_06_21 | should have includedInDataset |
| Sample | | 0.2-DRAFT-2018_11_09 | should have includedInDataset |
| SequenceAnnotation | | 0.1-DRAFT-2019_06_21 | |
| SequenceMatchingModel | | 0.1-DRAFT-2019_06_21 | |
| SequenceRange | | 0.1-DRAFT-2019_06_21 | |
| Study | | 0.3-DRAFT | should have includedInDataset |
| Taxon | 0.3-RELEASE-2019_11_18 | 0.4-DRAFT | |
| TaxonName | | 0.1-DRAFT | |

Basically, all "top level" classes (like Protein) should have the includedInDataset property, while classes that exist only as nested properties (like SequenceAnnotation or SequenceRange) should not. I believe we don't have an automatic way of telling top level classes from nested ones...
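
One possible heuristic, sketched here in Python purely as an assumption and not as existing Bioschemas tooling, is to scan example markup and flag the types that never occur at the root of a document. The function names and the sample documents are hypothetical, and the property keys in the sample are only illustrative.

```python
def collect_types(node, depth=0, seen=None):
    """Record the shallowest nesting depth at which each @type occurs."""
    if seen is None:
        seen = {}
    if isinstance(node, dict):
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        for t in types:
            seen[t] = min(depth, seen.get(t, depth))
        for key, value in node.items():
            # Skip keywords whose values are not nested data nodes.
            if key not in ("@type", "@context"):
                collect_types(value, depth + 1, seen)
    elif isinstance(node, list):
        # A list does not add a nesting level of its own.
        for item in node:
            collect_types(item, depth, seen)
    return seen


def nested_only(documents):
    """Return the types that appear in the documents but never at the root."""
    seen = {}
    for doc in documents:
        collect_types(doc, 0, seen)
    return sorted(t for t, d in seen.items() if d > 0)


# Hypothetical example markup; the property names are illustrative only.
docs = [
    {
        "@type": "Protein",
        "hasSequenceAnnotation": {
            "@type": "SequenceAnnotation",
            "sequenceLocation": {"@type": "SequenceRange"},
        },
    },
    {"@type": "Gene"},
]
candidates = nested_only(docs)
```

For the two sample documents, `candidates` holds SequenceAnnotation and SequenceRange. The result is only as good as the examples scanned, so the list would still need a manual review.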

gtsueng commented 1 year ago

For profiles based on Schema.org types like ComputationalTool (based on schema:SoftwareApplication) -- this would mean adding a new property to a profile that is not available in the type, correct? At what point do we consider creating a new type? ComputationalTool already has input and output which are not in schema:SoftwareApplication. Should we be pushing these new properties to schema.org?

ljgarcia commented 1 year ago

@gtsueng for anything new in schema.org we will have to prove that there are consumers and not only providers. For now, new properties will have to live in the Bioschemas context. The same goes for schema.org properties with new usage: they will not be parsed (by search engines) but will not cause trouble (it will be a warning rather than an error). @albangaignard they should still be considered for Bioschemas validation.

@ivanmicetic will you take care of updating the profiles? If there are no objections, I would delay this and combine the updates with DefinedTerm. We worked on a list with @nsjuty. I will update that list in the corresponding issue https://github.com/BioSchemas/specifications/discussions/618#discussioncomment-5947631

ljgarcia commented 1 year ago

@ivanmicetic will you take care of updating the profiles together with DefinedTerm? The list of profiles and properties that could benefit from DefinedTerm is at https://github.com/BioSchemas/specifications/discussions/618#discussioncomment-5947631 I would suggest first updating the properties from schema.org, as some properties now have DefinedTerm and we have not updated them. Please double check which profiles are modifying the range of any property (we need an automatic way to do this, updating from schema.org while keeping the customizations put in place in Bioschemas, but I am not sure we want to delay this issue for that). Thanks.
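
A check for range customizations could start from something like the sketch below. It assumes, hypothetically, that both the upstream schema.org ranges and a profile's ranges are available as plain mappings from property name to a set of type names; the function name and the sample data are illustrative and do not reflect how the profiles are actually stored.

```python
def range_customizations(upstream, profile):
    """Report, per property, the range types a profile adds to or drops from upstream."""
    report = {}
    for prop, profile_range in profile.items():
        # A property missing upstream is treated as having an empty range.
        upstream_range = set(upstream.get(prop, ()))
        added = set(profile_range) - upstream_range
        removed = upstream_range - set(profile_range)
        if added or removed:
            report[prop] = {"added": sorted(added), "removed": sorted(removed)}
    return report


# Illustrative data only, not taken from any real profile.
upstream = {"keywords": {"DefinedTerm", "Text", "URL"}, "name": {"Text"}}
profile = {
    "keywords": {"Text"},
    "name": {"Text"},
    "input": {"FormalParameter"},
}
report = range_customizations(upstream, profile)
```

Properties whose range matches upstream are left out of the report, so anything listed is either a deliberate customization to keep or a range that has fallen behind schema.org.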