Open ivanmicetic opened 2 years ago
Completely agree. First step would be to draw up a concrete list of the profiles (and types) that need this property added.
We should also look at the citation property, which will need its domain extended to include several Bioschemas types.
From a practical point of view, it's best to have this as recommended and not mandatory, and also to establish that one direction only is fine, i.e., a property going from the dataset to its entities, such as schema:about, would be enough, even if the other direction (e.g., schema:isSubjectOf) isn't available (schema:citation seems to be designed for a different purpose).
I'm mainly thinking of those cases where the dataset/source is implicitly always the same in a given context, e.g., an API serving a single dataset, or all the pages in a web site. Adding a dynamically generated provenance annotation to resources like these is easy, but it might create overhead for data consumers, just as it would for a static data dump (where everything has the same provenance).
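To illustrate why one direction may be enough, here is a minimal sketch of how a consumer could derive the record-to-dataset link from a dataset-level about list. The node shapes, the example.org identifiers and the function name invert_about are assumptions made for illustration, not part of any Bioschemas tooling.

```python
def invert_about(nodes):
    """Derive record -> dataset links from Dataset -> record `about` links.

    `nodes` is a list of plain dicts shaped like JSON-LD nodes (assumed shape).
    Returns a dict mapping each record @id to the list of dataset @ids
    that mention it through `about`.
    """
    derived = {}
    for node in nodes:
        if node.get("@type") != "Dataset":
            continue
        about = node.get("about", [])
        if isinstance(about, dict):  # tolerate a single object instead of a list
            about = [about]
        for target in about:
            derived.setdefault(target["@id"], []).append(node["@id"])
    return derived


# Hypothetical data: one dataset that points at two of its records.
nodes = [
    {
        "@id": "https://example.org/dataset/1",
        "@type": "Dataset",
        "about": [
            {"@id": "https://example.org/protein/P1"},
            {"@id": "https://example.org/gene/G1"},
        ],
    },
    {"@id": "https://example.org/protein/P1", "@type": "Protein"},
]

links = invert_about(nodes)
```

This only works when the consumer sees the dataset node together with the records, which is the single-source case described above.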
List of profiles that should have includedInDataset property:

| name | latest_release | latest_publication | notes |
|---|---|---|---|
| BioSample | | 0.1-DRAFT-2019_11_12 | should have includedInDataset |
| ChemicalSubstance | 0.4-RELEASE | | should have includedInDataset |
| ComputationalTool | 1.0-RELEASE | 1.1-DRAFT | should have includedInDataset |
| ComputationalWorkflow | 1.0-RELEASE | | should have includedInDataset |
| Course | 1.0-RELEASE | | should have includedInDataset |
| CourseInstance | 1.0-RELEASE | | |
| DataCatalog | 0.3-RELEASE-2019_07_01 | 0.4-DRAFT | |
| Dataset | 1.0-RELEASE | 1.1-DRAFT | |
| Disease | | 0.2-DRAFT | should have includedInDataset |
| Event | | 0.3-DRAFT | should have includedInDataset |
| FormalParameter | 1.0-RELEASE | 1.1-DRAFT | |
| Gene | 1.0-RELEASE | 1.2-DRAFT | should have includedInDataset |
| Journal | | 0.3-DRAFT | should have includedInDataset |
| LabProtocol | | 0.7-DRAFT | should have includedInDataset |
| MolecularEntity | 0.5-RELEASE | 0.6-DRAFT | should have includedInDataset |
| Organization | | 0.3-DRAFT | should have includedInDataset |
| Person | | 0.3-DRAFT | should have includedInDataset |
| Phenotype | | 0.2-DRAFT | should have includedInDataset |
| Protein | 0.11-RELEASE | 0.12-DRAFT | should have includedInDataset |
| ProteinStructure | | 0.6-DRAFT | should have includedInDataset |
| PublicationIssue | | 0.3-DRAFT | |
| PublicationVolume | | 0.3-DRAFT | |
| RNA | | 0.2-DRAFT | should have includedInDataset |
| Sample | 0.2-RELEASE-2018_11_10 | | should have includedInDataset |
| ScholarlyArticle | | 0.3-DRAFT | |
| SemanticTextAnnotation | | 0.3-DRAFT | |
| SequenceAnnotation | | 0.7-DRAFT | |
| SequenceRange | | 0.2-DRAFT | |
| Study | | 0.3-DRAFT | should have includedInDataset |
| Taxon | 0.6-RELEASE | 0.8-DRAFT | |
| TaxonName | | 0.2-DRAFT | |
| TrainingMaterial | 1.0-RELEASE | 1.1-DRAFT | should have includedInDataset |


List of types that should have includedInDataset property:

| name | latest_release | latest_publication | notes |
|---|---|---|---|
| BioChemEntity | 0.7-RELEASE-2019_06_19 | | should have includedInDataset |
| BioChemStructure | | 0.1-DRAFT-2019_06_20 | should have includedInDataset |
| BioSample | 0.1-RELEASE-2019_06_19 | | should have includedInDataset |
| ChemicalSubstance | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| ComputationalWorkflow | 1.0-RELEASE | | should have includedInDataset |
| DNA | | 0.2-DRAFT-2019_06_20 | should have includedInDataset |
| Enzyme | | 0.1-DRAFT-2019_06_20 | should have includedInDataset |
| FormalParameter | 1.0-RELEASE | | should have includedInDataset |
| Gene | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| LabProtocol | | 0.3-DRAFT-2019_06_20 | should have includedInDataset |
| MolecularEntity | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| Phenotype | | 0.3-DRAFT-2020_06_07 | should have includedInDataset |
| Protein | 0.3-RELEASE-2019_09_02 | | should have includedInDataset |
| RNA | | 0.1-DRAFT-2019_06_21 | should have includedInDataset |
| Sample | | 0.2-DRAFT-2018_11_09 | should have includedInDataset |
| SequenceAnnotation | | 0.1-DRAFT-2019_06_21 | |
| SequenceMatchingModel | | 0.1-DRAFT-2019_06_21 | |
| SequenceRange | | 0.1-DRAFT-2019_06_21 | |
| Study | | 0.3-DRAFT | should have includedInDataset |
| Taxon | 0.3-RELEASE-2019_11_18 | 0.4-DRAFT | |
| TaxonName | | 0.1-DRAFT | |

Basically all "top level" classes should have the includedInDataset property (like Protein), while classes which exist only as nested properties should not have it (like SequenceAnnotation or SequenceRange). I believe that we don't have an automatic way of identifying top level classes from nested ones...
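One possible heuristic, sketched below under heavy assumptions: treat a class as nested if it appears in the range of some other class's property, and as top level otherwise. The toy ranges table and its property names are hypothetical and are not taken from the actual specifications. The heuristic would also misclassify any class that is used both standalone and nested, so a manual review would still be needed.

```python
def split_top_level(ranges):
    """Split classes into (top_level, nested) using declared property ranges.

    `ranges` maps a class name to {property name: [range class names]}.
    A class counts as nested if another class lists it in a property range.
    """
    nested = set()
    for owner, props in ranges.items():
        for targets in props.values():
            # Self-references do not make a class nested.
            nested.update(t for t in targets if t != owner)
    top_level = sorted(name for name in ranges if name not in nested)
    return top_level, sorted(nested & set(ranges))


# Hypothetical, simplified ranges for illustration only.
toy_ranges = {
    "Protein": {"hasSequenceAnnotation": ["SequenceAnnotation"]},
    "SequenceAnnotation": {"sequenceLocation": ["SequenceRange"]},
    "SequenceRange": {},
}

top_level, nested = split_top_level(toy_ranges)
```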
For profiles based on Schema.org types like ComputationalTool (based on schema:SoftwareApplication), this would mean adding a new property to a profile that is not available in the type, correct? At what point do we consider creating a new type? ComputationalTool already has input and output, which are not in schema:SoftwareApplication. Should we be pushing these new properties to schema.org?
@gtsueng for anything new in schema.org we will have to prove that there are consumers and not only providers. For now, new properties will have to live in the Bioschemas context. Likewise, schema.org properties used in a new way will not be parsed (by search engines) but will not cause trouble either (it will be a warning rather than an error). @albangaignard they should still be considered for Bioschemas validation.
@ivanmicetic will you take care of updating the profiles? If there are no objections, I would delay this and combine the updates with DefinedTerm. We worked on a list with @nsjuty. I will update that list in the corresponding issue https://github.com/BioSchemas/specifications/discussions/618#discussioncomment-5947631
@ivanmicetic will you take care of updating the profiles together with DefinedTerm? The list of profiles and properties that could benefit from DefinedTerm is at https://github.com/BioSchemas/specifications/discussions/618#discussioncomment-5947631 I would suggest first updating the properties from schema.org, as some properties now have DefinedTerm and we have not updated them. Please double check which profiles are modifying the range for any property (we need an automatic way to do this, updating from schema.org while keeping the customizations put in place in Bioschemas, but I'm not sure we want to delay this issue for it). Thanks.
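As a rough idea of what such an automatic update could look like, here is a minimal sketch: anything in a profile's range that was not in the old schema.org range is treated as a Bioschemas customization and carried over on top of the new schema.org range. The function name merge_ranges, the data layout and the toy values are all assumptions; nothing like this is confirmed to exist in the Bioschemas tooling, and range removals are not handled.

```python
def merge_ranges(old_upstream, new_upstream, profile):
    """Refresh property ranges from upstream while keeping local additions.

    Each argument maps a property name to a list of range type names.
    Customization = types in the profile range absent from the old upstream range.
    """
    merged = {}
    for prop, current in profile.items():
        old = set(old_upstream.get(prop, []))
        new = set(new_upstream.get(prop, []))
        custom = set(current) - old
        merged[prop] = sorted(new | custom)
    return merged


# Toy data for illustration only, not the real history of these ranges.
old_upstream = {"keywords": ["Text"]}
new_upstream = {"keywords": ["DefinedTerm", "Text", "URL"]}
profile = {
    "keywords": ["Text"],
    "input": ["FormalParameter"],  # profile-only property, absent upstream
}

merged = merge_ranges(old_upstream, new_upstream, profile)
```

Comparing merged with profile afterwards would also give the list of profiles whose ranges changed, which is the double check asked for above.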
Getting markup from data records (like Gene, Protein, ChemicalSubstance..., from dumps or individual pages) loses the connection to the Dataset (and its parent DataCatalog) profile unless the includedInDataset property is present. However, we have no mention of the includedInDataset property in data record profiles.

The suggestion is to add the includedInDataset property to all data record profiles and to expand the documentation on it and its use.

As for the marginality, I am proposing to make it a minimum marginality unless there are use cases of data records not associated with a Dataset/DataCatalog.
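To make the problem concrete, here is a minimal sketch of how a consumer could spot harvested records that have lost their dataset link. The record shapes, the example.org identifiers and the set of record types checked are assumptions chosen for illustration.

```python
# Assumed set of data record types to check (illustrative, not exhaustive).
RECORD_TYPES = {"Gene", "Protein", "ChemicalSubstance"}


def records_missing_dataset(records, record_types=RECORD_TYPES):
    """Return the @id of every data record without an includedInDataset value."""
    return [
        record["@id"]
        for record in records
        if record.get("@type") in record_types
        and not record.get("includedInDataset")
    ]


# Hypothetical harvested markup: only the first record keeps its dataset link.
harvested = [
    {
        "@id": "https://example.org/protein/P1",
        "@type": "Protein",
        "includedInDataset": {"@id": "https://example.org/dataset/1"},
    },
    {"@id": "https://example.org/gene/G1", "@type": "Gene"},
    {"@id": "https://example.org/dataset/1", "@type": "Dataset"},
]

orphans = records_missing_dataset(harvested)
```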