Develop collection and project schema

laurenwalker commented 6 years ago

We need to develop a schema that will represent a collection and a project in our system.

Collections are first-class objects in Metacat. A collection is an aggregation of datasets created by a user for whatever purpose. Collections will probably have a minimal set of metadata, such as:

title
short description

A Project is a subclass of Collection. Projects have additional metadata such as:

short title (for a friendly URL)
list of personnel
longer descriptions
logo

emilyodean commented 6 years ago

Project XML schema

Collections JSON schema

mbjones commented 6 years ago

TODO: define format types for Collections and Project pages.

laurenwalker commented 6 years ago

Several commits have been made in the past week with updates to the project and collection schemas. Here is a summary of today's commits:

Broke the filter element out into specific filter types (e.g. booleanFilter, textFilter, etc.) - https://github.com/NCEAS/project-papers/commit/71b59e5e81e959a093981dbbfdefeb22bfed8036

Removed the id element from the collection definition - https://github.com/NCEAS/project-papers/commit/cd66aeb619fb65ace157e8bffc435cbd2592b427

Added branding colors - https://github.com/NCEAS/project-papers/commit/af20121e24cdf640b62bf17ba1d7b955dad47211

Added map options - https://github.com/NCEAS/project-papers/commit/a786f1dca67b6a9c107ba8e9e9a7870097a9ef35

Added ability to hide the metrics section - https://github.com/NCEAS/project-papers/commit/41460986deba341518dd5c311550b638ff23268e

laurenwalker commented 6 years ago

The one part of the schema I haven't figured out quite yet -- logos.

How do we want to reference images in the project documents?

Some options:

A. (my preference) A simple element with a string value of an identifier of a logo image in the data repository.

B. A complex element of EMLEntity type. The image itself will be an object in the data repository, and will use the EMLEntity identifier field to reference the image. It looks like the project schema was originally designed this way, but I think it is overkill. If someone knows why it was originally designed this way, let me know.

C. A simple element with a string value of a URL to a logo image anywhere on the web. The downside to this is that if an image URL is ever invalidated (host name changes, image is removed from the web server, etc.), the image won't show up in MetacatUI. It is also subject to CORS issues.

csjx commented 6 years ago

Heya @laurenwalker - I could see both (A) and (B) working, and I agree that for a configuration schema, (A) is easier. @mbjones - thoughts on why we have a full eml-entity tree here?

I'd avoid (C), yes, because of the issues you raise. If the logo image is stored in the repo, I'd suggest that the identifier reference be a seriesId so the icon can change without a required project configuration change.

D. Another possibility is to store the images as inline, Base64-encoded strings directly as the <logo> element value. It keeps the configuration together, but is a little more verbose because of the Base64 text. I don't have a strong opinion on this, but it's an option. Rendering inline images is about 10% slower than setting an <img src="https://somewhere...">, but I'm not sure if that figure includes the time for the HTTP GET call or not, and it may be a moot point for small logo images (inline may be close to the same rendering time). See https://css-tricks.com/data-uris/.
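To make option (D) concrete, here is a minimal Python sketch of turning image bytes into an inline data URI. The helper name `to_data_uri` is hypothetical, and whether the <logo> element would hold the full data URI or only the bare Base64 text is an assumption, not something the schema defines.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data URI that embeds the given image bytes as Base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first eight bytes of any PNG file, used here as stand-in image data.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hypothetical value that could be placed in the <logo> element or an <img src>.
logo_uri = to_data_uri(PNG_SIGNATURE)
print(logo_uri)
```

Base64 emits four characters for every three input bytes, so the inline text is about a third larger than the original image file.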

csjx commented 6 years ago

A few schema comments:

In the schema files, I would change import statements like:

<xs:import namespace="eml://ecoinformatics.org/project-2.2.0" schemaLocation="/Users/datateam/local_repos/eml/xsd/eml-project.xsd"/>

to

<xs:import namespace="eml://ecoinformatics.org/project-2.2.0" schemaLocation="eml-project.xsd"/>

If we are no longer extending the current eml-project module, we can drop that import.
I'm wondering why we are calling the main xs:complexTypes in the project and collection schemas DatasetCollectionType and DatasetProjectType vs CollectionType and ProjectType?
The two schema files need an assigned namespace using the xmlns and targetNamespace attributes on the <xs:schema> root element.
The current EML project module has:
```
<xs:schema  xmlns="eml://ecoinformatics.org/project-2.2.0"
targetNamespace="eml://ecoinformatics.org/project-2.2.0">
```
We might consider eml://ecoinformatics.org/project-2.2.0beta1 and eml://ecoinformatics.org/collection-2.2.0beta1, or something completely different so we don't have collisions with the current project module.
I'm not fully understanding the operator (AND/OR) being applied to a single Filter instance. I would expect that we would be applying an operator to a group of filters (to emulate a Solr parentheses block (keyword:Coho+AND+keyword:SASAP+AND+title:*McKenzie*)). To me, a single Filter operator would be like contains or ends-with or begins-with or matches (which cues us to use an asterisk in the value), whereas a FilterGroup operator would be AND or OR and would apply across all of the filters in the group. Do we need to define a FilterGroup? Am I misunderstanding something here?
In the project schema, the element is typed with ent:ImageListType, but that type is not defined in the entity schema. Was this a proposed addition that never got in there? Lauren's proposal above would nix this anyway.
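To illustrate the FilterGroup question above, here is a rough Python sketch of a group-level operator applied across several single-value filters. The `Filter` and `FilterGroup` classes are hypothetical stand-ins for illustration only; they are not types from the schema, and no value escaping is attempted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Filter:
    """Hypothetical single-value filter: one Solr field and one value."""
    field: str
    value: str

    def to_solr(self) -> str:
        return f"{self.field}:{self.value}"


@dataclass
class FilterGroup:
    """Hypothetical group that joins its filters with one operator."""
    operator: str  # expected to be "AND" or "OR"
    filters: List[Filter]

    def to_solr(self) -> str:
        if self.operator not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {self.operator}")
        joined = f" {self.operator} ".join(f.to_solr() for f in self.filters)
        # Parentheses keep the group together when combined with other clauses.
        return f"({joined})"


group = FilterGroup(
    operator="AND",
    filters=[
        Filter("keyword", "Coho"),
        Filter("keyword", "SASAP"),
        Filter("title", "*McKenzie*"),
    ],
)
print(group.to_solr())
```

The spaces around the operator would be URL-encoded (for example as `+`) when the clause is placed in a request URL.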

I'll leave it there for now, but am still reviewing. Looking good though Lauren!

laurenwalker commented 6 years ago

> I'm wondering why we are calling the main xs:complexTypes in the project and collection schemas DatasetCollectionType and DatasetProjectType vs CollectionType and ProjectType?

We could change the name. I just thought ProjectType could be confused with the EML Project schema, so I added the dataset qualifier. Not sold on it, though.

> We might consider eml://ecoinformatics.org/project-2.2.0beta1 and eml://ecoinformatics.org/collection-2.2.0beta1, or something completely different so we don't have collisions with the current project module.

I was thinking we should keep these schemas outside of the eml namespace, since they are pretty Metacat and MetacatUI-specific and won't be used inside EML documents. Up for discussion, though.

> I'm not fully understanding the operator (AND/OR) being applied to a single Filter instance.

The operator is set on a filter because each filter can have more than one value. The filter can use the operator field to specify whether those values are ANDed or ORed together. Example:


<filter>
<field>origin</field>
<value>Chris Jones</value>
<value>Christopher Jones</value>
<operator>OR</operator>
</filter>

I think we decided not to include operators in filterGroups because we have always chosen not to offer advanced filtering options like that in MetacatUI. We could add it to the schema, though, if we decide we want to support that in the future.
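As a rough illustration of the per-filter operator described above, the following Python sketch parses the example filter and joins its values into a single Solr clause. The function name `filter_to_solr`, the phrase quoting, and the fallback to AND when no operator is given are all assumptions for the sake of the example, not MetacatUI's actual behavior.

```python
import xml.etree.ElementTree as ET

FILTER_XML = """
<filter>
  <field>origin</field>
  <value>Chris Jones</value>
  <value>Christopher Jones</value>
  <operator>OR</operator>
</filter>
"""


def filter_to_solr(filter_xml: str) -> str:
    """Turn one <filter> element into a Solr clause over a single field."""
    node = ET.fromstring(filter_xml.strip())
    field = node.findtext("field")
    values = [value.text for value in node.findall("value")]
    # Assumption: a filter without an <operator> element falls back to AND.
    operator = node.findtext("operator", default="AND")
    # Quote each value so multi-word names are matched as phrases.
    quoted = [f'"{value}"' for value in values]
    return f"{field}:(" + f" {operator} ".join(quoted) + ")"


clause = filter_to_solr(FILTER_XML)
print(clause)  # origin:("Chris Jones" OR "Christopher Jones")
```

Here the operator only combines values within one field, which is why it lives on the filter rather than on a group of filters.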

I've pushed some more changes to the schema based on Chris's feedback. I think we're getting close to finalizing it.

laurenwalker commented 6 years ago

I added a ToggleFilter type to the project schema and rewrote the BooleanFilter type.

Commit: https://github.com/NCEAS/project-papers/commit/8935c4d67e3507c0fa891057666b3b142379ddc2

Summary of changes: The BooleanFilterType will have the same exact fields as a text filter (field, value, label, etc.) except the value will be restricted to booleans. The ToggleFilterType has four additional fields: trueLabel, trueValue, falseLabel, and falseValue.
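One way to read the ToggleFilterType fields is as a mapping from an on/off UI state to a label and a query value. The Python sketch below shows that reading; the class, its method names, and the `formatType` example values are hypothetical illustrations, not the MetacatUI implementation.

```python
from dataclasses import dataclass


@dataclass
class ToggleFilter:
    """Hypothetical model of the four extra ToggleFilterType fields."""
    field: str
    true_label: str
    true_value: str
    false_label: str
    false_value: str

    def label_for(self, on: bool) -> str:
        """Label to display for the current toggle state."""
        return self.true_label if on else self.false_label

    def clause_for(self, on: bool) -> str:
        """Query clause for the current toggle state."""
        value = self.true_value if on else self.false_value
        return f"{self.field}:{value}"


# Illustrative values only; these are not taken from the schema.
toggle = ToggleFilter(
    field="formatType",
    true_label="Metadata only",
    true_value="METADATA",
    false_label="Data only",
    false_value="DATA",
)
print(toggle.label_for(True), "->", toggle.clause_for(True))
```

Under this reading, a BooleanFilterType would need no such mapping because its value is already restricted to true or false.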

laurenwalker commented 5 years ago

At this point, the schema for Collections and Projects is starting to get set in stone since we have code in Metacat and MetacatUI that depends on this schema. If anyone has any suggestions for schema changes at this point, let's address them within the next couple weeks.

Schema documents: https://github.com/NCEAS/project-papers/tree/master/schemas

laurenwalker commented 5 years ago

I just sent out a last call for feedback via email to DataONE, NCEAS, and ESS-DIVE developers. After a week or so, I will tag a release for the schemas.

DataONEorg / collections-portals-schemas
