Schema.org-flavored Content Models

ghost commented 10 years ago

Should we mention that these models are mostly derived from schema.org? It might help folks sell it in their agencies if they can say that they're in cahoots with Google, Facebook, Etsy, Github, etc...

source: http://getschema.org/index.php/List_of_websites_using_Schema.org

ghost commented 10 years ago

We could also talk about how we mapped the M1313 schema over with schema.org to identify the most salient metadata :)

jpgsa commented 10 years ago

Yes, we should say that. :) I also recall us taking some lessons from RDFa syntax. We should mention that at our webinar next week. https://www.digitalgov.gov/event/what-structured-content-can-do-for-you-article-model/

elucify commented 10 years ago

I just received a notice about this, and have looked at the website. I get what the webinar is supposed to communicate, but I have some questions:

Why create a schema.org "flavored" schema instead of just adopting schema.org content models?
What was inherited from RDFa? From what I can see on the site, it's pretty much all schema.org, except what isn't (like RelatedURLs)?
Where's the schema? (RelaxNG, XSD, whatever?)
Does RelatedURLs have an attribute that indicates how the URLs are "related"?

It looks like a reasonably good idea, but I don't see any technical artifacts in this repo.

ghost commented 10 years ago

Hi Mark,

I'll let the rest of the group chime in, but I do recall that one of the reasons we didn't use vanilla schema.org is its breadth. It's pretty vast and we were trying to pull out the core parts that we felt would be manageable for authors. We used the M1313 (project open data) to help identify candidates (which overlapped with schema.org).

Would you have preferred just using schema.org instead?

dvito commented 10 years ago

I'd be sure to include that not only do the schema.org content models improve SEO (by allowing your templates to include data appropriately for Google and other crawlers), but they can also be reused to populate the same meta tags that Twitter Cards (and other social networking sites) use.

@elucify , The "schema" is really just that those collections of fields should be present as attributes in your rendered HTML; it's not a strict standard of what needs to be in the content types of your CMS. It is not a schema in the truest sense of the word, which I found confusing at first.
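A minimal sketch of the reuse described above: one content item feeding both Open Graph and Twitter Card meta tags. The field names (`title`, `description`, `url`) and the function name are hypothetical, chosen only for illustration; they are not taken from the content models themselves.

```python
from html import escape


def social_meta_tags(item):
    """Render Open Graph and Twitter Card <meta> tags from one content item.

    The item keys used here are assumptions made for this example.
    """
    pairs = [
        # Open Graph tags use the "property" attribute.
        ("property", "og:title", item["title"]),
        ("property", "og:description", item["description"]),
        ("property", "og:url", item["url"]),
        # Twitter Card tags use the "name" attribute.
        ("name", "twitter:card", "summary"),
        ("name", "twitter:title", item["title"]),
        ("name", "twitter:description", item["description"]),
    ]
    return "\n".join(
        f'<meta {attr}="{key}" content="{escape(value, quote=True)}">'
        for attr, key, value in pairs
    )


article = {
    "title": "Fish & Wildlife Update",
    "description": "Quarterly summary of program activity.",
    "url": "https://example.gov/articles/update",
}
print(social_meta_tags(article))
```

The same dictionary could also drive the schema.org attributes in the page template, so the metadata is authored once and published in several vocabularies.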

smileytech commented 10 years ago

@logantpowell , we do discuss a little bit the relationship to schema.org and RDFa on the FAQ page in the "what are content models" question, but this could probably be strengthened/clarified. Any suggestions for updated language?

@elucify , we've thought about creating formalized schema definitions, but decided to start with the HTML descriptions you see currently. If we do this, what format(s) do you think would be the most useful?

lgrama commented 10 years ago

This discussion really mirrors some of the conversations in the working group and brings up some good issues. Content models can serve two purposes:

Publishing out metadata for search engine and social media optimization purposes. This is where Schema.org or OpenGraph really come in handy and can be really integrated into the web pages that are published out.
Structuring content within Web Content Management systems to enable management of content for presentation and any rules-based publishing activities. This also helps with getting to a content-as-a-service model. The models are derived from Schema.org and we have mappings to schema.org. We wanted to simplify the models and use the elements that are most appropriate in the federal environment. We were also concerned about the barrier to entry and wanted to keep things simple. In some ways they are conceptual models that can then be translated by folks into content models or types that work in their CMS.

elucify commented 10 years ago

Thanks for all the responses! That's a lot to reply to, but I'll try to hit the points one at a time.

@logantpowell: About the vastness of schema.org: yes, it's big, but it has the benefit that it is standardized. Many of the elements are optional, so one approach would have been to just write a document indicating whether elements are required, recommended, optional, or discouraged based on your organizational priorities. I guess you could consider M1313 to be a standard also; as they say, the nice thing about standards is that there are so many to choose from. I suppose the difference here is philosophical. It seems to me that a large, possibly unwieldy model that is standardized is better than something that is more or less isomorphic to an existing standard, but not exactly, and introduces more accidental complexity. On the other hand, there's enormous, unwieldy HL7.
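To illustrate the "profile" approach, here is a rough sketch that tags a handful of schema.org Article properties as required, recommended, optional, or discouraged, and checks an item against those tags. The property names are real schema.org properties, but the levels assigned to them are invented for the example and do not reflect any decision by the working group.

```python
# Hypothetical profile over schema.org Article; the levels are illustrative only.
ARTICLE_PROFILE = {
    "headline": "required",
    "datePublished": "required",
    "author": "recommended",
    "articleBody": "recommended",
    "wordCount": "optional",
    "pageStart": "discouraged",
}


def check_against_profile(item, profile):
    """Report required properties that are missing and discouraged ones in use."""
    missing = sorted(
        prop for prop, level in profile.items()
        if level == "required" and prop not in item
    )
    discouraged = sorted(
        prop for prop in item if profile.get(prop) == "discouraged"
    )
    return {"missing_required": missing, "discouraged_used": discouraged}


report = check_against_profile(
    {"headline": "Budget released", "pageStart": "1"}, ARTICLE_PROFILE
)
print(report)
```

A profile like this keeps the standard vocabulary intact and only layers organizational priorities on top of it.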

@vito, when you say the elements should be present as attributes of Web documents, do you mean these elements should be used to mark up HTML semantically? What systems are going to be able to make use of that annotation? Or am I misunderstanding your point?

@logantpowell: Personally, I would not necessarily have "preferred" pure schema.org. It just seems to me more useful to choose a standardized markup scheme that is already being interpreted by search engines. Maybe schema.org elements marked up in webpages using RDFa (and not using microdata) would be a good standards-based approach. I think it might be clearer what your content models are for if you were to provide concrete examples of how to use them in information processing systems; for example, using them to mark up webpages for SEO; or, as a document exchange format between agencies.

@smileytech, it seems to me that creating a RelaxNG compact syntax schema for the model you have created would not be too much work. [Since we're on GitHub, I imagine I will be immediately invited to send a pull request :-).] I prefer RelaxNG-CS because it is more easily readable, writable, and explainable than XSD, yet can be transformed to XSD, which most validators use. In fact, if your content model were expressed in RelaxNG, you could use it to generate the documentation. Generally, I think it's easier to start with the machine-processable format and transform it to something human-readable, instead of trying to go the other way.

@lgrama, if your content models truly are simplifications of schema.org, and there is an unambiguous map between them, it would be nice to see that map in the repo. Looking between the models, it seems that your Title element is the same as the schema.org event.name. But I'm not really sure, and an explicit map would clarify such questions.

Some of the elements I find puzzling. For example, I can't tell what RelatedURLs means. As far as I can tell, there's no way to indicate the relationship between the document and the URLs that it says are "related". This severely limits the utility of those links, because there's no way for an information processing system to know what the links are for. An additional benefit of adopting RDFa for link markup would be that the links within related URLs could be scoped by additional RDFa statements, providing semantics to what is now just a bag of links.
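As a sketch of what typed links could look like, the snippet below renders a list of related URLs as RDFa, using a schema.org property on each link to say how it relates to the article. The input shape (`relation`, `url`, `label`) and the function name are assumptions made for this example; `citation` and `isPartOf` are existing schema.org properties, picked only to show the idea.

```python
from html import escape


def related_links_rdfa(links):
    """Render links as an RDFa-annotated list scoped to a schema.org Article."""
    items = []
    for link in links:
        items.append(
            f'  <li><a property="{escape(link["relation"], quote=True)}" '
            f'href="{escape(link["url"], quote=True)}">'
            f'{escape(link["label"])}</a></li>'
        )
    # vocab and typeof set the scope; each property names the relationship.
    return (
        '<ul vocab="http://schema.org/" typeof="Article">\n'
        + "\n".join(items)
        + "\n</ul>"
    )


print(related_links_rdfa([
    {"relation": "citation", "url": "https://example.gov/report", "label": "Source report"},
    {"relation": "isPartOf", "url": "https://example.gov/series", "label": "Series home"},
]))
```

A consumer that understands RDFa can then tell a cited source apart from a parent collection, instead of seeing two undifferentiated links.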

These are just some thoughts. I look forward to seeing where this goes. Sorry I won't be able to make it to your demos tomorrow, I'm sure things would be clearer to me if I could.

Cheers

ghost commented 10 years ago

@elucify I like the idea of putting a table where we show the schema.org schema alongside the M1313 schema to identify 'required' or 'highly recommended' metadata. Actually, this is very much the same as how we ended up with the schema.

@all: should / could we do something like this using GitHub's new interactive tables?

https://github.com/blog/1601-see-your-csvs

gbinal commented 10 years ago

FWIW, if it'll help, I'm happy to help anybody get things in Github Pages, a la Project Open Data's tables and charts.

G

philipashlock commented 10 years ago

I do think it would be helpful to see a mapping to Schema.org at least to see where better alignment is easily possible. In some cases it almost looks like the Article content model intentionally diverges from schema.org. For example the property DateFirstPublished is defined exactly the same as the schema.org property called datePublished and I don't see why they can't use the same name.

@logantpowell The M1313 schema (which I'll refer to as the Project Open Data or POD schema) is actually based on DCAT and the schema.org Dataset schema was later based on DCAT as well. You can find the mapping between DCAT and Schema.org Dataset schema at: http://www.w3.org/wiki/WebSchemas/Datasets#Mappings

Unfortunately the current POD schema was developed before DCAT was finalized and DCAT evolved a bit since Project Open Data came out. We're now in the process of updating the POD schema to address issues that have come up in the past year and also to better align with DCAT/Schema.org.

You can track that progress at https://github.com/project-open-data/project-open-data.github.io/labels/schema

You can also see the mapping between the POD Schema, DCAT, and Schema.org at: http://project-open-data.github.io/metadata-resources/

ghost commented 10 years ago

@philipashlock as we progress down this path, I wonder if we should just be spreading the existing gospel rather than rolling our own...

Perhaps, if we feel really passionate about adding some metadata to a schema.org model, we could rather act as a liaison for government within the schema.org community?

Thoughts?

ghost commented 10 years ago

@philipashlock Will the updated POD schema include 'audience'?

philipashlock commented 10 years ago

I just created a pull request (#6) of a crosswalk table to see where there were opportunities for more alignment between this Article Content model and the schema.org Article Type.

There's definitely a lot of alignment to begin with - although none of the camel-cased capitalization matches. The places where they diverge seem almost more accidental rather than intentional since it's not clear what value might come from the alternative. It does look like there are a few instances where fields were based on DCAT/POD instead of Schema.org, but since the Schema.org Dataset Type already defines a mapping to DCAT it would make more sense to stick with one vocabulary.

A few notes about ambiguities or incompatibilities in the mapping:

articleSection from schema.org does not seem to be compatible with the way ArticleSection has been defined in the OASCM model. They seem to have different purposes. HTML5 does have a section element as well as headings that sound more like the ArticleSection, SectionTitle, and SectionBody fields in the OASCM model so perhaps we just use those instead. The schema.org Article type also allows these to be nested so an Article can be part of another Article using the isPartOf property.
I don't really have any idea what the differences are between the different dates: DatePosted, DateFirstPublished, DateReleased, but I did my best to align them with the schema.org properties. There wasn't a field to align with DatePosted but perhaps that's redundant anyway?
I wasn't sure about what might align with Topic but maybe something like about, genre, keywords, articleSection - otherwise this could be a custom property. It looks like there are a few other types across schema.org that have some kind of category related property, so perhaps the new one would be articleCategory
I wasn't sure about what might align with RelatedURL but perhaps citation, isBasedOnUrl, isPartOf or mentions? None of these seem quite right. I think @elucify is right that you'd want to show what the relationship is, but it does look like there are other types that have a property like this, eg see relatedLink
I wasn't sure about what might align with RelatedMultimedia but associatedMedia/encoding might work if the media is meant to be a representation of the article in another medium. Otherwise, I don't think they're quite the same thing.
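A crosswalk like this can also be kept in machine-readable form. The sketch below is only an illustration, not the contents of the pull request: the DateFirstPublished to datePublished pairing comes from this thread, while the Title to headline pairing is an assumption made for the example. Fields with no agreed mapping are reported rather than guessed.

```python
# Illustrative crosswalk from content-model field names to schema.org properties.
CROSSWALK = {
    "DateFirstPublished": "datePublished",  # described in the thread as identically defined
    "Title": "headline",                    # assumption for this example
}


def to_schema_org(item, crosswalk):
    """Rename mapped fields and list the ones that have no mapping yet."""
    mapped = {"@type": "Article"}
    unmapped = []
    for field, value in item.items():
        target = crosswalk.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)


mapped, unmapped = to_schema_org(
    {
        "Title": "Budget released",
        "DateFirstPublished": "2014-01-15",
        "Topic": "Finance",
        "RelatedURL": "https://example.gov/budget",
    },
    CROSSWALK,
)
print(mapped)
print(unmapped)
```

Listing the unmapped fields makes the open questions (Topic, RelatedURL, and so on) visible every time the crosswalk is applied.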

philipashlock commented 10 years ago

@logantpowell There's no proposal I'm aware of to add audience to the POD schema, but it seems like a pretty good idea to me. Feel free to propose it at https://github.com/project-open-data/project-open-data.github.io/issues

I personally do think it would make more sense to extend schema.org rather than do something sort of inspired by it, but not actually using it. There have already been a number of Schema.org Types that were developed for more government specific purposes, eg GovernmentService (which is particularly relevant to usa.gov) and Dataset which is based on DCAT (which is primarily focused on use cases where government is the publisher).

There's information about extending schema.org at http://schema.org/docs/extension.html or you can join the mailing list http://lists.w3.org/Archives/Public/public-vocabs/

philipashlock commented 10 years ago

One point that I think is worth emphasizing is that the nested types for properties in schema.org aren't necessarily required and can instead just be text. Viewed this way schema.org actually seems quite simple and not so vast. With this in mind, the schema.org Article type could be even simpler than the proposed Article Content Model which has several nested types. That said, I do see the value in providing a profile that's a little more of an explicit usage of a schema.org type, so it might still be good to articulate required properties and whether nested types should be required, encouraged, discouraged, or prohibited for certain properties.

Here's the clarifying language:

Expected types vs text. When browsing the schema.org types, you will notice that many properties have "expected types". This means that the value of the property can itself be an embedded item (see section 1d: embedded items). But this is not a requirement—it's fine to include just regular text or a URL. In addition, whenever an expected type is specified, it is also fine to embed an item that is a child type of the expected type. For example, if the expected type is Place, it's also OK to embed a LocalBusiness.

Source: http://schema.org/docs/gs.html#schemaorg_expected
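To make the "expected types vs text" point concrete, here is a small sketch showing the same article with `author` given first as plain text and then as a nested Person item, expressed as JSON-LD built in Python. The helper `author_name` is hypothetical and only shows how a consumer might accept both forms.

```python
import json

# Text-only form: the author property holds a plain string.
text_only = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Budget released",
    "author": "Jane Doe",
}

# Nested form: the author property holds an embedded Person item.
nested = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Budget released",
    "author": {"@type": "Person", "name": "Jane Doe"},
}


def author_name(article):
    """Return the author's name whether it is plain text or a nested item."""
    author = article["author"]
    return author["name"] if isinstance(author, dict) else author


print(json.dumps(nested, indent=2))
print(author_name(text_only) == author_name(nested))
```

Both forms describe the same article; the nested one simply leaves room for more detail about the author later.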

ghost commented 10 years ago

I actually don't think we will need to extend schema.org for our purposes as a working group, but we could definitely share the guidance you shared and perhaps go over some reasons folks might want to do this.

I also think we should discuss further your comment about the advantages / disadvantages implicit in using either the nested or text-only approach.

VladimirAlexiev commented 1 year ago

I want to second the need to map these models to schema.org. Eg we want to use https://gsa.github.io/Open-And-Structured-Content-Models/models/event-model.html in the Ontotext Knowledge Graph, but we cannot do it directly since it's not mapped to schema.org (OTKG is based on schema.org).

Some of the mapping is self-evident (eg Event -> schema:Event)
But others are not. Eg “Speaker”: we want such person to be related to Event through a creative work (contribution, eg presentation), not directly.
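One possible shape for that indirect relationship, sketched as JSON-LD built in Python: the Event points to a presentation through schema.org's `workFeatured` property, and the presentation points to the speaker through `author`. This is an assumption about how such a mapping could look, not the modeling actually used in the Ontotext Knowledge Graph, and the function name is hypothetical.

```python
import json


def event_with_presentation(event_name, talk_title, speaker_name):
    """Link a speaker to an event via the creative work they contributed."""
    return {
        "@context": "https://schema.org",
        "@type": "Event",
        "name": event_name,
        # The speaker is attached to the work, not directly to the event.
        "workFeatured": {
            "@type": "CreativeWork",
            "name": talk_title,
            "author": {"@type": "Person", "name": speaker_name},
        },
    }


doc = event_with_presentation(
    "Structured Content Webinar", "The Article Model", "Jane Doe"
)
print(json.dumps(doc, indent=2))
```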
