Islandora / documentation

Contains Islandora's documentation and main issue queue.
MIT License

Google Structured Data Testing Tool doesn't like mixing Schema.org and Dublin Core in our RDFa #1074

Open seth-shaw-unlv opened 5 years ago

seth-shaw-unlv commented 5 years ago

Related to our SEO issue #882:

This is probably a documentation issue, but the Google Structured Data Tool doesn't like mixing Dublin Core and Schema.org terms.

Declaring something with a Schema.org type (e.g. schema:ImageObject) and adding Dublin Core elements to it will throw errors because those properties are outside that type's scope. [Screenshot: the structured data tester showing errors.]

The reverse is also true: if you add Schema.org properties (e.g. schema:sameAs) to something that doesn't have a Schema.org type, it will also complain. [Screenshot: the structured data tester throwing an error on the type because a schema property was used.] It doesn't complain that the property is there, just that the PCDM type is not known to Google.

seth-shaw-unlv commented 5 years ago

I should have also noted that our default Repository Item works fine because the only fields using schema are the created/modified dates which don't appear to show up in the RDFa.

Also should have included an example of straight-up Dublin Core working fine. [Screenshot, 2019-03-29: Dublin Core-only markup in the structured data tester.] How much SEO benefit that actually gives us on those fields is unclear to me, though.

Also, some of the Dublin Core fields, such as subject, don't appear to work well. [Screenshot, 2019-03-29: subject values in the structured data tester.] You can see the subject values on the left clearly (Costume design, Dancers, etc.), but the structured data tester says these subjects are of an "Unspecified Type". This may be an RDFa issue.

seth-shaw-unlv commented 5 years ago

That last bit with the subjects is my fault. It comes from the subject's href linking to a resource the tester can't access. It actually works fine.

dannylamb commented 5 years ago

@seth-shaw-unlv If we give a resource both a schema and a dc type, is it OK then? Making everything a schema:Article by default (in addition to pcdm) seems appropriate. Not sure about dc (I'm assuming there's a generic thing from the DCMI types we can use or something).

In theory that should appease our semantic robot overlords.

seth-shaw-unlv commented 5 years ago

@dannylamb Nope. Still grumpy. I added dcterms:BibliographicResource to the RDFa and Google was still mad, because the dcterms term is being applied to a schema type. It looks like Google doesn't like other vocabularies being used near schema things.

[Screenshot, 2019-04-02: structured data tester output.]

It looks like either all schema for a Node or none at all as far as Google is concerned.
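The "all schema or none" behaviour observed above can be approximated as a quick pre-flight check before a page goes to the tester. This is only a sketch of the rule as it appears from the screenshots in this thread, not Google's actual validation logic; the function names and the assumption that types and predicates are available as full IRIs are illustrative.

```python
# Sketch: flag nodes that mix schema.org with other vocabularies.
# Assumes types and predicates are already expanded to full IRIs.
SCHEMA_NS = ("http://schema.org/", "https://schema.org/")


def is_schema(iri):
    """True if the IRI lives in the schema.org namespace."""
    return iri.startswith(SCHEMA_NS)


def mixing_problems(types, predicates):
    """Return a list of human-readable problems for a single node."""
    problems = []
    has_schema_type = any(is_schema(t) for t in types)
    if has_schema_type:
        # Case 1: schema.org type carrying properties from elsewhere.
        for p in predicates:
            if not is_schema(p):
                problems.append(f"non-schema property on schema-typed node: {p}")
    elif any(is_schema(p) for p in predicates):
        # Case 2: schema.org property on a node with no schema.org type.
        problems.append("schema.org property used on a node without a schema.org type")
    return problems


if __name__ == "__main__":
    print(mixing_problems(
        ["http://schema.org/ImageObject"],
        ["http://schema.org/name", "http://purl.org/dc/terms/title"],
    ))
```

A node that uses only Dublin Core, with no schema.org type or property, passes this check, which matches the "straight-up Dublin Core" result reported earlier.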

In semi-related news, the Schema.org architypes proposal was accepted and added to schema.org! This makes schema-only descriptions a bit easier to do. BTW, has anyone thought to do a Dublin Core -> schema.org comparison/map? Could a repository conceivably abandon Dublin Core for pure schema.org without (much) loss?
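One rough way to explore the "without (much) loss" question is to run existing records through a crosswalk and see which terms have no target. The mapping below is a small illustrative guess, not a vetted or community-endorsed crosswalk; the schema.org properties named are real, but whether each is the right match for a given Dublin Core term is an assumption, and `crosswalk` is a hypothetical helper.

```python
# Sketch: an illustrative (unvetted) Dublin Core -> schema.org crosswalk.
DC_TO_SCHEMA = {
    "dcterms:title": "schema:name",
    "dcterms:creator": "schema:creator",
    "dcterms:description": "schema:description",
    "dcterms:publisher": "schema:publisher",
    "dcterms:created": "schema:dateCreated",
    "dcterms:modified": "schema:dateModified",
    "dcterms:language": "schema:inLanguage",
    "dcterms:subject": "schema:about",
}


def crosswalk(record):
    """Map a flat {dc_term: value} record; also report terms with no target."""
    mapped, unmapped = {}, []
    for term, value in record.items():
        target = DC_TO_SCHEMA.get(term)
        if target is None:
            unmapped.append(term)  # this is the potential "loss"
        else:
            mapped[target] = value
    return mapped, unmapped


if __name__ == "__main__":
    mapped, unmapped = crosswalk({
        "dcterms:title": "Costume design",
        "dcterms:provenance": "Donated in 1985",  # made-up sample value
    })
    print(mapped)
    print(unmapped)
```

Running this over a real collection and tallying the `unmapped` list would give a first estimate of how much description a schema-only approach would drop.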

DiegoPino commented 5 years ago

@seth-shaw-unlv, 2 cents here: this is one of the reasons why mixing and matching properties from different ontologies is not such a good idea unless you make sure a property is valid in the other ontology's class definition/domain. It's a bit like the work on MODS to RDF mapping that happened in that great working group: it works for internal use, but is not semantically correct for exposing the data to the outside (and by saying that I now deserve to be hated).

Google tries to apply its ontology validation correctly, and in that validation, if an object is of type schema:Thing, only properties in that domain are valid. Google also cannot do ontology intersection, alignment, or inference, so specifically in RDFa it will try to match any given property against all classes. A better way of getting around this is to avoid other ontologies in the RDFa (stick with schema) but embed JSON-LD as a script in the body. That is what Zenodo and DataCite are doing with great success. In that case your JSON-LD can have many contexts and Google will not complain (namespaces will also match, because the expansion will only apply to the right RDF (or OWL) class). Still, it's good to check whether a certain group of properties can freely be moved between ontologies; I highly recommend not doing that without validating.
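For anyone unfamiliar with the embedding approach, here is a minimal sketch of rendering a schema.org-only description as a JSON-LD script element. It is not taken from Zenodo, DataCite, or Islandora; `jsonld_script_tag` and the sample record (including the example.org URL) are hypothetical.

```python
import json


def jsonld_script_tag(doc):
    """Render a JSON-LD document as an HTML script element."""
    payload = json.dumps(doc, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early;
    # "<\/" is still valid JSON.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical schema.org-only description of one repository item.
example = {
    "@context": "https://schema.org",
    "@type": "ImageObject",
    "name": "Costume design sketch",
    "url": "https://example.org/node/1",
}

if __name__ == "__main__":
    print(jsonld_script_tag(example))
```

The output can be placed anywhere in the page; the RDFa in the rest of the markup is left untouched.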

seth-shaw-unlv commented 5 years ago

@DiegoPino I'm not seeing any examples that would allow us to use multiple ontologies in the JSON-LD without Google freaking out. The multiple contexts seem to be used mostly as namespace definitions (multiple mappings of predicates to field names), but the resulting set of edges still mixes ontologies. The DataCite examples of JSON-LD I found (https://blog.datacite.org/schema-org-register-dois/) only use schema.org.

Having one set in the JSON-LD script tag and another in the RDFa doesn't work because Google appears to ignore the RDFa when it finds JSON-LD.

So, really, it looks like anything we want to hand off to Google needs ontological consistency, but we can index whatever we want in our Fedora and triple store. This implies to me that we need to keep the JSON-LD just for indexing and have some way to either filter what gets pushed into the RDFa vs. the JSON-LD or use separate configs for each.

DiegoPino commented 5 years ago

Hi, I will share some examples with you tomorrow (on the phone now). Google can handle some other stuff if it is inside JSON-LD. Contexts can in fact contain many ontologies (that's what namespaces are for, among other things); e.g. the IIIF Presentation context uses quite a few. But also, you just answered your own issue :). Since your Islandora 8 architecture gives you basically no control to remove some predicates from the RDFa without affecting every other mapping you have in Drupal to talk to Fedora, etc., embedding a simpler JSON-LD (and by that I mean schema.org only, which seems the lowest barrier) ensures Google is happy while you keep your full-blown mix and match for your RDFa and triple store needs. Seems like a win-win situation. Now you just need to embed it.

-- Diego Pino Navarro Digital Repositories Developer Metropolitan New York Library Council (METRO)

dannylamb commented 5 years ago

@seth-shaw-unlv @DiegoPino https://www.drupal.org/project/schema_metatag does just that. We can set up how we want stuff for Google, and that gets embedded as JSON-LD. At that point there is a discrepancy between the RDFa, the embedded JSON-LD, and what goes in Fedora/the triplestore, but I guess Google's behaviour works in our favor there with respect to RDFa vs. embedded JSON-LD. And really, we have no choice but to separate what Google wants from how users choose to model their data.

seth-shaw-unlv commented 5 years ago

Based on the devel call this week, this issue will likely wait until someone has an Islandora 8 site live and indexed by Google/Bing so we can test the real-world impact of multiple ontologies.

If it truly is a problem, then we can probably have a module pull in the JSON-LD and do a simple filter or map so only schema.org appears in the page's script element and trim the "_format=jsonld" off the URIs.
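The "simple filter" idea could look roughly like the sketch below. It assumes the JSON-LD is in expanded form under an "@graph" key, with full IRIs as keys and lists as values, and that "?_format=jsonld" is the only query string on the URIs; none of that is confirmed in this thread, and all function names are hypothetical.

```python
# Sketch: keep only schema.org types/predicates and trim "?_format=jsonld".
SCHEMA_PREFIXES = ("http://schema.org/", "https://schema.org/")
FORMAT_SUFFIX = "?_format=jsonld"


def trim_format(iri):
    """Drop the serialization suffix so the URI points at the page itself."""
    return iri[: -len(FORMAT_SUFFIX)] if iri.endswith(FORMAT_SUFFIX) else iri


def clean_value(value):
    """Trim the suffix on object references; leave literals untouched."""
    if isinstance(value, dict) and "@id" in value:
        return {**value, "@id": trim_format(value["@id"])}
    return value


def schema_only(node):
    """Return a copy of one graph node with non-schema.org content removed."""
    out = {}
    for key, value in node.items():
        if key == "@id":
            out[key] = trim_format(value)
        elif key == "@type":
            types = [t for t in value if t.startswith(SCHEMA_PREFIXES)]
            if types:
                out[key] = types
        elif key.startswith(SCHEMA_PREFIXES):
            out[key] = [clean_value(v) for v in value]
    return out


def filter_graph(doc):
    """Apply schema_only to every node under "@graph"."""
    return {"@graph": [schema_only(n) for n in doc.get("@graph", [])]}
```

The filtered document would then be what goes into the page's script element, while the unfiltered JSON-LD continues to feed Fedora and the triple store.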