jqnatividad closed this issue 10 months ago
Even with judicious use of AccessRestriction and UseRestriction per NARA standards, a lot of information can still be inferred from other mandatory fields.
Consider adding best-practice guidelines and, perhaps, additional controls specifically for search-engine/AI crawlers when the Implementation Guidance is created.
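To illustrate the inference concern, here's a minimal sketch. The record shape, field values, and helper function are hypothetical (the restriction values loosely mimic NARA-style vocabularies, and are not taken from the spec): even when distributions are withheld under a restriction, the mandatory descriptive fields remain readable to any crawler.

```python
# Hypothetical DCAT-US-style record (illustrative field names/values only).
# Distributions are withheld per the restriction, but the mandatory
# descriptive fields still expose crawlable, inference-friendly text.
dataset = {
    "dcat:Dataset": {
        "dct:title": "Facility Inspection Results, Site X",          # mandatory
        "dct:description": "Findings from on-site safety inspections.",  # mandatory
        "dcat:keyword": ["inspections", "violations", "safety"],     # mandatory
        "access-restriction": "restricted-public",  # NARA-style value (illustrative)
        "use-restriction": "other",
        "dcat:distribution": [],  # downloads withheld per the restriction
    }
}

def inferable_text(record: dict) -> str:
    """Concatenate the always-present descriptive fields a crawler can read."""
    ds = record["dcat:Dataset"]
    return " ".join([ds["dct:title"], ds["dct:description"], *ds["dcat:keyword"]])

print(inferable_text(dataset))
```

Even with an empty distribution list, the title, description, and keywords alone tell a harvester what the dataset is about.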
Joel,
Thank you for your submission regarding the potential use of DCAT-US metadata by web crawlers and AI models. While your suggestion to include a mechanism for publishers to mark resources as non-harvestable is insightful, it falls outside the scope of the DCAT-US specification. DCAT-US primarily focuses on defining a standard for data documentation and interoperability, particularly in government and open data contexts. Issues related to the control of data indexing and its use by emerging technologies, though important, pertain more to broader web standards and protocols, such as those discussed in the context of schema.org and similar initiatives. We recommend addressing these concerns through forums specifically dedicated to web publishing and AI interaction standards. A usage guideline has been provided for Legal Metadata in the spec.
Name: Joel Natividad
Affiliation: datHere, Inc.
Type of issue: General Comment
Issue: The rich machine-readable metadata exposed by the spec will be a gold mine for crawlers - not only for search engines, but also as training data for AI models.
It'd be great if there were a granular way for data publishers to mark select resources as NOT harvestable/crawlable for indexing/inferencing purposes.
As it happens, Google is convening a discussion on exactly this topic.
https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
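For context, the coarsest control that exists today is the robots.txt opt-out: several AI crawlers honor dedicated user-agent tokens (Google-Extended, GPTBot, and CCBot are real tokens as of this writing, though the list changes). A sketch of generating such rules for a catalog endpoint; note this is site-wide and path-based, far coarser than the per-resource, metadata-level control proposed above:

```python
# Generate robots.txt rules that opt a catalog path out of AI-training
# crawlers. Token list assumed current as of writing; verify before use.
AI_CRAWLERS = ["Google-Extended", "GPTBot", "CCBot"]

def robots_txt(disallowed_path: str = "/data.json") -> str:
    """Return one Disallow rule per AI crawler token."""
    rules = [f"User-agent: {agent}\nDisallow: {disallowed_path}"
             for agent in AI_CRAWLERS]
    return "\n\n".join(rules) + "\n"

print(robots_txt())
```

Because robots.txt only scopes by path and user agent, it cannot express "this one dataset is fine to index but not to train on" - which is exactly the gap a metadata-level flag would fill.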
I recall when schema.org was announced in June 2011 by Google, Bing, and Yahoo, days before the Semantic Technology Conference; most attendees were caught by surprise, as the spec had been developed largely outside the usual standards groups.
Since then, schema.org has been more transparent and worked with the community to evolve the spec. And sure enough, it has become the de facto standard for semantic markup on the web.
Recommended change(s): IMHO, DCAT-US v3 is certain to be used outside traditional open data/government data interoperability use cases, as it aims to "provide a single metadata standard able to support most requirements for documentation of business, technical, statistical, and geospatial data consistently." It'd be prudent to actively engage the search engine and AI firms who, I'm certain, will harvest the metadata at scale.