DOI-DO / dcat-us

Data Catalog Vocabulary (DCAT) - United States Profile
Chief Data Officers Council & Federal Committee on Statistical Methodology

Take into consideration a granular way to forbid crawlers from indexing/harvesting select resources #125

Closed: jqnatividad closed this issue 10 months ago

jqnatividad commented 1 year ago

Name: Joel Natividad

Affiliation: datHere, Inc.

Type of issue: General Comment

Issue: The rich machine-readable metadata exposed by the spec will be a gold mine for crawlers - not only for search engines, but also as training data for AI models.

It'd be great if data publishers had a granular way to mark select resources as NOT harvestable/crawlable for indexing/inferencing purposes.
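
To make the idea concrete, something like the sketch below is the level of granularity I have in mind. The ex: namespace and the harvestable/aiTrainingAllowed properties are hypothetical, invented only for illustration - nothing like them exists in DCAT or DCAT-US today - and the dataset URIs are placeholders.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/harvest-control#> .   # hypothetical namespace, not part of DCAT-US

<https://data.example.org/dataset/hospital-readmissions>
    a dcat:Dataset ;
    dcterms:title "Hospital Readmissions, 2022" ;
    # Hypothetical dataset-level flag: do not index or harvest for model training.
    ex:harvestable false ;
    dcat:distribution <https://data.example.org/dataset/hospital-readmissions/csv> .

<https://data.example.org/dataset/hospital-readmissions/csv>
    a dcat:Distribution ;
    # Hypothetical distribution-level flag, so publishers can opt out per resource.
    ex:aiTrainingAllowed false ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .
```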

As it happens, Google is convening a discussion on exactly this topic.

As new technologies emerge, they present opportunities for the web community to evolve standards and protocols that support the web’s future development. One such community-developed web standard, robots.txt, was created nearly 30 years ago and has proven to be a simple and transparent way for web publishers to control how search engines crawl their content. We believe it’s time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases.

https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
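
For comparison, the coarse-grained control available today looks roughly like the robots.txt sketch below. GPTBot and Google-Extended are published AI-crawler user-agent tokens; the /catalog/ path is just an assumed location for a portal's DCAT-US endpoint. Note that robots.txt can only exclude whole URL paths, not individual datasets described inside a single catalog document, which is exactly why metadata-level granularity would help.

```text
# Hypothetical robots.txt for a portal exposing its DCAT-US catalog under /catalog/ (assumed path).
# Lets traditional search engines index everything, but opts the catalog out of AI-training crawlers.

User-agent: GPTBot
Disallow: /catalog/

User-agent: Google-Extended
Disallow: /catalog/

User-agent: *
Allow: /
```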

I recall when schema.org was announced in June 2011 by Google, Bing, and Yahoo, just days before the Semantic Technology Conference. Most attendees were caught by surprise, as the spec had been developed largely outside the usual standards groups.

Since then, schema.org has become more transparent and has worked with the community to evolve the spec. And sure enough, it has become the de facto standard for semantic markup on the web.

Recommended change(s): IMHO, DCAT-US v3 is certain to be used outside of traditional open data / government data interoperability use cases, since it aims to "provide a single metadata standard able to support most requirements for documentation of business, technical, statistical, and geospatial data consistently". It would be prudent to actively engage the search engine and AI firms that will almost certainly harvest the metadata at scale.

jqnatividad commented 1 year ago

Even with judicious use of AccessRestriction and UseRestriction based on NARA standards, a lot of information can still be inferred from the other mandatory fields.
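
To illustrate the point: even when legal metadata marks a dataset as restricted, the mandatory descriptive fields remain rich enough to infer sensitive context. In the sketch below, the dcat-us namespace URI, the accessRestrictions property name, and the NARA code-list URI are assumptions made for illustration only and should be checked against the spec.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dcat-us: <http://resources.data.gov/ontology/dcat-us#> .   # assumed namespace URI, check the spec

<https://data.example.org/dataset/border-sensor-network>
    a dcat:Dataset ;
    # Mandatory descriptive fields still tell a harvester a great deal,
    # even when access to the data itself is restricted.
    dcterms:title "Border Sensor Network Locations" ;
    dcterms:description "Locations and maintenance schedules of fixed sensors." ;
    dcat:keyword "sensors", "surveillance", "infrastructure" ;
    # Legal metadata, with illustrative property/class names modeled loosely on the
    # DCAT-US v3 AccessRestriction guidance; the NARA code-list URI is a placeholder.
    dcat-us:accessRestrictions [
        a dcat-us:AccessRestriction ;
        dcterms:type <https://example.org/nara/access-restriction/restricted-fully>
    ] .
```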

Consider adding best-practice guidelines and perhaps additional controls specifically for search engine / AI crawlers when the Implementation Guidance is created.

fellahst commented 10 months ago

Joel,

Thank you for your submission regarding the potential use of DCAT-US metadata by web crawlers and AI models. While your suggestion to include a mechanism for publishers to mark resources as non-harvestable is insightful, it falls outside the scope of the DCAT-US specification. DCAT-US primarily focuses on defining a standard for data documentation and interoperability, particularly in government and open data contexts. Issues related to the control of data indexing and its use by emerging technologies, though important, pertain more to broader web standards and protocols, such as those discussed in the context of schema.org and similar initiatives. We recommend addressing these concerns through forums specifically dedicated to web publishing and AI interaction standards. A usage guideline has been provided for Legal Metadata in the spec.