IIIF / cookbook-recipes

For working on the recipes
https://iiif.io/api/cookbook/index.html

0514 ml usage tags #516

Open glenrobson opened 4 months ago

glenrobson commented 4 months ago

For consideration & further discussion related to https://github.com/IIIF/cookbook-recipes/issues/514

Moved to the cookbook repo to build preview. Original Pull request #515

alliomeria commented 4 months ago

Hi @glenrobson, looking at the build validation errors right now. Will follow up and try to correct those in just a few moments.

alliomeria commented 4 months ago

Ah, I see I'm not getting the Validation right for @context. Did I misinterpret the extension mechanism?

alliomeria commented 4 months ago

Follow up notes: thinking the Policy Extension Registry for listing that Wikidata Q ref is not actually actionable as written (does not set true context.json; this would not either).

Also, it seems the validator itself has set parameters for rights: https://github.com/IIIF/presentation-validator/blob/c3283776ef60161e40677ac96632fa507602c0aa/schema/iiif_3_0.json#L248-L267

Can anyone point me to an example of valid usage of URIs in 'rights' that are not CC or RightsStatements.org? Any Local Contexts, Traditional Knowledge references, for example?
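To make the validator behaviour concrete, here is a minimal sketch of a check that only accepts `rights` URIs under three fixed base URLs. The patterns are an approximation of the linked schema lines, not a copy of them, and the Wikidata and Local Contexts URIs are illustrative placeholders rather than the values used in the recipe.

```python
import re

# Assumption: approximates the three base URLs the validator schema accepts.
# Check the linked schema lines for the authoritative patterns.
ALLOWED_RIGHTS_PATTERNS = [
    r"http://creativecommons\.org/licenses/.*",
    r"http://creativecommons\.org/publicdomain/mark/.*",
    r"http://rightsstatements\.org/vocab/.*",
]


def rights_is_accepted(uri: str) -> bool:
    """Return True if the URI starts with one of the fixed base URLs."""
    return any(re.match(pattern, uri) for pattern in ALLOWED_RIGHTS_PATTERNS)


if __name__ == "__main__":
    candidates = [
        "http://creativecommons.org/licenses/by/4.0/",
        "http://rightsstatements.org/vocab/InC/1.0/",
        # Hypothetical placeholder for a Wikidata Q reference.
        "http://www.wikidata.org/entity/Q0",
        # Illustrative Local Contexts style URI.
        "https://localcontexts.org/label/tk-attribution/",
    ]
    for uri in candidates:
        print(uri, "->", "accepted" if rights_is_accepted(uri) else "rejected")
```

Under this kind of check, any URI outside the three base URLs is rejected no matter which extra `@context` the Manifest declares.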

DiegoPino commented 4 months ago

@alliomeria I agree. The JSON schema is fixed on those 3 base URLs, which differs from what the human-readable specs say. JSON Schema syntax allows a conditional oneOf (or any listing) based on, e.g., another value found somewhere else, which would, under certain conditions, allow any external URL if, e.g., a different context was provided. But that leads to a question: the Presentation API (3.0) specs suggest the extension mechanism for other URLs outside of the rights statements/CC domains, but it is not clear how the extension mechanism would have any effect at all on a "value" of an existing base property, or how that could be resolved in validation at all. From my understanding, the extension mechanism would allow the use of other JSON keys/properties, as any additional @context would according to JSON-LD, but a mapping to another vocab/ontology does not necessarily define a "different" value for an existing key, especially a key like "rights", which is already mapped to a very permissive (good) dcterms:rights in terms of what goes there.

Maybe there is space for interpretation in the specs for this?
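As a rough illustration of the conditional idea, here is the rule written out in plain Python rather than JSON Schema: accept the fixed base URLs always, and accept any other http(s) URI only when the Manifest declares a context beyond the Presentation 3 one. The rule, the base URL patterns, and the helper names are assumptions for discussion, not what the validator currently does.

```python
import re

PRESENTATION_3_CONTEXT = "http://iiif.io/api/presentation/3/context.json"

# Assumption: approximates the fixed base URLs in the validator schema.
BASE_RIGHTS_PATTERNS = [
    r"http://creativecommons\.org/licenses/.*",
    r"http://creativecommons\.org/publicdomain/mark/.*",
    r"http://rightsstatements\.org/vocab/.*",
]


def has_extra_context(manifest: dict) -> bool:
    """True if @context lists anything besides the Presentation 3 context."""
    ctx = manifest.get("@context")
    contexts = ctx if isinstance(ctx, list) else [ctx]
    return any(c is not None and c != PRESENTATION_3_CONTEXT for c in contexts)


def rights_is_accepted(manifest: dict) -> bool:
    """Hypothetical conditional rule for the `rights` value."""
    rights = manifest.get("rights")
    if rights is None:
        return True  # `rights` is optional
    if not isinstance(rights, str):
        return False
    if any(re.match(p, rights) for p in BASE_RIGHTS_PATTERNS):
        return True
    # Only reached for URIs outside the fixed base URLs.
    return has_extra_context(manifest) and rights.startswith(("http://", "https://"))
```

This only shows that the condition is expressible. It does not settle the open question of whether an extra context should change what counts as a valid `rights` value.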

azaroth42 commented 4 months ago

I think the specification is somewhere between misleading and flat out wrong here. We need a registry of additional rights URIs with an explanation of what a client should do when it encounters them.

Created https://github.com/IIIF/api/issues/2309 to this end.

Thanks @alliomeria for pushing into this somewhat unknown territory! :)

DiegoPino commented 4 months ago

@alliomeria @azaroth42 to allow this recipe to validate against 3.0 while 4.0 figures this out, could an additional JSON/JSON-LD property coming from, e.g., schema.org (I'm thinking specifically of usageInfo) be used for the Wikidata ML tags? That way rights could still be CC based while on 3.0, and usageInfo (https://schema.org/usageInfo) could serve as a stub in the meantime. Might be a stretch (sorry, don't want to derail this great effort), and I don't know if schema.org is even valid in the IIIF specs as a registered extension.

Just an end-of-the-day idea, but it might cover the machine-readable part, since I am pretty sure some crawlers like Google know how to parse that complete @context and thus its properties and values.

azaroth42 commented 4 months ago

You could create a JSON-LD context document that defines a new property for IIIF (either de novo or by mapping from an existing ontology like schema) for sure. It would then fall into the extensions part of the spec directly and you could put whatever values you wanted in it.
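A minimal sketch of what that could look like, built in Python so the JSON can be printed and checked. The extension context URL, the Manifest id, and the Wikidata URI are hypothetical placeholders, and the Manifest is a fragment (no `items`), so it is not a complete valid Manifest on its own. The property is mapped from schema.org's usageInfo as discussed above.

```python
import json

PRESENTATION_3_CONTEXT = "http://iiif.io/api/presentation/3/context.json"
# Hypothetical URL where the extension context document would be published.
EXTENSION_CONTEXT_URL = "https://example.org/iiif/extension/ml-usage/context.json"

# Context document that defines a new property by mapping it to schema:usageInfo.
EXTENSION_CONTEXT = {
    "@context": {
        "schema": "https://schema.org/",
        "usageInfo": {"@id": "schema:usageInfo", "@type": "@id"},
    }
}


def build_manifest_fragment(usage_uri: str) -> dict:
    """Manifest fragment with a CC `rights` value plus the extension property."""
    return {
        # The Presentation 3 spec asks for its own context to be last in the list.
        "@context": [EXTENSION_CONTEXT_URL, PRESENTATION_3_CONTEXT],
        "id": "https://example.org/iiif/manifest.json",  # hypothetical
        "type": "Manifest",
        "label": {"en": ["ML usage tag example"]},
        "rights": "http://creativecommons.org/licenses/by/4.0/",
        "usageInfo": usage_uri,
    }


if __name__ == "__main__":
    # Hypothetical placeholder for the Wikidata Q reference used as the ML tag.
    manifest = build_manifest_fragment("http://www.wikidata.org/entity/Q0")
    print(json.dumps(EXTENSION_CONTEXT, indent=2))
    print(json.dumps(manifest, indent=2))
```

With this shape, `rights` stays within the URLs the 3.0 validator accepts, and the ML usage value lives in a property the extension context defines.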

alliomeria commented 4 months ago

> I think the specification is somewhere between misleading and flat out wrong here. We need a registry of additional rights URIs with an explanation of what a client should do when it encounters them.
>
> Created IIIF/api#2309 to this end.
>
> Thanks @alliomeria for pushing into this somewhat unknown territory! :)

Thanks for pushing this into a discussion of potentially revising the specs for 4.0 (*1.4.0/1.3.0 = Archipelago versioning 🤓 ), @azaroth42. :) I was a bit puzzled trying to piece together how to work with the extensions mechanism as currently described.

For 3.0, I appreciate your suggestions, @DiegoPino, for a potentially actionable way to address this right now. Happy to give that approach an attempt...

Diego, Rob, Glen, or anyone watching this issue, what else might you suggest as ways to work within the current specs to provide valid, actionable rights statement (broader sense of this terminology) declarations?

Also, what exactly is the functional result of a client hitting the current version of rights URIs from CC/RightsStatements.org?

alliomeria commented 4 months ago

Looking around a bit more, came across this: https://github.com/IIIF/api/blob/main/source/registry/rights/index.md

Maybe timely for proposing filling out for CC/RightsStatements.org, and inclusion of alternate rights statements, including Local Contexts notices, Traditional Knowledge labels, and others (like these 🙃 )?

Or perhaps the Rights Registry is an unused/abandoned area?

glenrobson commented 4 months ago

Discussed in cookbook meeting. Suggested way forward:

kirschbombe commented 4 months ago

Just FYI on the Registry pages, we've been working on adding Registry pages and moving things from the annex to the Registry as we make updates. I haven't worked on it in a bit, but I probably have a branch with the in-progress changes. If the group comes together on agreed URIs to add, let me know and I will add them to my draft. I can also prioritize work to update the Rights registry page.

kirschbombe commented 4 months ago

Here's the draft PR for the Registry page with a preview link: https://github.com/IIIF/api/pull/2248

alliomeria commented 4 months ago

Based on the feedback received on this PR and during today’s cookbook call, I made a few additional updates to the recipe. Check it out here: https://preview.iiif.io/cookbook/0514-ml-usage-tags/recipe/0514-ml-usage-tags/

Thank you very much to everyone who shared helpful feedback and recommendations so far, I really appreciate your time and consideration. Looking forward to continuing to work with the community to discuss this and potentially move further along through the official pipelines.

alliomeria commented 4 months ago

Hello everyone watching this pull request 👋

During yesterday’s IIIF AI + ML Group Meeting, I had the opportunity to present again on this proposal for ML/AI Usage statements in IIIF Manifests. In the follow-up discussion after the presentation, Ellen Van Keer shared an interesting question/comment about how these statements might work within the context of the recently enacted EU AI Act. Specifically, Ellen was concerned that EU organizations might not be able to apply these usage statements if they were not the primary copyright holder for a given object/resource.

Ellen, thank you so much for raising this important topic during the call. If possible, could you please provide references supporting the concern that institutions may not be able to apply any kind of 'opt-out' or usage statement unless they are the primary copyright holder for a work? I'm also curious how this might apply to other usage/rights statements already at play currently.

From what I am reading stateside, I am not seeing the text for opt-out mechanisms or usage statements described in that particular way. In the law itself (EN version here), I see this text: "Where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightsholders if they want to carry out text and data mining over such works."

In legal analyses and practical discussions, such as the ones noted here and a few more below this message (sorry for the many links, trying to go through due diligence), it seems like the EU AI Act and TDM Directive can be interpreted differently and are still not yet fully defined.

From this source:

"The EU AI Act contains a provision that equates AI/machine learning with “text-and-datamining” (TDM) under the EU Text and Data Mining Directive.[1] Consequently, “machine learning” is allowed, provided that:

The EU AI Act is expected to enter into force in 2024 and will fully apply 24 months thereafter. However, the TDM exception under the EU Text and Data Mining Directive already exists. Therefore, the TDM exception for machine learning can already be enforced in anticipation of the EU AI Act’s interpretation.”

Would anyone be willing to share their perspective on the potential procedures or enforcement mechanisms at play for the EU AI Act and TDM Provision in terms of opt-out requests and the applicability of any kind of usage statement? Is this an area where variance between institutions and local policies is anticipated?

In any case, I think that these ML/AI Usage Statements could still be useful, and even provide an actionable mechanism for helping an institution comply with an "opt out" request that was "expressly reserved in an appropriate manner".

Thanks again for bringing up this important related potential factor, Ellen. And thanks to everyone who has been taking this proposal into consideration and sharing feedback. I really appreciate everyone's time and expertise.

Additional Links:

alliomeria commented 3 months ago

Hello everyone who may be watching this repo/issue, checking in to see if anyone might have time to add some follow-up comments to the issue discussion noted here. It would be great to have more perspective about the EU AI Act & TDM Provision considerations. Thanks for your time! (I also gave a IIIF Slack ping, so apologies for the redundancy in messages related to this.)

veesalu commented 3 months ago

I believe Ellen referred to the DSM (Digital Single Market Strategy) which also applies to digital repositories of cultural heritage institutions in the EU.

The copyrights and licenses question is rather firmly regulated and only the rights holder has the right to assign licenses or access/usage terms to the works in copyright. CHIs in the EU can't legally assign licenses for works for which we don't own the rights.

There are different ways of getting material into the collections, but for the National Library of Estonia that is based on the Legal Deposit Copy Act. A publication must be submitted to NLE, and we have the obligation to preserve it long term and make it available in accordance with the Copyright Act. During the act of deposit and based on the Copyright Act, the rights holder assigns licenses and/or terms (in our case either CC or RightsStatements) based on their wishes and intentions. We also don't have the right to change access and usage terms of orphan and out-of-commerce works based on our own judgement; we use EUIPO's portals in order to get the grounds to make them available.

Now, I guess it depends on how the ML usage tags are defined in the landscape of access and rights. The first idea that also came to my mind during the call was that we could use the tags for our own publications, and to others we could add "check with the rights holder". Like Ellen, I doubt we could apply the tags in a legally binding way to other rights holders' works if the ML usage tags are defined as licenses or usage conditions similar to CC or RightsStatements.

Regarding the TDM, data and text mining of works in copyright can only be done on the premises of NLE, and outside researchers can only leave the premises with cleaned and worked-on data. Also, the research must be done in a "motivated" amount, meaning they can't collect and use our entire collection. In order for a researcher or research institution to get access to the data, they need to file an application in which they present their research, justify the need for the data and explain what they do with it. So the TDM simply doesn't mean that anyone who says that they do research has automatic access to the data. There was a legal analysis carried out for our digital lab, there's a summary in English -> https://digilab.rara.ee/wp-content/uploads/2023/03/Virtual-LAB_eng_oigusanaluus.pdf

Another issue that I have been thinking about (and have yet to reach a conclusion on, even for myself) is that in Europe, we are in the make-it-accessible-and-reuse-freely stage. The European Commission is geared towards accessibility, popularisation and re-use of cultural heritage. It's quite impossible to get funding for infrastructure or software if you don't plan "smart solutions". At the NLE we are currently applying for funding for three (two EC funded) AI/data science projects, and we are building our own ML solution for automatic cataloguing. We have the European Collaborative Cloud for Cultural Heritage and the Common European Data Space for Cultural Heritage, which both aim to reduce duplication of data and improve collaboration. I'm not sure how easy a sell the ML usage restrictions could be in Brussels. But this is already a whole other discussion in itself.

alliomeria commented 3 months ago

Thank you for your thoughtful follow-up, veesalu, and for providing a link to the NLE's analysis that informs your institution's approach to TDM. It's really interesting to read through the perspective and the context you're all working with. Good luck with your pursuit of AI/data science projects and ML cataloging assistance tools. Looking forward to reading about your outcomes down the road. :)

Related to this topic, I wanted to note that CC is moving towards being receptive to the idea that perhaps creators should be able to have additional options within the CC licensing framework, loosely termed as "preference signals" in this blog piece where they discuss their early explorations on this concept: https://creativecommons.org/2024/07/24/preferencesignals/. (I will be reaching out to CC to ask about this, maybe there's some space for collaboration at a shared table.)

I understand that applying nuance within the frameworks of open sharing culture can be challenging. That said, I still think there are ways we can better attune our practices to the complexities around the considerations artists, authors, and other creators and content caretakers are facing in the modern AI/ML internet landscape.

veesalu commented 2 months ago

That was very interesting reading, thanks!

I lean towards agreeing that tags are of use; the issue is that in the EU only the rightsholders can opt in or opt out, and in most cases we are not the rightsholders. So implementing those is not just an in-house project of deciding that from now on we do it like that; it needs a bigger change in processes in general. I'm a little bit on the fence about this issue because, on the one hand, I feel that we should make as much of CH accessible as possible, but on the other hand, we need to consider reasonable infrastructure loads, etc.

Another issue in using or not using CH data in ML is that right now LLMs don't really speak small languages (like Estonian), and the text corpora held in our collections are valuable material for training the models.

I do agree that this is a discussion that we should continue.

alliomeria commented 2 months ago

This is definitely a topic and practice area with a lot of nuanced considerations at play, for cultural heritage and other fields that may end up using ML technology. I think the true usefulness of ML-assisted tools really remains to be seen in many ways, and I hope that there will be actual comparative studies conducted in CH/GLAM analyzing the effectiveness of ML tools compared with traditional practices and other technical approaches. I also hope that we can have an impact on how particular ML technical applications are developed for our field.

Thanks again for sharing your time and feedback related to this issue.