AnswerDotAI / llms-txt

The /llms.txt file, helping language models use your website
http://llmstxt.org/
Apache License 2.0
212 stars, 17 forks

Opt Out #3

Open boehs opened 2 months ago

boehs commented 2 months ago

If llms.txt is meant to make up for perceived failures of robots.txt, it should also correct in the other direction. I'd like a way to specify either:

  1. Blanket opt-out of my site being used by LLMs
  2. Or a price tag for the usage of my site's content, alongside instructions on how to license it.

shayneoneill commented 2 months ago

Yeah seconding this. Fences make good neighbors, and ward off lawsuits.

While I do think that in some respects this functionality is covered by robots.txt, much of the AI industry seems to think robots.txt doesn't apply to them, so a more explicit permissions clause in llms.txt would help.

Something like:

## Permissions
Precedence: trainonly, referenceonly, allow, disallow
disallow: /
trainonly: /blog/archives
referenceonly: /current-data
allow: /blog

The Precedence line lists the directives from highest to lowest priority. So what it's saying is: "The default here is stay away. However, you may train on the archives but not reference them in answers, and you may only reference current-data, not train on it. For the /blog directory, you may do both, but since trainonly has higher precedence you must exclude the archives from referencing."

This would provide a way for websites to choose what can be referenced, what can be trained on, and what must be excluded.
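
To make the precedence rule concrete, here is a minimal Python sketch of how a crawler might resolve a path against the block above. Nothing here is part of any llms.txt spec: the function names (`parse_permissions`, `resolve`), the whole-segment prefix matching, and the reading of each directive as a (train, reference) pair are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed meaning of each directive: (may_train, may_reference).
DIRECTIVES = {
    "trainonly": (True, False),
    "referenceonly": (False, True),
    "allow": (True, True),
    "disallow": (False, False),
}

EXAMPLE = """\
## Permissions
Precedence: trainonly, referenceonly, allow, disallow
disallow: /
trainonly: /blog/archives
referenceonly: /current-data
allow: /blog
"""


@dataclass
class Permissions:
    precedence: list[str]          # highest priority first
    rules: list[tuple[str, str]]   # (directive, path prefix)


def parse_permissions(text: str) -> Permissions:
    """Parse the hypothetical permissions block into rules."""
    precedence: list[str] = []
    rules: list[tuple[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the markdown heading
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "precedence":
            precedence = [name.strip().lower() for name in value.split(",")]
        elif key in DIRECTIVES:
            rules.append((key, value))
    return Permissions(precedence, rules)


def _matches(prefix: str, path: str) -> bool:
    # Match on whole path segments, so "/blog" does not cover "/blogger".
    base = prefix.rstrip("/")
    return path == base or path.startswith(base + "/")


def resolve(perms: Permissions, path: str) -> str | None:
    """Return the highest-precedence directive whose prefix covers path."""
    matching = [d for d, prefix in perms.rules if _matches(prefix, path)]
    if not matching:
        return None

    def rank(directive: str) -> int:
        # Directives missing from the Precedence line sort last.
        if directive in perms.precedence:
            return perms.precedence.index(directive)
        return len(perms.precedence)

    return min(matching, key=rank)


perms = parse_permissions(EXAMPLE)
for path in ("/blog/archives/2020", "/current-data/prices", "/blog/hello", "/about"):
    directive = resolve(perms, path)
    may_train, may_reference = DIRECTIVES[directive]
    print(f"{path}: {directive} (train={may_train}, reference={may_reference})")
```

Under these assumptions, `/blog/archives/2020` resolves to `trainonly` even though `allow: /blog` also matches it, which is the "exclude the archives from referencing" behaviour described above.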

Possible complication: this might need to be extended to let people set separate permissions for different media types (e.g. "Yes, you can train on the text, but please don't download all the videos for training").

Possible argument against: if this is intended for inference, maybe an extension to robots.txt that lets website owners specify LLM permissions is a better move.
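
For comparison, here is a small sketch of what plain robots.txt can already express, using Python's standard `urllib.robotparser`. The crawler name `ExampleLLMBot` and the example.com URLs are made up for illustration. Note that it can only say "fetch or don't fetch" per user agent; it has no way to separate training from referencing, which is the gap an extension would have to fill.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) LLM crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent: str, url: str) -> bool:
    """Ask the parsed robots.txt whether this agent may fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_fetch("ExampleLLMBot", "https://example.com/blog"))  # False
print(may_fetch("SomeSearchBot", "https://example.com/blog"))  # True
```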