Open boehs opened 2 months ago
Yeah seconding this. Fences make good neighbors, and ward off lawsuits.
While I do think this is, in some respects, functionality already covered by robots.txt, much of the AI industry seems to think robots.txt doesn't apply to them, so a more explicit permissions clause in llms.txt would help.
Something like:
## Permissions
Precedence: trainonly, referenceonly, allow, disallow
disallow: /
trainonly: /blog/archives
referenceonly: /current-data
allow: /blog
What this is saying is: when more than one rule matches a path, the order on the `Precedence` line decides which one applies, from highest priority to lowest.
This would provide a way for websites to choose what can be referenced, what can be trained on, and what must be excluded.
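To make the idea concrete, here's a rough Python sketch of how a crawler might read that block and decide what applies to a given path. Nothing here is part of any llms.txt spec: `parse_permissions` and `resolve` are made-up names, matching by path prefix is my assumption, and so is the reading that the `Precedence` line breaks ties between all matching rules (rather than, say, longest prefix winning).

```python
# Hypothetical sketch: directive names come from the proposal above,
# but the parsing and matching rules are assumptions, not a spec.
EXAMPLE = """\
## Permissions
Precedence: trainonly, referenceonly, allow, disallow
disallow: /
trainonly: /blog/archives
referenceonly: /current-data
allow: /blog
"""


def parse_permissions(text):
    """Return (precedence list, list of (directive, path prefix) rules)."""
    precedence = []
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, headings, and anything that isn't "key: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        value = value.strip()
        if key == "precedence":
            precedence = [p.strip().lower() for p in value.split(",") if p.strip()]
        else:
            rules.append((key, value))
    return precedence, rules


def _matches(prefix, path):
    """Assumed semantics: a rule covers its exact path and everything under it."""
    base = prefix.rstrip("/")
    return path == prefix or path == base or path.startswith(base + "/")


def resolve(path, precedence, rules):
    """Pick the matching directive that appears earliest on the Precedence line."""
    matched = {directive for directive, prefix in rules if _matches(prefix, path)}
    for directive in precedence:
        if directive in matched:
            return directive
    return None  # no rule matched; what to do then is left open


precedence, rules = parse_permissions(EXAMPLE)
for p in ["/blog/archives/2020", "/blog/post", "/current-data", "/about"]:
    print(p, "->", resolve(p, precedence, rules))
```

With the example block, `/blog/archives/2020` resolves to `trainonly`, `/blog/post` to `allow`, `/current-data` to `referenceonly`, and `/about` falls back to `disallow`. Under a longest-prefix reading the `Precedence` line would be mostly redundant, which is why the sketch treats it as the tie-breaker.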
Possible complication: maybe this would be extended to let people set separate permissions for different media types (i.e. "Yes, you can train on the text, but please don't download all the videos for training").
Possible argument against: if this is intended for inference, maybe an extension to robots.txt that lets website owners specify LLM permissions is a better move.
If llms.txt is used to supplement perceived failures of robots.txt, it should also correct in the other direction. I'd like a way to specify a