element-hq / riot-meta

A place to experiment with tracking features at a higher level than Riot web/iOS/Android.

Delete media nobody has access to any more #166

Open lampholder opened 6 years ago

turt2live commented 6 years ago

For the implementation it would be nice if this went through the media provider structure that already exists.

In an ideal world the feature could live independently of synapse, although I question if that's possible.

Also, encrypted media will appear as "unreferenced" and we'll have to be careful not to delete it.

Half-Shot commented 6 years ago

You'd need to start tagging media with roomids surely to be absolutely certain they are not referenced anywhere else? Remember media can also be inline too so they might not appear in m.image/m.video/m.audio.

Also the above suggestion would weaken encryption somewhat by adding yet more metadata, so you'd need to be very careful with this.

richvdh commented 6 years ago

How do we know when nobody has access to it any more?

richvdh commented 6 years ago

(I guess that's what @turt2live and @Half-Shot are saying, but generally: I don't really know what this means)

lampholder commented 6 years ago

It might be that we need to tag media items with event ids such that, when we process the erasure of the event id we can simply also erase the corresponding media at the same time. If this is the case, we can probably populate the missing event ids for existing media in the repo by crawling through all the accessible message history.

Encrypted events will not be available to crawl in this way, but that might be fine because:

  1. I think we can assert that any media item in the media repo that cannot be tied to an event must have been created in association with an encrypted event, and therefore is itself encrypted
  2. Historical encrypted media need not necessarily be managed in the same way as unencrypted media, because the users' management of encryption keys essentially gives them the control they need
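
As a rough illustration of the crawl described above, here is a minimal Python sketch that walks a list of unencrypted events and records which event ids mention each mxc URI, including URIs that only appear inline in a message body. The event dictionaries, the `index_media` and `unreferenced` helpers, and the simplified mxc regex are all assumptions made for the example; none of this is an existing Synapse or media repo API.

```python
import re
from collections import defaultdict

# Simplified pattern for mxc://<server_name>/<media_id>; an assumption, not the spec grammar.
MXC_RE = re.compile(r"mxc://[A-Za-z0-9.\-:]+/[A-Za-z0-9_\-]+")


def find_mxcs(value):
    """Recursively collect every mxc URI found in any string inside an event's content."""
    found = set()
    if isinstance(value, str):
        found.update(MXC_RE.findall(value))
    elif isinstance(value, dict):
        for child in value.values():
            found |= find_mxcs(child)
    elif isinstance(value, list):
        for child in value:
            found |= find_mxcs(child)
    return found


def index_media(events):
    """Map each mxc URI to the ids of the events that reference it, in crawl order."""
    index = defaultdict(list)
    for event in events:
        for mxc in sorted(find_mxcs(event.get("content", {}))):
            index[mxc].append(event["event_id"])
    return dict(index)


def unreferenced(all_media, index):
    """Media the crawl could not tie to any event (e.g. referenced only from encrypted events)."""
    return sorted(set(all_media) - set(index))
```

Note that an empty result from the crawl does not prove the media is unused: encrypted events are opaque to it, and media may be referenced from outside Matrix entirely.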

We probably also need to think about whether we need special provision for media repos forwarding a GDPR Art. 17 request somehow. My gut feeling is that this should not be handled by the media repo - the media repo should just delete media it is asked to by the homeserver when the homeserver wants to delete a media event.

Finally, this has some implications for recycling mxcs (inasmuch as that would have to be discouraged if media weren't just to disappear by surprise). I think there's a very strong case for discouraging the recycling of mxcs, though.

Half-Shot commented 6 years ago

I think we can assert that any media item in the media repo that cannot be tied to an event must have been created in association with an encrypted event, and therefore is itself encrypted

There are a few reasons why this might not be the case:

Though for the former, you could probably grep events (at great expense of performance) for any mentions in the body. The latter arguably isn't really a good use of Matrix, but it is still valid.

I guess what I'm saying is unless you can absolutely prove the media isn't being used by anything then it's best not to delete it.

An alternative might just be retention policies based on the last access date? If you know that:

I think there's a very strong case for discouraging the recycling of mxcs

I'd be shocked if there was ever a reason to recycle mxcs.

turt2live commented 6 years ago

I think we can assert that any media item in the media repo that cannot be tied to an event must have been created in association with an encrypted event, and therefore is itself encrypted

User avatars can't really fit into this: it'll be a 1:many relationship because whatever crawler would pick up on more than 1 event referencing the media. If the avatar gets redacted in a room it shouldn't delete it.

There's also the case of people using the media repo directly for whatever reason, and not referencing it in matrix. For instance, the IRC bridge auto-pastebins long messages via the media repo.


Federated media is a bit harder: who has a copy of the media? It may not be possible to guarantee that all servers in the room have the media, and it may have been cached by parties not in the room as well. It may be acceptable to just forward a bulk delete request (because individual requests would be bad) to other servers, speccing that they are required to honour it with the known caveat that we can't force another server to delete something.

See also:

On the not-GDPR front: deleting media that has been redacted is another hard problem to solve due to forwarded events, the person redacting may not belong to the origin server, etc.

Half-Shot commented 6 years ago

@turt2live I understood this issue to be about locally removing media rather than nuking it for everyone. From a space saving perspective? Even from a GDPR pov do we care what others store?

lampholder commented 6 years ago

I guess what I'm saying is unless you can absolutely prove the media isn't being used by anything then it's best not to delete it.

This might be a philosophical question. Is the matrix media repo a place to store media, or a place to store media in support of events in matrix rooms? Also there are probably different versions of 'best' here - under GDPR it might be that, if we couldn't answer when asked why we have something, perhaps we shouldn't have it :\

User avatars can't really fit into this: it'll be a 1:many relationship because whatever crawler would pick up on more than 1 event referencing the media. If the avatar gets redacted in a room it shouldn't delete it.

Gah, I'd forgotten about avatars - would we be able to do some smart inspection of the event types to filter those though? And I was kinda thinking we'd have to handle the 1:many relationship anyway (to handle event forwarding/other random mxc recycling) - would it be tractable to associate the mxc with the 'first' event referencing it?
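
To make the 'first event' idea concrete, here is a small Python sketch. It assumes references have already been extracted as `(event_id, event_type, mxc)` tuples in the order the events were seen, and it treats `m.room.member` and `m.room.avatar` as the avatar-carrying event types to skip. The `first_referencing_event` helper and the tuple layout are hypothetical.

```python
# Event types assumed to carry avatars rather than "real" media posts.
PROFILE_EVENT_TYPES = {"m.room.member", "m.room.avatar"}


def first_referencing_event(references):
    """Return {mxc: event_id} using the first non-avatar event that mentions each mxc.

    `references` is an iterable of (event_id, event_type, mxc) tuples in crawl order.
    """
    owner = {}
    for event_id, event_type, mxc in references:
        if event_type in PROFILE_EVENT_TYPES:
            continue  # avatars are 1:many, so never let them "own" the media
        owner.setdefault(mxc, event_id)  # later references (forwards, reuse) are ignored
    return owner
```

Media that is only ever used as an avatar ends up with no owning event at all under this scheme, so it would need separate handling.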

On the not-GDPR front: deleting media that has been redacted is another hard problem to solve due to forwarded events, the person redacting may not belong to the origin server, etc.

I'm thinking we'd want to reconsider the 'forwarded event' idea (although if all media is created with a reference to an associated event id going forward this wouldn't necessarily be a problem since you could identify when an event was not the media-creating event).

Also I forgot to capture the complexity that associating all new media with an event id might be convoluted since you'd need to upload the media to get the mxc to put into the event to get the id to put into the media repo...

turt2live commented 6 years ago

@turt2live I understood this issue to be about locally removing media rather than nuking it for everyone. From a space saving perspective? Even from a GDPR pov do we care what others store?

@Half-Shot I'd imagine as part of GDPR best effort should be applied to try and erase the user's existence.

The complication of proving ownership of media could just be "server name matches" with a signed request (see also: https://github.com/matrix-org/matrix-doc/issues/701#issuecomment-394121896)
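
A minimal sketch of the "server name matches" check, in Python. It only compares the server name embedded in the mxc URI with the name of the requesting server; verifying the request signature is out of scope here and is represented by a plain boolean. `parse_mxc` and `may_request_deletion` are hypothetical helper names.

```python
def parse_mxc(uri):
    """Split mxc://<server_name>/<media_id> into its two parts, or raise ValueError."""
    prefix = "mxc://"
    if not uri.startswith(prefix):
        raise ValueError("not an mxc URI: %r" % uri)
    server_name, sep, media_id = uri[len(prefix):].partition("/")
    if not server_name or not sep or not media_id or "/" in media_id:
        raise ValueError("malformed mxc URI: %r" % uri)
    return server_name, media_id


def may_request_deletion(mxc_uri, requesting_server, signature_valid):
    """Allow deletion only for a correctly signed request from the media's origin server.

    `signature_valid` stands in for whatever signature check the request went through.
    """
    server_name, _media_id = parse_mxc(mxc_uri)
    return bool(signature_valid) and server_name == requesting_server
```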

would it be tractable to associate the mxc with the 'first' event referencing it?

Depends entirely on how you'd want to consider it. If someone forwards an image someone else sent - who owns that image? Both people can probably be considered the "owner" of the media and therefore linked to it, despite the original uploader being the only one associated. If the second person wanted to be forgotten, should that image be deleted?

There's also the special case of stickers (and probably a ton of other stuff): surely if someone deletes their account then we shouldn't go around deleting stickers (because mxc reuse).

Is the matrix media repo a place to store media, or a place to store media in support of events in matrix rooms?

"Yes" is kinda the answer, unfortunately. Because the repo is so generic some people use it as a CDN while others (probably most) use it as intended: for matrix. There's a couple people out there that (for some reason) host their entire website off the media repo. Even I'm personally going the direction of using the media repo to give bots avatars on my website (instead of having the media duplicated everywhere).


One possible solution (that falls apart quickly with the right to erasure) for the redacting media side is to just let the client deal with it. The client would send a DELETE to the media repo alongside the redact to the homeserver. This has the concern of proving ownership (or authority) to delete the media, but it does mean that the user has the option to hard-delete encrypted media (at least from their server).

turt2live commented 6 years ago

fwiw, the advice I got from unpaid lawyer irl friends was it might be best to try and associate media with users rather than events. The redacting problem can probably be pushed further down the line, despite my attempts to solve it alongside gdpr.

(the legal advice is what drove this btw: https://github.com/turt2live/matrix-media-repo/issues/96)

Half-Shot commented 6 years ago

Actually, I'm kinda surprised the local homeserver doesn't log who uploaded it given we require the access_token anyway.

richvdh commented 6 years ago

We do track the user that uploaded a given bit of media. The problem we have is that we've decided not to automatically delete all content when a user asks to be erased (cf https://matrix.org/blog/2018/05/08/gdpr-compliance-in-matrix/) so that information doesn't help us.

richvdh commented 6 years ago

[though I wonder if this is a rather dangerous situation - it works for messages because we will restrict access to users who were in the room, but there is nothing stopping the url for a bit of media being available somewhere outside of an event, and public for everyone to see, despite the uploader having asked to be erased]

lampholder commented 6 years ago

For messages, the homeserver replicates the 'email experience', so users can always see the messages that were sent to them even after the sender executes their right to erasure.

For simplicity of reasoning, it would be great if the media repo could replicate the same experience. But today it can't, 'cause it has no concept of ACLs or visibility - if you have the URL, you have the data blob.

All we know from the media repo is that a given piece of content was uploaded by a given user. I don't think the media repo exposes this information at all (if I'm reading the docs right).

As described in my comment the other day, if we associate a media item with the event id of the event that posted it, this supports our deleting the media at the point at which we're erasing the event from our database (the point at which no active matrix user can claim visibility of that message). By itself, this does not address @richvdh's point - we'd still be serving the media to anyone with the URL, which could (easily) have been shared out of band.

We could more radically overhaul the media repo to both make it aware of event ids and make it validate the requester's right to see that media by piping the event id back through synapse. This isn't going to happen quickly, though, without a lot of collateral damage. At the minimum we'd need to:

If we don't do the above, then we have a range of options between, I think, two extremes. When Alice GDPR17s herself:

  1. we delete all media uploaded by her and pass on a request to all federated homeservers to please do the same
  2. we draw the parallel that having the unique URL is the same as having the content, so in posting a media event with a reference to the mxc, Alice has transferred that media from herself to anyone who receives it (via whatever means). We delete nothing.

richvdh commented 6 years ago

we draw the parallel that having the unique URL is the same as having the content, so in posting a media event with a reference to the mxc, Alice has transferred that media from herself to anyone who receives it (via whatever means). We delete nothing.

which might be fine if the average user had the first clue what an event with an mxc was. We go to a lot of effort to make it easy just to send a cat gif to a room - Alice has no reason to realise that there is a whole separate media repo, and can expect that, having realised her error, become a dog person, and requested erasure, that we won't continue to serve incriminating cat evidence.

lampholder commented 6 years ago

If someone forwards an image someone else sent - who owns that image? Both people can probably be considered the "owner" of the media and therefore linked to it, despite the original uploader being the only one associated. If the second person wanted to be forgotten, should that image be deleted?

I think this is really just a case of making a decision (by which I mean I think there's a tractable technical impl regardless of which conclusion we draw as to the philosophical or legal ownership of the media content).

Personally, I'd like to keep it simple and say "you didn't forward the media, you forwarded a reference to that media, and the media repo's contract (to leave the media a given mxc refers to intact) is only with the uploader, nobody else".

If we're explicit about this, and as builders of Riot.im give users features that are not likely to fall foul of this, then I think we can have much simpler lives by dissuading the recycling of mxcs generally.

richvdh commented 6 years ago

I don't think the media repo exposes this information at all (if I'm reading the docs right).

correct, afaik.

lampholder commented 6 years ago

which might be fine if the average user had the first clue what an event with an mxc was. We go to a lot of effort to make it easy just to send a cat gif to a room - Alice has no reason to realise that there is a whole separate media repo, and can expect that, having realised her error, become a dog person, and requested erasure, that we won't continue to serve incriminating cat evidence.

Certainly whatever we decide, we can do some good work to clarify the situation by adding some additional UX to the media upload (of course, that only really helps with Riot.im, unless we do something really weird to the media upload API).

lampholder commented 6 years ago

We can of course treat the two services as entirely distinct, with distinct erasure policies.

With media items' being associated with the user id, we could give the user a 'media control panel' they can use to see all the media they've put in a given repo, to then erase it (or submit an erase request that we honour after n whenevers). They could then choose to erase all media on account deactivation (with the similar warnings about what that will do to other users' experience of the service).

Another idea - we could enhance the media repo so that media has an expiry date, which is advertised in the UX when the expiry date is close, and which "media owners" can choose to reduce to 30 days from today if they like.
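
One way the expiry rule could be expressed, as a hedged Python sketch: an owner may bring the expiry forward, but never to less than 30 days from now, and never past an expiry that is already set. The `reduce_expiry` helper, the `MIN_NOTICE` constant, and the idea of storing a single expiry timestamp per media item are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumed minimum notice period, taken from the "30 days from today" figure above.
MIN_NOTICE = timedelta(days=30)


def reduce_expiry(current_expiry, requested_expiry, now):
    """Return the new expiry for a media item after an owner asks to shorten it.

    The result is never earlier than now + MIN_NOTICE and never later than the
    expiry that is already in place (None means the media currently never expires).
    """
    new_expiry = max(requested_expiry, now + MIN_NOTICE)
    if current_expiry is not None:
        new_expiry = min(new_expiry, current_expiry)
    return new_expiry
```

A request made when the media is already within 30 days of expiring simply leaves the existing date untouched.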

turt2live commented 6 years ago

fwiw, the concept of "who can see this media" starts to overlap with https://github.com/matrix-org/synapse/issues/2150

turt2live commented 6 years ago

I'm not personally a fan of expiration dates on media for various reasons. Primarily, it makes backlogging hard (as some people will set insane expiration times), and searching the room's history can become a useless effort. Obviously if the user decides to erase all their media then these points still apply, however that's more of an acceptable risk to me than having a 30, 60, or whatever day expiration.

Linking events to media has the further concern of leaking metadata in encrypted rooms.

I somewhat suspect whatever gets chosen as a way to identify who can access media (https://github.com/matrix-org/synapse/issues/2150) will end up also being able to identify who owns the media. From there it's just a matter of when someone asks to be erased, the media repo runs DELETE FROM media WHERE owner = '@travis:t2l.io'
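
For what that last step might look like in practice, here is a small runnable Python sketch using an in-memory SQLite database. The `media` table with `mxc` and `owner` columns is a made-up schema that mirrors the query above, not the actual schema of Synapse or matrix-media-repo, and `erase_user_media` is a hypothetical helper. The owner is passed as a bound parameter rather than interpolated into the SQL string.

```python
import sqlite3

# Hypothetical schema: one row per media item, tagged with the uploading user.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (mxc TEXT PRIMARY KEY, owner TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO media (mxc, owner) VALUES (?, ?)",
    [
        ("mxc://t2l.io/aaa", "@travis:t2l.io"),
        ("mxc://t2l.io/bbb", "@travis:t2l.io"),
        ("mxc://t2l.io/ccc", "@alice:t2l.io"),
    ],
)


def erase_user_media(conn, owner):
    """Delete every media row owned by `owner` and return how many rows were removed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute("DELETE FROM media WHERE owner = ?", (owner,))
    return cursor.rowcount
```

This only removes the database rows; a real media repo would also have to delete the stored files and any thumbnails.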

ara4n commented 6 years ago

I've written up a proposal for this at https://github.com/matrix-org/matrix-doc/issues/701