Closed kevATin closed 3 years ago
"Checksums" won't work, because PeerTube re-encodes everything on upload, and any tiny change to the video encoding changes an ordinary hash. There are video fingerprinting algorithms that look at the actual content, but they are more expensive. There may be reasonable compromises, such as running a closer comparison only on videos of similar length. Ultimately, though, I feel this is getting into automated video moderation, which is hard to do ethically and correctly.
@scanlime Even if the source video is the exact same, the resulting re-encodes will differ? I knew that re-encoding was sometimes a bit off but didn't think it was this imprecise.
However if I remember correctly there was an open issue regarding the storage of uploaded video source files. Maybe those could be used instead?
Do you know of any open source video fingerprinting software?
Even with no automation whatsoever as a purely manual set of moderation tools, I think de-duplication would still be useful.
It's not useful to hash videos to detect duplicates, unless you are trying to detect folks who upload the exact same source file bits. If you remux the file, if you download it with youtube-dl, certainly if you transcode it at all, the hash will change. Video software is deterministic but complicated, and everyone's configuration is going to be slightly different.
Hello,
Video deduplication is out of the scope of PeerTube. A plugin or a third-party tool could help. You could, for example, use a perceptual comparison instead of checksums. See this blog post (in French) by @rigelk: https://rigelk.eu/blog/video-similarity/
Sorry to comment on this closed issue, but I wrote something[1] to solve this duplication issue, and I believe it's more efficient than the solution in the link. Only one 64-bit hash is generated per input video, so far fewer comparisons are required than with the solution in the link. One hash per video also saves a lot of database space, and the time complexity of the comparison drops to O(n).
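To illustrate the one-hash-per-video idea, here is a minimal sketch of how a lookup could work. It assumes the 64-bit hashes are compared by Hamming distance, which is common for perceptual hashes but is not confirmed for the tool mentioned above. The function names, the threshold of 8 bits, and the hash values in the usage note are all made up for the example.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def find_similar(
    candidate: int,
    known: Dict[str, int],
    max_distance: int = 8,  # assumed threshold, would need tuning on real data
) -> List[Tuple[str, int]]:
    """Return (video_id, distance) pairs within max_distance, closest first.

    One pass over the known hashes, so a single lookup is O(n).
    """
    matches = [
        (video_id, hamming_distance(candidate, known_hash))
        for video_id, known_hash in known.items()
    ]
    return sorted(
        (m for m in matches if m[1] <= max_distance),
        key=lambda m: m[1],
    )
```

For example, `find_similar(0xFF, {"a": 0xFF, "b": 0xFE, "c": 0x00}, max_distance=2)` keeps "a" (distance 0) and "b" (distance 1) and drops "c" (distance 8).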
Hello, I think that PeerTube should have something to remove duplicate videos. Maybe not something as sophisticated as checking the hash, but something much simpler. My idea is that a user who has one or more channels should have a button to search for duplicates among their own videos and see them in a list. From there, the user can delete the duplicates according to their own criteria.
Suppose that, in our eagerness to import videos from YouTube before it disappears, we have imported the same videos several times without realizing it. A button to search for duplicates among my own videos, based on the title, with the option of keeping only one video and automatically deleting the others, would be very useful.
I don't think this would be very heavy in terms of resource consumption, and it would help us keep our videos more organized.
Cheers MAX
This could be even simpler. A feature that compares video metadata and shows videos where the metadata (e.g. title and length) is the same would be very helpful for tidying up duplicates resulting from YouTube imports.
This would be of limited use, though, since videos traveling around the internet often get re-encoded and renamed over and over as they are reuploaded. Still, it would be better than having no deduplication at all.
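The metadata comparison suggested above could look roughly like this. It is only a sketch: the `Video` record and its fields are hypothetical and do not reflect PeerTube's actual data model, and treating titles as equal after lowercasing and collapsing whitespace is an assumption of the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Video:
    # Hypothetical record, not PeerTube's real schema.
    id: str
    title: str
    duration_seconds: int


def group_metadata_duplicates(videos: List[Video]) -> List[List[Video]]:
    """Group videos that share a normalized title and the same duration."""
    groups: Dict[Tuple[str, int], List[Video]] = defaultdict(list)
    for video in videos:
        # Lowercase and collapse whitespace so trivial differences don't matter.
        normalized_title = " ".join(video.title.casefold().split())
        groups[(normalized_title, video.duration_seconds)].append(video)
    # Only groups with more than one video are potential duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

As noted, a renamed or re-cut reupload would slip through this check, so the result is best shown to the user as a list of candidates rather than deleted automatically.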
To my knowledge PeerTube does not yet have any features that allow for detection and potential removal of duplicate videos. I think having those would be very useful.
How to detect duplicates:
generate checksum for each video upon upload and put it into database
for already existing videos: have a background service compare checksums to find duplicates
for new videos: when a user uploads a video, compare it to the other checksums (potentially prevent upload?)
video report option to let channel owner and/or instance moderation team know that a video is a duplicate of another
moderation tool that lets moderators mark a video as a duplicate
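The checksum steps above can be sketched as follows. This is an illustration only: `ChecksumIndex` is a hypothetical in-memory stand-in for a database column, SHA-256 is an assumed choice of hash, and the check would have to run on the uploaded source file before any re-encoding, since it only matches byte-identical files.

```python
import hashlib
from typing import BinaryIO, Dict, Optional


def file_checksum(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


class ChecksumIndex:
    """Hypothetical in-memory stand-in for a checksum column in a database."""

    def __init__(self) -> None:
        self._by_checksum: Dict[str, str] = {}

    def register(self, video_id: str, stream: BinaryIO) -> Optional[str]:
        """Record an upload; return the id of an earlier identical file, if any."""
        checksum = file_checksum(stream)
        existing = self._by_checksum.get(checksum)
        if existing is None:
            # First time this exact file is seen: remember who uploaded it.
            self._by_checksum[checksum] = video_id
        return existing
```

A caller could reject the upload, or just flag it for moderators, whenever `register` returns an existing id.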
How to deal with a duplication issue:
(Should be decided by moderators on a case by case basis or also allow automated actions?)
delete any but the oldest duplicate
delete any but the one with the highest quality
delete any but the most popular one (doesn't seem very fair though)
if a user can prove rightful ownership of a video that someone else uploaded before them, moderators should have the option to move the video to the rightful owner while preserving comments, upvotes, etc., or possibly to merge the comments and likes together.
I think it would also be useful for the moderation team and users to be able to mark videos as duplicates of other videos (and ideally specify the original video), even if the instance does not have a policy of deduplicating. A federation option could then be offered to skip videos marked as duplicates, so that instances that take deduplication seriously and instances that don't could coexist without issue.
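For the automated variants of the options above, a rough sketch of selectable "keep" policies might look like this. The `Upload` record, its fields, and the policy names are hypothetical, and using vertical resolution as a proxy for quality is an assumption of the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Upload:
    # Hypothetical record, not PeerTube's real schema.
    id: str
    published_at: datetime
    height: int  # vertical resolution in pixels, used here as a quality proxy
    views: int


def pick_keeper(duplicates: List[Upload], policy: str) -> Upload:
    """Choose which duplicate to keep under the given policy."""
    if policy == "oldest":
        return min(duplicates, key=lambda u: u.published_at)
    if policy == "highest_quality":
        return max(duplicates, key=lambda u: u.height)
    if policy == "most_popular":
        return max(duplicates, key=lambda u: u.views)
    raise ValueError(f"unknown policy: {policy}")


def split_keep_and_remove(
    duplicates: List[Upload], policy: str
) -> Tuple[Upload, List[Upload]]:
    """Return the upload to keep and the list of uploads to remove."""
    keeper = pick_keeper(duplicates, policy)
    return keeper, [u for u in duplicates if u.id != keeper.id]
```

Returning the removal list instead of deleting anything leaves the final decision with a moderator, which fits the case-by-case approach.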
Why is this important for PeerTube:
duplicates split the amount of attention a video gets (or would get)
duplicates make it harder to find what you're looking for
duplicates put more stress on PeerTube instances; the WebTorrent functionality is only useful when many users are watching the same video, and with many duplicates around, the chance of reaching the threshold at which WebTorrent becomes useful gets lower and lower.
duplicates put more stress on the PeerTube federation; duplicates get needlessly mirrored around between servers that might already have multiples of the same video
Any thoughts on this?