bibanon / tubeup

Use yt-dlp to download video and upload to the Internet Archive with metadata.
https://pypi.python.org/pypi/tubeup/
GNU General Public License v3.0

Sanitize User Data from YTDL-generated JSON Metadata File #119

Closed hunter0002 closed 4 years ago

hunter0002 commented 4 years ago

Currently, the youtube-dl JSON uploaded to archive.org includes various metadata, including the full directory name of the video file and googlevideo.com source URLs. With default settings this leaks the user's home directory name and his/her OS, and often his/her IP address as well. tubeup also does not inform the user that the data will be stored publicly on archive.org.

Either:

vxbinaca commented 4 years ago

Also your issue isn't correct; this isn't our problem but youtube-dl's. We don't actually create the JSON metadata, that program does.

vxbinaca commented 4 years ago

@brandongalbraith thoughts?

hunter0002 commented 4 years ago

Also your issue isn't correct

thanks, I've corrected the issue description

vxbinaca commented 4 years ago

My testing with youtube-dl shows JSON is an all-or-nothing affair. We need the JSON to preserve important metadata, both for preservation and for the item's creation. You can't just tell youtube-dl not to collect the filename. You need to take this to them; it's not our issue. I don't want to get into the habit of post-processing metadata.

Closing, but I'm going to add a warning on the README.

vxbinaca commented 4 years ago

This may have to do with how we point the file to a certain directory. It's not reproducing the full directory when I separately generate JSON with youtube-dl.

@jjjake Would you suggest handling this on our end (somehow), or y'all stripping directory information as a part of the deriving process?

hunter0002 commented 4 years ago

The JSON generated by tubeup's youtube-dl download does also seem to expose some other identifying information, for example the URLs contain the user's IP address. This doesn't seem to happen for all requests.

"formats": [{"format_id": "249", "url": "[...]googlevideo.com[...]ip=xxx.xxx.xxx.xxx

vxbinaca commented 4 years ago

The tubeup JSON does also seem to expose some other identifying information, for example the URLs contain the user's IP address, but I couldn't reproduce this with youtube-dl. Not sure why

That's not our JSON. Also, please start properly indenting examples so we can see what you're talking about.

hunter0002 commented 4 years ago

I've tried to make the wording clearer. I'm not sure if it's related to the ip key/value being part of the URL only for certain googlevideo.com requests.

vxbinaca commented 4 years ago

The JSON generated by tubeup's youtube-dl download does also seem to expose some other identifying information, for example the URLs contain the user's IP address, but I couldn't reproduce this with regular youtube-dl. Not sure why

"formats": [{"format_id": "249", "url": "[...]googlevideo.com[...]ip=xxx.xxx.xxx.xxx

That looks like the IP address resolved from YouTube, not yours. So not a major problem. The directory thing is a minor problem that can be fixed either in code (send a pull request) or on IA's end.

Hiding that IP makes it so I can't WHOIS to check for sure.

hunter0002 commented 4 years ago

I can confirm it's my IP address, I tested with both searching DuckDuckGo for "ip" and running dig @resolver1.opendns.com ANY myip.opendns.com +short.
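A check like the one described above can also be run against the JSON file itself. The following is a minimal sketch, not part of tubeup or youtube-dl: `find_leaks` is a hypothetical helper, and the IP address and home directory passed to it are placeholder values the user would substitute with their own.

```python
def find_leaks(node, needles, path="$"):
    """Yield (json_path, needle) for every string value that contains a needle.

    `node` is the parsed info JSON (e.g. the result of json.load);
    `needles` are strings the user considers sensitive, such as their
    public IP address or home directory (both supplied by the caller).
    """
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_leaks(value, needles, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_leaks(value, needles, f"{path}[{index}]")
    elif isinstance(node, str):
        for needle in needles:
            if needle in node:
                yield path, needle


# Placeholder data shaped like the excerpt quoted in this thread.
info = {
    "_filename": "/home/alice/videos/example.webm",
    "formats": [
        {
            "format_id": "249",
            "url": "https://example.googlevideo.com/videoplayback?ip=203.0.113.7&id=abc",
        }
    ],
}

for json_path, needle in find_leaks(info, ["203.0.113.7", "/home/alice"]):
    print(json_path, "contains", needle)
```

Reporting the JSON path rather than only a yes/no answer shows which keys would need sanitizing.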

vxbinaca commented 4 years ago

I concur, youtube-dl prints the public IP address and user directory of the video file. The directory thing is not a huge issue, but it is an issue. The IP, that's a larger problem.

Submit a pull request with a fix and I'll test.

hunter0002 commented 4 years ago

I can't write in Python so unfortunately I'll have to leave a PR to someone else.

vxbinaca commented 4 years ago

Neither can I. This goes beyond my skillset of minor tweaks.

brandongalbraith commented 4 years ago

Sorry I'm late, was in the mountains for a bit. Investigating sanitizing the JSON meta of anything that could be considered sensitive.

@vxbinaca Anyone chatting with IA patron services yet about this? The number of JSON files out there from tubeup is significant, and that personal data leaked in the metadata can't be left hanging out there forever

brandongalbraith commented 4 years ago

@vxbinaca @hunter0002 Dropped some context in https://github.com/ytdl-org/youtube-dl/issues/25576, asked if the youtube-dl folks are willing to sanitize the data (which is preferable versus future metadata spot checks and sanitization updates on our end). If not, should be trivial for us to regex out the IP addresses from the format links and drop the _filename k/v entirely.
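As a rough illustration of that fallback, here is a minimal sketch, assuming the IP appears as an `ip=` query parameter in each entry of `formats` (as in the excerpt quoted earlier in the thread). `sanitize_info` and the `REDACTED` placeholder are hypothetical names, not tubeup code, and other URL-bearing keys are not covered.

```python
import copy
import re

# Matches the value of an "ip=" query parameter. The value is matched
# loosely (everything up to the next "&"), so the pattern does not
# depend on the address format.
IP_PARAM_VALUE = re.compile(r"(?<=[?&]ip=)[^&]*")


def sanitize_info(info):
    """Return a sanitized copy of a youtube-dl info dict.

    Drops the "_filename" key and redacts the "ip=" parameter in each
    format URL. The input dict is left unmodified.
    """
    clean = copy.deepcopy(info)
    clean.pop("_filename", None)
    for fmt in clean.get("formats", []):
        url = fmt.get("url")
        if isinstance(url, str):
            fmt["url"] = IP_PARAM_VALUE.sub("REDACTED", url)
    return clean


# Placeholder data; the address is from a documentation-only range.
example = {
    "_filename": "/home/alice/videos/example.webm",
    "formats": [
        {
            "format_id": "249",
            "url": "https://example.googlevideo.com/videoplayback?expire=1&ip=203.0.113.7&id=abc",
        }
    ],
}
print(sanitize_info(example)["formats"][0]["url"])
```

Redacting the parameter means the stored URL no longer matches what the server issued, so it should not be expected to keep working.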

vxbinaca commented 4 years ago

Sorry I'm late, was in the mountains for a bit. Investigating sanitizing the JSON meta of anything that could be considered sensitive.

@vxbinaca Anyone chatting with IA patron services yet about this? The number of JSON files out there from tubeup is significant, and that personal data leaked in the metadata can't be left hanging out there forever

You have to sanitize v6 IPs too, which is more complex. Doing the paths is easier.
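One way to sidestep writing separate IPv4 and IPv6 patterns is to parse the URL and let the standard library decide whether a value is an address. This is only a sketch under that assumption: `redact_ips_in_url` is a hypothetical helper, it only inspects query parameters, and re-encoding the query string may change the percent-encoding of other values.

```python
import ipaddress
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def is_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def redact_ips_in_url(url, placeholder="REDACTED"):
    """Replace every query-parameter value that is an IP address."""
    parts = urlsplit(url)
    # parse_qsl percent-decodes values, so "2001%3Adb8%3A%3A1" is seen
    # as "2001:db8::1" before the address check.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    cleaned = [(key, placeholder if is_ip(value) else value) for key, value in pairs]
    return urlunsplit(parts._replace(query=urlencode(cleaned)))


# Placeholder URLs using documentation-only address ranges.
print(redact_ips_in_url("https://example.googlevideo.com/videoplayback?expire=1&ip=2001%3Adb8%3A%3A1&id=abc"))
print(redact_ips_in_url("https://example.googlevideo.com/videoplayback?expire=1&ip=203.0.113.7&id=abc"))
```

Checking values instead of parameter names also catches an address stored under a key other than `ip`, at the cost of rewriting the query string.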

hunter0002 commented 4 years ago

I've filed a new issue (25681) since 25576 was closed, can't be reopened and is apparently being ignored

hunter0002 commented 4 years ago

The new issue has been closed, which helpfully answers the open questions:

So the former will have to be changed on the tubeup end, and for the latter it would have to be decided whether or not the working URLs are worth keeping? Generating working URLs which don't contain the IP address might require e.g. sending all requests through Tor or another open proxy by default.

vxbinaca commented 4 years ago

@hunter0002 I have some bad news for you: They need that information.

It's looking like right now, if you use Tubeup, you need to live with the reality that your public-facing IP and a path to a file will be in the metadata. Unless IA processes it out, youtube-dl won't fix it, so stop making issues with youtube-dl. They will not fix it.

It's either gonna be:

1) Fixed on our end via post-processing
2) Done as a part of a process on IA's end in the deriving process
3) You live with that information being in metadata, which by the way, if you were using a VPS like the README recommends, wouldn't be a serious issue - especially since you'd be connecting to the box SSH keylessly with good generated keys, right?

I know I deserve a Pwnie Award for saying that. I don't care. You might not get this issue resolved is what I'm saying.

Edit: We, I, set path because I don't want morons who use this script to dump 100 gigabytes of video and metadata into the CWD. We also need a fixed directory for the archive file for the previously ripped videos.

hunter0002 commented 4 years ago

I made the issue because Brandon's comment was apparently being ignored. Since the new issue was responded to very quickly we can move on from that, and I'm not going to make any further issues or comments in the tubeup repository.

I don't think it's reasonable for a developer of a command line program to expect every single user to set up a VPS even if it's best practice to do so. Especially considering that tubeup is still the easiest way to get a YouTube video to display in the Wayback Machine, I would expect the majority of tubeup users to inevitably have only uploaded a few videos each, with a very small minority of power users having uploaded the majority of videos (given that this sort of usage curve exists for most other software/projects for which data is available). And if you're only going to upload five videos, why the hell would you bother with a more complex setup like that? It might well take longer to set up the VPS than for the videos to be downloaded and then uploaded to IA.

vxbinaca commented 4 years ago

I don't think it's reasonable for a developer of a command line program to expect every single user to set up a VPS

You could use a VPN too. Doesn't fix the file path issue but it's a start. Tor is slower than old people getting off a bus.

Either someone will come up with a way to post-process metadata or users will live with it. I'm simply too busy to learn Python to fix this. Edit: This is on top of the fact that Tubeup works with a destination website, and a program with hundreds of possible source websites. This is a complex situation to deal with.

Closing, and I'm going to add a disclaimer to the front of the page that your example users never read anyway.