drzraf commented 1 year ago

Long needed, long requested feature to stop leaking contributors/uploaders IP addresses.

145, #146, #192

Here finally implemented by carefully parsing the URL as suggested previously.

Includes unit tests
Only replace the JSON file on-disk if actually necessary

brandongalbraith commented 1 year ago

Not all heroes wear capes. Will review tomorrow and if everything looks good, we'll ship at as soon as possible. Thanks for putting the time into this.

vxbinaca commented 1 year ago

flake8 error.

Do you have example test upload from this branch?

edit:

Do we need to lock that dep version? Can that be lifted?

vxbinaca commented 1 year ago

I have questions @drzraf

Do we need the urllib dep locked to that version?
Can you show me a example upload where the IP address is redacted?
Can you show this doesn't trigger false positives?

When you answer these three questions I'll merge.

drzraf commented 1 year ago

* Do we need the urllib dep locked to that version?

No, I just reverted a previous commit for sake of commit history/revision tracking/reviewing process. I can remove the version lock if you prefer so.

* Can you show me a example upload where the IP address is redacted?

It's not clear to me what do you want and why. A modified file? It easy to generate one:

#!/usr/bin/python3
import json, sys
from tubeup.utils import strip_ip_from_meta

with open(sys.argv[1], 'r') as f:
    _, new_meta = strip_ip_from_meta(json.load(f))
    print(json.dumps(new_meta))
# Run with test.py tests/test_tubeup_rootdir/downloads/Mountain_3_-_Video_Background_HD_1080p-6iRV8liah8A.info.json

I'm attaching a modified file if that's what you meant: a.json.gz

* Can you show this doesn't trigger false positives?

It's safe because:

Transformations only apply to 4 specific fields, all of them are URLs
For the query-string: URL are parsed with a dedicated standard library rather than regexp, no error is possible
For the path segment, it's processed separately from the rest of the URL limiting the scope
I assume path segment processing regexp to be safe unless told otherwise: r'/ip/[^/]+', r'/ip/REDACTED'
- /ip/ is very specific (and I've empirically looked for it among hundreds of info files) and following the observed structure of Youtube URL, is always followed by a (non-empty) IP address.
- The non-greedy segment match is also safe
TBH I've not looked at all the 1068 possible websites supported by youtube-dl to check if any of them would return an URL containing /ip/ followed by something which is not an IP address (or using ip= parameter for some other purpose), nor have I resources and will to do so. The underlying idea is to fix an actual grave concern more than overthink an improbable rare inoffensive future one.

Let's keep in mind that, overall, these files contains metadata related to the downloaded video. Some are very useful, other are garbage. In the case of these URL, then often contain expiring tokens/query parameters making them unreachable after a specific amount of time: they are in no way stable or canonical resources/identifier (some suggested to even drop them entirely). BTW, Google could still identify uploaders based on the temporary tokens parameters which this PR doesn't currently redact.

tl;dr : These are mostly unimportant metadata but reasonable efforts have been provided to ensure safely obfuscation so that no side-effect is expected on Youtube and there is very very low probability a significant side-effect happened to the URLs from another provider.

vxbinaca commented 1 year ago

BTW, Google could still identify uploaders based on the temporary tokens parameters which this PR doesn't currently redact.

Useless because most people upload from Linux where they don't normally log in, or use a separate cookie file.

Edit:

Thanks for this I'll cut a new version later.

vxbinaca commented 1 year ago

It's not clear to me what do you want and why.

I was asking for a link to a example upload where you tested this PR. Beyond just the test. I wanted to inspect the results of what it does.

vxbinaca commented 1 year ago

Your PR broke Tubeup. Expect a hard reset on the repo, and extreme scrutiny and testing standards for your PRs in the future.

drzraf commented 1 year ago

Would you mind explaining what did it broke? NB: I have diff-ed info json files before/after processing, and visually compared them in order to ensure adequate replacements.

vxbinaca commented 1 year ago

Would you mind explaining what did it broke? NB: I have diff-ed info json files before/after processing, and visually compared them in order to ensure adequate replacements.

This is a lot of effort to not show me a URL.

That PR put me through so much work, I tried to lift then re-added the urllib3 version lock, pushed new versions to pypi. I had to yank 3 releases due to that PR then hard reset Github I couldn't revert your PR due to me doing a bunch of other work after it, which is now all gone.

Personally I'm not seeing a security issue from the IP in a VOD URL. Use a VPS.

drzraf commented 1 year ago

show me a URL, while syntactically correct, makes no sense. URL shown: "https://google.com". Could you please be more explicit?
You didn't, by any mean, explained what was wrong with the current PR. What did it actually broke?
You explained why doing a force-push is painful. (Well, nothing stopped you from just reverting, but reasons are yours)
Leaking IP addresses always been and will always be a problem (That's the reason we haven't yet encountered someone disclosing it on his GitHub profile)

My main question remains : what's wrong with the PR? By answering, it could be improved, tested, and finally merged.

I obviously assume good faith. I also assume only technical implementations details are at stake here (unlike in #311). If, for any reason, you wished submitter's IP addresses to end up published on archive.org, then it'd be definitely preferible to state it formally so that the discussion could be elegantly and quickly closed.

Otherwise, I suggest to reopen the PR and offer details about the problems it may have cause, so that I revamp and improve it.

thx

mrpapersonic commented 1 year ago

show me a URL, while syntactically correct, makes no sense. URL shown: "https://google.com". Could you please be more explicit?

He just wants the URL of a successful upload with this patch. I'd assume your code is right, but only works for YouTube and breaks other sites.

vxbinaca commented 1 year ago

No it broke everything.

vxbinaca commented 1 year ago

@mrpapersonic If you want this PR in, I want it verified it works. So since the author cannot just give me an example URL, if you want this in make sure it works first. Because me and #313 saw how it was broken

vxbinaca commented 1 year ago

@drzraf

Hi, you haven't enabled the ability to make issues on your fork so I'm doing this here.

Every time a item is uploaded to Archive.org the "Scanner" tag on the item lists the script name and version:

Scanner:    TubeUp Video Stream Mirroring Application 2023.08.19

BinAnon is requesting you change this, you can make it "tubeup-drzraf" or whatever you want, but not our Scanner ID. Your fork divirges from ours in function significantly, specifically with open yt-dlp flags that allow for non-standard output and we don't want staff or future historians confusing the two. This is for automation of fixing items by IA staff.

Additionally you list a link to our PyPi repo. Please also change that if you're going to fork.

Please change this ASAP.

Thanks.

bibanon / tubeup

Dont leak ip address #310

145, #146, #192