StevenBlack / hosts

πŸ”’ Consolidating and extending hosts files from several well-curated sources. Optionally pick extensions for porn, social media, and other categories.
MIT License
25.71k stars 2.15k forks source link

Possible fix of the encoding and/or downlaod issue(s) #520

Closed funilrys closed 6 years ago

funilrys commented 6 years ago

Hello, to all member of this community!

After a long time of commenting, watching, reading and learning from issues posted here it's time for me to propose the following patches/commits.

It's been a long time since I wanted to search a possible solution to this recurrent issue and here's my proposition.

Basically, since I consider that many encoding issues are because upstream maintainer does not convert/encode the domains which rely on IDN homograph attack, I decided to read all line and convert/encode all domain using str.encode('IDNA') (I do not change original structure).

Because my main objective was to fix @notDavid's protocol which is the only one I was able to reproduce, I introduced BeautifulSoup() to do the decoding of the downloaded data before we read each line and encode/decode.

About the tests

Before anything, sorry for adding the fix tests issues commits but I wanted to fix those before submitting my PR. You can find the tests I used as the reference to committing those change at https://travis-ci.org/funilrys/hosts.

Please note that I have to submit an issue to upstream but if you see the following or related when runing the tests under Python >= 3 it's not a local issue.

/home/travis/miniconda3/envs/hosts/lib/python3.5/site-packages/bs4/builder/_lxml.py:250: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead

About requirements*.txt

This submission also introduces requirements.txt files which I generally use to install dependencies with pip install -r requirements.txt (I'm almost everytime under virtualenv).

As I did not get an answer (cf) I took the initiative to includes them anyway.

Please note that the requirements*.txt files were generated by pipreqs.

So to quote my commit message

Please note that this patch also introduces domain_to_idna() which is in charge of converting a domain in a line into IDNA and/or UTF-8 format.

Also, note the introduction of BeautifulSoup() which helps us to decode data from the downloaded URL.

Fixes (issue(s)/protocol(s) I was able to reproduce):

Possible fix of (issue(s)/protocol(s) I wasn't able to reproduce):

Footnote(s)

Sorry for the typo or bad English structure in my commits, code or this description but English is still a language I learn.

So I'm open to any pieces of advice and others around my English, coding style or other :smile_cat:

Cheers, Nissar.

welcome[bot] commented 6 years ago

Thank you for submitting this pull request! We’ll get back to you as soon as we can!

StevenBlack commented 6 years ago

@funilrys I'm getting crashes when I issue

pip install -r requirements.txt

and

pip install -r requirements_python2.txt

In the first case:

$ pip install -r requirements.txt
Collecting lxml==4.1.1 (from -r requirements.txt (line 1))
  Downloading lxml-4.1.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
    100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8.7MB 196kB/s
Collecting beautifulsoup4==4.6.0 (from -r requirements.txt (line 2))
  Downloading beautifulsoup4-4.6.0-py2-none-any.whl (86kB)
    100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 92kB 7.9MB/s
Collecting mock==2.0.0 (from -r requirements.txt (line 3))
  Downloading mock-2.0.0-py2.py3-none-any.whl (56kB)
    100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 61kB 6.8MB/s
Collecting pbr>=0.11 (from mock==2.0.0->-r requirements.txt (line 3))
  Downloading pbr-3.1.1-py2.py3-none-any.whl (99kB)
    100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 102kB 10.6MB/s
Collecting funcsigs>=1; python_version < "3.3" (from mock==2.0.0->-r requirements.txt (line 3))
  Downloading funcsigs-1.0.2-py2.py3-none-any.whl
Collecting six>=1.9 (from mock==2.0.0->-r requirements.txt (line 3))
  Downloading six-1.11.0-py2.py3-none-any.whl
Installing collected packages: lxml, beautifulsoup4, pbr, funcsigs, six, mock
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 851, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 1064, in move_wheel_files
    isolated=self.isolated,
  File "/Library/Python/2.7/site-packages/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/Library/Python/2.7/site-packages/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/Library/Python/2.7/site-packages/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/lxml-4.1.1.dist-info'

In the second case:

$ pip install -r requirements_python2.txt
Collecting mock==2.0.0 (from -r requirements_python2.txt (line 1))
  Using cached mock-2.0.0-py2.py3-none-any.whl
Collecting lxml==4.1.1 (from -r requirements_python2.txt (line 2))
  Using cached lxml-4.1.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting beautifulsoup4==4.6.0 (from -r requirements_python2.txt (line 3))
  Using cached beautifulsoup4-4.6.0-py2-none-any.whl
Collecting pbr>=0.11 (from mock==2.0.0->-r requirements_python2.txt (line 1))
  Using cached pbr-3.1.1-py2.py3-none-any.whl
Collecting funcsigs>=1; python_version < "3.3" (from mock==2.0.0->-r requirements_python2.txt (line 1))
  Using cached funcsigs-1.0.2-py2.py3-none-any.whl
Collecting six>=1.9 (from mock==2.0.0->-r requirements_python2.txt (line 1))
  Using cached six-1.11.0-py2.py3-none-any.whl
Installing collected packages: pbr, funcsigs, six, mock, lxml, beautifulsoup4
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 851, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 1064, in move_wheel_files
    isolated=self.isolated,
  File "/Library/Python/2.7/site-packages/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/Library/Python/2.7/site-packages/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/Library/Python/2.7/site-packages/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/pbr-3.1.1.dist-info'

Take note that sudo does not help in either case.

StevenBlack commented 6 years ago

@funilrys some more feedback: Since pip install -r requirements... is a prerequisite, that should be documented in the readme_template.md file in the Generate your own unified hosts file section.

FadeMind commented 6 years ago

Working like a charm. @funilrys Thanks!!

funilrys commented 6 years ago

Hello Steven @StevenBlack,

Before anything please avoid sudo pip as it may damage the python package(s) infrastructure of your system.

I thought that we all assume that virtualenv is always a good practice as you use a virtual environment (conda environment) to launch the tests.

So to fix such issue we have 3 possibilities:

Install dependencies from your package manager

For ubuntu

$ apt-get install python-lxml python3-lxml python-bs4 python3-bs4

For Arch Linux (the only one I use)

$ pacman -Syyu python-lxml python2-lxml python-beautifulsoup4 python2-beautifulsoup4

Then you simply run

$ python updateHostsFile.py -a

Install with pip along with --user (without virtualenv)

This method is always preferred over sudo pip for the reason I mentioned before. To quote pip documentation :

--user: Install to the Python user install directory for your platform. Typically ~/.local/, or %APPDATA%Python on Windows. (See the Python documentation for site.USER_BASE for full details.)

$ pip3 install -r requirements.txt --user
$ pip2 install -r requirements_python2.txt --user
$ python updateHostsFile.py -a

Use virtualenv

To quote Kenneth Reitz (Python Overlord at Heroku and Director at the PSF):

virtualenv is a tool to create isolated Python environments. virtualenv creates a folder which contains all the necessary executables to use the packages that a Python project would need.

Here's the procedure:

$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
$ python updateHostsFile.py -a
$ deactivate # to exit virtualenv

Now before I or we edit the readme_template.md you have to make a choice ... which one do we recommend to users?

rautamiekka commented 6 years ago

Before anything please avoid sudo pip as it may damage the python package(s) infrastructure of your system.

Not much of a choice when you need newer packages.

funilrys commented 6 years ago

@rautamiekka sorry but no ... I proposed 3 alternatives to sudo pip ...

funilrys commented 6 years ago

Unless you are as root into a server or a container you should absolutely avoid such things.. Unless you like to break your system to learn like me sometimes...

A recent example of why you should not run sudo combined with command you think you can trust is this and its detail here.

StevenBlack commented 6 years ago

@funilrys we need a solution that works on MacOS and the recent versions of Windows too.

funilrys commented 6 years ago

We then have those two solutions:

gfyoung commented 6 years ago

@funilrys :

1) sudo pip is not that too scary IMO, so @StevenBlack , I think that's acceptable.

2) Add unittests for your added function.

funilrys commented 6 years ago

@gfyoung sudo npm was not too scary too ...

Will add unit tests once I'm back home.

gfyoung commented 6 years ago

sudo npm was not too scary too ...

It isn't when you can remove things from your package.json πŸ˜„

notDavid commented 6 years ago

Hi there, this works for me too on macOS πŸ‘

One minor issue though, i was curious what hostnames were causing the problem, and i notice that with https://github.com/StevenBlack/hosts/pull/520 line 147886 in https://hosts-file.net/psh.txt :

www.hualaΓ±e.cl is being downloaded/converted to www.xn--hualae-0wa.cl

so basically this host is not being blocked... (don't shoot the messenger... πŸ€·πŸΌβ€β™‚οΈ 😁)

Probably obvious, but you can reproduce this by doing this, + apply this pull request, and then in step 8 instead of mv psh.txt hosts do cp psh.txt hosts. Then after running step9, compare file psh.txt with hosts.

funilrys commented 6 years ago

@notDavid www.xn--hualae-0wa.cl == www.hualaΓ±e.cl we are talking about the same domain...

Please read more about Punycode. You can also verify with Punycoder.

notDavid commented 6 years ago

@funilrys Ahhhhhh i see!

funilrys commented 6 years ago

@gfyoung Please find now the new version of the new function and/or the tests along with it :+1: @StevenBlack @gfyoung So what should we write into the readme_template.md? :thinking:

StevenBlack commented 6 years ago

@funilrys yes please. Since pip install -r requirements... is a prerequisite, that should be documented in the readme_template.md file in the Generate your own unified hosts file section.

StevenBlack commented 6 years ago

Nissar @funilrys following the instructions in the latest readme_template, I get the same errors on all my Macs.

So this leads to another request, please.

ATM I feel like we should just recommend pip install --user -r requirements... to be safe.

StevenBlack commented 6 years ago

Nissar @funilrys another minor detail. Lower in the Generate your own unified hosts file section we give instructions for Python3, then Python2.7. Can you reorder the pip install... stuff to please lead with Python3?

StevenBlack commented 6 years ago

Nissar @funilrys I think I've found an issue.

Background: I use both Python3 and Python2.7.

The first thing I did was: pip install --user -r requirements_python2.txt

Result: \o/ yay!

Then I did: pip install --user -r requirements.txt

Result: πŸ€”

Requirement already satisfied: lxml==4.1.1 in /Users/steve/Library/Python/2.7/lib/python/site-packages (from -r requirements.txt (line 1))
Requirement already satisfied: beautifulsoup4==4.6.0 in /Users/steve/Library/Python/2.7/lib/python/site-packages (from -r requirements.txt (line 2))
Requirement already satisfied: mock==2.0.0 in /Users/steve/Library/Python/2.7/lib/python/site-packages (from -r requirements.txt (line 3))
Requirement already satisfied: pbr>=0.11 in /Library/Python/2.7/site-packages (from mock==2.0.0->-r requirements.txt (line 3))
Requirement already satisfied: funcsigs>=1; python_version < "3.3" in /Library/Python/2.7/site-packages (from mock==2.0.0->-r requirements.txt (line 3))
Requirement already satisfied: six>=1.9 in /Users/steve/Library/Python/2.7/lib/python/site-packages (from mock==2.0.0->-r requirements.txt (line 3))

Then when I do: python updateHostsFile.py

Result: \o/ yay!

But now when I do: python3 updateHostsFile.py

Result, in total:

Traceback (most recent call last):
  File "updateHostsFile.py", line 27, in <module>
    import lxml  # noqa: F401
ModuleNotFoundError: No module named 'lxml'

... and nothing generates.

Is there a problem with lxmx such that we need to import differently for Python3 and Python2.7?

StevenBlack commented 6 years ago

Nissar @funilrys ... or do I need to invoke pip differently for Python3, so the packages don't go into /Users/steve/Library/Python/2.7/lib/... which appears to be the problem here.

If so, then that needs to be documented too.

Edit: See https://stackoverflow.com/questions/11268501/how-to-use-pip-with-python-3-x-alongside-python-2-x

rautamiekka commented 6 years ago

^ Yeah, pip != pip3 ...

StevenBlack commented 6 years ago

So Nissar @funilrys I think the docs should say pip3 install ... and this, simply this, will steer-away most of the potential support load arising from pip.

StevenBlack commented 6 years ago

Nissar @funilrys this is an excellent PR. Just those doc changes and we'll merge!

funilrys commented 6 years ago

Steven @StevenBlack What I see here is another problem! Some OS have python2 as python or python3 as python so as pip2 and pip3 exists I'll put them instead of pip as pip2 as you may have suggested :)

funilrys commented 6 years ago

@StevenBlack Can you confirm the existence of pip2 on macOS? Because I can't consider what I said previously (as an Arch Linux user) as granted for every OS...

StevenBlack commented 6 years ago

Nissar @funilrys yup, it looks native. In my case, on the office iMac:

$ which pip2
/usr/local/bin/pip2

$ which pip
/usr/local/homebrew/bin/pip

$ which pip3
/usr/local/homebrew/bin/pip3
funilrys commented 6 years ago

Okay Steven @StevenBlack then I'm going to edit readme_template.md as follow:

Note if you are using Python 3, please install the dependencies with:

  pip3 install --user -r requirements.txt

Note if you are using Python 2, please install the dependencies with:

  pip2 install --user -r requirements_python2.txt

I do prefer to recommend with --user and let the user know more about that --user with the following as note (please correct my grammar or complete sentence if needed):

As we prefer not to break your system, we recommend the usage of --user which install the required dependencies at the user level. More information about it can be found on pip documentation.

FadeMind commented 6 years ago

@StevenBlack Manjaro Linux:

[fademind@manjaro ~]$ which pip
/usr/bin/pip
[fademind@manjaro ~]$ which pip2
/usr/bin/pip2
[fademind@manjaro ~]$ which pip3
/usr/bin/pip3
[fademind@manjaro ~]$ export LANG=C;pacman -Qo /usr/bin/pip /usr/bin/pip2 /usr/bin/pip3
/usr/bin/pip is owned by python-pip 9.0.1-3
/usr/bin/pip2 is owned by python2-pip 9.0.1-3
/usr/bin/pip3 is owned by python-pip 9.0.1-3

Aaand packages:

[fademind@manjaro ~]$ pacman -Q|grep python
python 3.6.4-2
python-appdirs 1.4.3-1
python-beautifulsoup4 4.6.0-1
python-cairo 1.16.2-1
python-chardet 3.0.4-1
python-dbus 1.2.6-1
python-dbus-common 1.2.6-1
python-docopt 0.6.2-4
python-gobject 3.26.1-1
python-idna 2.6-1
python-jade-application-kit 0.a34-1
python-jinja 2.10-1
python-keyutils 0.5-1
python-lxml 4.1.1-1
python-lxml-docs 4.1.1-1
python-markupsafe 1.0-1
python-mock 2.0.0-2
python-npyscreen 4.10.5-1
python-olefile 0.45.1-1
python-packaging 16.8-2
python-pbr 3.1.1-1
python-pillow 5.0.0-1
python-pip 9.0.1-3
python-pycups 1.9.73-3
python-pycurl 7.43.0.1-1
python-pygments 2.2.0-1
python-pyparsing 2.2.0-1
python-pyqt5 5.10-3
python-pysmbc 1.0.15.8-1
python-reportlab 3.4.0-1
python-requests 2.18.4-1
python-setuptools 1:38.5.1-1
python-sip 4.19.7-1
python-six 1.11.0-1
python-urllib3 1.22-1
python-yaml 3.12-3
python2 2.7.14-2
python2-appdirs 1.4.3-1
python2-beautifulsoup4 4.6.0-1
python2-cairo 1.16.2-1
python2-funcsigs 1.0.2-1
python2-gobject2 2.28.7-1
python2-lxml 4.1.1-1
python2-mock 2.0.0-2
python2-packaging 16.8-2
python2-pbr 3.1.1-1
python2-pip 9.0.1-3
python2-pyparsing 2.2.0-1
python2-setuptools 1:38.5.1-1
python2-six 1.11.0-1
StevenBlack commented 6 years ago

Nissar @funilrys perfect!

StevenBlack commented 6 years ago

@funilrys a small suggestion.

We recommend the --user flag which installs the required dependencies at the user level. More information about it can be found on pip documentation.

StevenBlack commented 6 years ago

Nissar @funilrys still missing one thing: To run unit tests, in the top level directory, just run:... should come after installing the requirements since latest tests will fail without the requirements.

funilrys commented 6 years ago

That's logical sorry ...

StevenBlack commented 6 years ago

Merging. Thanks so much Nissar @funilrys and everyone who provided input on this!

welcome[bot] commented 6 years ago

Congratulations on merging your first pull request! πŸŽ‰πŸŽ‰πŸŽ‰

ScriptTiger commented 6 years ago

Perhaps a note of some kind should also be added in a different PR for those installing the dependencies manually for whatever reason. Beautiful Soup (which goes to version 3 and only supports Python 2) is actually a different package from Beautiful Soup 4 (the bs4 packages for Python 2 and 3). It might be different depending on how the repositories manage them, but that at least seems to be the author's intent and what I have seen in the repos I use. So if they install the latest Beautiful Soup package and don't know why it's throwing an error when it can't find the bs4 module, that's why.