essandess / isp-data-pollution

ISP Data Pollution to Protect Private Browsing History with Obfuscation
MIT License

phantomjs instances don't seem to be closing #21

Closed — isaaclw closed this issue 7 years ago

isaaclw commented 7 years ago

This is a continuation from https://github.com/essandess/isp-data-pollution/issues/15#issuecomment-292563634

$ lsb_release -da
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.7 (jessie)
Release:        8.7
Codename:       jessie
$ apt-cache policy phantomjs
phantomjs:
  Installed: 2.1.1+dfsg-2~bpo8+1
  Candidate: 2.1.1+dfsg-2~bpo8+1
  Version table:
 *** 2.1.1+dfsg-2~bpo8+1 0
        100 http://ftp.us.debian.org/debian/ jessie-backports/main amd64 Packages
        100 /var/lib/dpkg/status
essandess commented 7 years ago

This is an issue for the phantomjs distro on jessie.

I suggest posting these issues on the phantomjs Issues page.

It would be good to resolve this because Raspberry Pi is the natural platform for a script like this. Please keep pushing this and let us know if you find the incantations to get successive phantomjs RPi processes to close.

essandess commented 7 years ago

I've refactored phantomjs process control and memory management. Please try the latest commit and see if it fixes the issue you're seeing.

I've verified that it runs faster with a better footprint on my box.

isaaclw commented 7 years ago

I don't really have a chance to look at it now, but after pip install psutil I'm now getting this error:

Traceback (most recent call last):
  File "isp_data_pollution.py", line 562, in <module>
    ISPDataPollution()
  File "isp_data_pollution.py", line 167, in __init__
    self.pollute_forever()
  File "isp_data_pollution.py", line 276, in pollute_forever
    self.clear_session()
  File "isp_data_pollution.py", line 216, in clear_session
    self.session.execute_script('window.localStorage.clear();')
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 484, in execute_script
    'args': converted_args})['value']
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/errorhandler.py", line 164, in check_response
    raise exception_class(value)
selenium.common.exceptions.WebDriverException: Message: Error - Unable to load Atom 'execute_script' from file ':/ghostdriver/./third_party/webdriver-atoms/execute_script.js'
essandess commented 7 years ago

I added process management to address the multiple phantomjs process issue as well as to achieve some CPU and memory efficiencies.

If necessary, we can fall back to launching one phantomjs per GET.

But let's make sure we're on the same page. What version of phantomjs and ghostdriver are in your stack? Here's mine:

python3 -c 'from selenium import webdriver; driver = webdriver.PhantomJS(); print("phantomjs version is {}, ghostdriver version is {}".format(driver.capabilities["version"],driver.capabilities["driverVersion"]))'
phantomjs version is 2.0.0, ghostdriver version is 1.2.0
isaaclw commented 7 years ago

I had to put it in the script, since I'm running on a headless computer and need the display set up.
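(For anyone else running on a headless box: one common way to provide an X display is an Xvfb virtual framebuffer. A minimal sketch, assuming Xvfb is installed; the display number :99 and resolution are arbitrary.)

```shell
# Start a virtual framebuffer on display :99
Xvfb :99 -screen 0 1280x1024x24 &

# Point the script at it and run
DISPLAY=:99 python isp_data_pollution.py

# Or let xvfb-run manage the server lifecycle:
# xvfb-run -a python isp_data_pollution.py
```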

My result was: phantomjs version is 2.1.1, ghostdriver version is 1.2.0

isaaclw commented 7 years ago

And I got the same crash (after pulling)

Traceback (most recent call last):
  File "isp_data_pollution.py", line 585, in <module>
    ISPDataPollution()
  File "isp_data_pollution.py", line 170, in __init__
    self.pollute_forever()
  File "isp_data_pollution.py", line 279, in pollute_forever
    self.clear_session()
  File "isp_data_pollution.py", line 219, in clear_session
    self.session.execute_script('window.localStorage.clear();')
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 484, in execute_script
    'args': converted_args})['value']
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/errorhandler.py", line 164, in check_response
    raise exception_class(value)
selenium.common.exceptions.WebDriverException: Message: Error - Unable to load Atom 'execute_script' from file ':/ghostdriver/./third_party/webdriver-atoms/execute_script.js'
essandess commented 7 years ago

It appears that you've installed the libraries in a nonstandard directory, and that ghostdriver isn't installed in this directory.

Here's the test SSCCE code. Use this to test your install directories.

P.S. Do another pull as well. I put the memory clearing code in a try block.
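(Since ghostdriver's webdriver atoms are bundled inside the phantomjs binary itself, a stripped or mismatched build on PATH is one likely cause of the "Unable to load Atom" errors. A quick stdlib check, illustrative only, to confirm which binary is actually being resolved:)

```python
import shutil
import subprocess

def describe_binary(name):
    """Report which executable a name resolves to on PATH, or None if absent."""
    return shutil.which(name)

# Check which phantomjs the virtualenv/PATH actually picks up, then ask it
# for its version directly, bypassing selenium/ghostdriver entirely.
path = describe_binary("phantomjs")
if path is not None:
    version = subprocess.check_output([path, "--version"]).decode().strip()
    print(path, version)
else:
    print("phantomjs not found on PATH")
```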

isaaclw commented 7 years ago

Something you did seemed to fix it?

It's running now, though I think phantomjs is still periodically hanging... just not spawning a whole lot of processes to take over.

essandess commented 7 years ago

Great! I refactored all phantomjs-related calls to handle hangs robustly.

I also see phantomjs hangs occasionally, but that's okay so long as the script correctly detects zombie state, kills the process cleanly, and relaunches a new instance.

Would you please confirm if the script runs successfully for a few days on a Raspberry Pi?

I'd like to add a daemon wrapper and a bandwidth tracking controller once the core functionality is solid on a few platforms.
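The detect-zombie-kill-relaunch pattern described above can be sketched with the stdlib alone (this is a minimal illustration using `sleep` as a stand-in for a phantomjs worker, not the repository's actual implementation):

```python
import subprocess
import time

def relaunch_if_dead(proc, cmd):
    """Reap the child if it has exited (avoiding a defunct/zombie entry)
    and start a fresh instance; otherwise keep the running one."""
    if proc.poll() is not None:   # poll() reaps the child if it has exited
        return subprocess.Popen(cmd)
    return proc

# Demo with a short-lived stand-in process:
cmd = ["sleep", "0.1"]
proc = subprocess.Popen(cmd)
time.sleep(0.5)                   # child exited; unreaped, it would sit <defunct>
proc = relaunch_if_dead(proc, cmd)
print(proc.poll())                # new child just started -> None
proc.wait()                       # clean up the demo process
```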

isaaclw commented 7 years ago

I'm actually not using a raspberry pi. I have a NAS server (8-core, 16G ram, ~5TB).

But I'm still getting processes like this: [phantomjs]

Are you using linux? I wish I had more time to debug this.

Edit: one of the phantomjs processes has been defunct for 4 hours, and during that time there has been no more output from the isp_data_pollution.py script.

isaaclw commented 7 years ago

Would you please confirm if the script runs successfully for a few days on a Raspberry Pi?

I fired up a Raspberry Pi too, but unfortunately it looks like I won't be able to test on it. The only version of phantomjs in the Raspbian repository is 1.4.1, and for some reason jessie-backports doesn't work with Raspbian.

I'll see if I can find a way to compile a newer version of phantomjs on Raspbian.

essandess commented 7 years ago

But I'm still getting processes like this: [phantomjs]

I'm using BSD. My runs go on for days and crawl up to the maximum link limit.

Would you please run in debug mode and try to isolate the block where phantomjs hangs?

python isp_data_pollution.py -g
isaaclw commented 7 years ago

I was going to show you the whole log when I woke up this morning... but while I was figuring out how to copy/paste in tmux, it scrolled past the tmux buffer.

Here's the end of it: https://pastebin.com/8EEgCA7W

A lot of this.

<urlopen error [Errno 111] Connection refused>
Seeding with search for 'squashberry proponent barnstorm Hodgkin'…
.pollute() exception:

I'm not sure if there's any message about why it crashed in an earlier part of the error message that was cleared. I'll re-run the script to get the whole message.

isaaclw commented 7 years ago

Here's the "full" log. Actually, the full log is 175MB, because I didn't stop it once it disconnected, so the last line set is just repeated, for another 4,640,711 lines...

https://pastebin.com/hZmpENfX

essandess commented 7 years ago

There's some issue with your phantomjs install:

.find_element_by_tag_name() exception:
Message: Error - Unable to load Atom 'find_elements' from file ':/ghostdriver/./third_party/webdriver-atoms/find_elements.js'

Have you successfully run the SSCCE?

Also, please use the latest commit—I wrapped all phantomjs calls within signal timeouts to prevent any hangs.
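The signal-timeout wrapper described here can be sketched roughly like this (a minimal illustration, not the repository's actual code; SIGALRM only works on Unix, in the main thread):

```python
import signal
import time

class Timeout:
    """Context manager that raises TimeoutError if the block runs too long."""
    def __init__(self, seconds):
        self.seconds = seconds
    def _handler(self, signum, frame):
        raise TimeoutError("call timed out after %d s" % self.seconds)
    def __enter__(self):
        self.old = signal.signal(signal.SIGALRM, self._handler)
        signal.alarm(self.seconds)
    def __exit__(self, *exc):
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, self.old)  # restore previous handler

# Usage: wrap any call that might hang, e.g. a webdriver GET
try:
    with Timeout(1):
        time.sleep(5)                            # stand-in for a hung phantomjs call
except TimeoutError as e:
    print("recovered:", e)
```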

isaaclw commented 7 years ago

If you put the SSCCE in the repo, I can add the 'display' parts to it.

I'll test it.

isaaclw commented 7 years ago
(isp-pollute)isaac@tesla:~/src/isp-data-pollution$ python sscce.py
phantomjs version is 2.1.1, ghostdriver version is 1.2.0
Traceback (most recent call last):
  File "sscce.py", line 37, in <module>
    for link in [ div.find_element_by_tag_name('a').get_attribute('href') for div in driver.find_elements_by_css_selector('div.g') if div.find_element_by_tag_name('a').get_attribute('href') is not None ]:
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 462, in find_elements_by_css_selector
    return self.find_elements(by=By.CSS_SELECTOR, value=css_selector)
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 810, in find_elements
    'value': value})['value']
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/errorhandler.py", line 164, in check_response
    raise exception_class(value)
selenium.common.exceptions.WebDriverException: Message: Error - Unable to load Atom 'find_elements' from file ':/ghostdriver/./third_party/webdriver-atoms/find_elements.js'
essandess commented 7 years ago

You'll need to install a working phantomjs and ghostdriver that are able to parse links from a web page.

essandess commented 7 years ago

Is this still an issue?

isaaclw commented 7 years ago

I still haven't managed to get this to work on that headless box, since the phantomjs re-write you did.

The re-write seemed to handle culling stale jobs, but now the program doesn't run at all.

I've been too busy to find a version compiled that works.

But anyway, just so you know, this still doesn't work for my setup, which might be abnormal.

(isp-pollute)isaac@tesla:~/src/isp-data-pollution$ python isp_data_pollution.py
This is ISP Data Pollution 🐙💨, Version 1.1
Downloading the blacklist… done.
Display format:
Downloading: website.com; NNNNN links [in library], H(domain)= B bits [entropy]
Downloaded:  website.com: +LLL/NNNNN links [added], H(domain)= B bits [entropy]

Traceback (most recent call last):
  File "isp_data_pollution.py", line 901, in <module>
    ISPDataPollution()
  File "isp_data_pollution.py", line 194, in __init__
    self.pollute_forever()
  File "isp_data_pollution.py", line 377, in pollute_forever
    self.seed_links()
  File "isp_data_pollution.py", line 446, in seed_links
    self.get_websearch(word)
  File "isp_data_pollution.py", line 661, in get_websearch
    if self.link_count() < self.max_links_cached: self.add_url_links(new_links,url)
  File "isp_data_pollution.py", line 765, in add_url_links
    self.print_progress(current_url,num_links=k)
  File "isp_data_pollution.py", line 776, in print_progress
    self.print_truncated_line(url,text_suffix)
  File "isp_data_pollution.py", line 793, in print_truncated_line
    if len(url) + chars_used > terminal_width:
TypeError: object of type 'NoneType' has no len()

(I installed pyopenssl to meet the additional requirements)
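(The TypeError itself suggests a missing guard: print_truncated_line receives url=None, presumably because the GET failed, and then calls len(url). One possible defensive fix, sketched with an illustrative signature; the real function in isp_data_pollution.py differs:)

```python
def print_truncated_line(url, text_suffix, terminal_width=80, chars_used=30):
    """Illustrative guard: treat a missing URL as an empty string before
    measuring it, so progress printing never dies on None."""
    url = url if url is not None else ""
    if len(url) + chars_used > terminal_width:
        # Truncate and mark with an ellipsis to fit the terminal
        url = url[: max(0, terminal_width - chars_used - 1)] + "…"
    line = url + text_suffix
    print(line)
    return line
```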

essandess commented 7 years ago

This appears to be the same issue as before: phantomjs isn't working. Have you verified that the basic SSCCE works? Crawling won't work unless phantomjs does.