isaaclw closed this issue 7 years ago.
This is an issue for the phantomjs distro on jessie.
I suggest posting these issues on the phantomjs Issues page.
It would be good to resolve this because Raspberry Pi is the natural platform for a script like this. Please keep pushing this and let us know if you find the incantations to get successive phantomjs RPi processes to close.
I've refactored phantomjs process control and memory management. Please try the latest commit and see if it fixes the issue you're seeing.
I've verified that it runs faster with a better footprint on my box.
I don't really have a chance to look at it now, but after pip install psutil I'm now getting this error:
Traceback (most recent call last):
File "isp_data_pollution.py", line 562, in <module>
ISPDataPollution()
File "isp_data_pollution.py", line 167, in __init__
self.pollute_forever()
File "isp_data_pollution.py", line 276, in pollute_forever
self.clear_session()
File "isp_data_pollution.py", line 216, in clear_session
self.session.execute_script('window.localStorage.clear();')
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 484, in execute_script
'args': converted_args})['value']
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
self.error_handler.check_response(response)
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/errorhandler.py", line 164, in check_response
raise exception_class(value)
selenium.common.exceptions.WebDriverException: Message: Error - Unable to load Atom 'execute_script' from file ':/ghostdriver/./third_party/webdriver-atoms/execute_script.js'
I added process management to address the multiple phantomjs process issue as well as to achieve some CPU and memory efficiencies.
If necessary, we can fall back to launching one phantomjs per GET.
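That fallback would look roughly like the sketch below (fetch_once is a hypothetical helper, not code from the repo): start a fresh driver per request and always quit it.

# Sketch of the one-phantomjs-per-GET fallback: a new process per request,
# always torn down afterwards. Hypothetical helper, not the repo's code.
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

def fetch_once(url, timeout=30):
    driver = webdriver.PhantomJS()
    driver.set_page_load_timeout(timeout)
    try:
        driver.get(url)
        return driver.page_source
    except WebDriverException:
        return None
    finally:
        driver.quit()                       # make sure the phantomjs process exits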
But let's make sure we're on the same page. What version of phantomjs and ghostdriver are in your stack? Here's mine:
python3 -c 'from selenium import webdriver; driver = webdriver.PhantomJS(); print("phantomjs version is {}, ghostdriver version is {}".format(driver.capabilities["version"],driver.capabilities["driverVersion"]))'
phantomjs version is 2.0.0, ghostdriver version is 1.2.0
I had to put it in the script, since I'm running on a headless computer and need the display set up.
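Roughly along these lines, as a sketch (it assumes pyvirtualdisplay on top of Xvfb; the exact lines in my script may differ):

# Sketch only: run the PhantomJS version check under a virtual X display.
# Assumes the pyvirtualdisplay package and Xvfb are installed; the size is arbitrary.
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 768))
display.start()
try:
    driver = webdriver.PhantomJS()
    print("phantomjs version is {}, ghostdriver version is {}".format(
        driver.capabilities["version"], driver.capabilities["driverVersion"]))
    driver.quit()
finally:
    display.stop()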
My result was:
phantomjs version is 2.1.1, ghostdriver version is 1.2.0
And I got the same crash (after pulling):
Traceback (most recent call last):
File "isp_data_pollution.py", line 585, in <module>
ISPDataPollution()
File "isp_data_pollution.py", line 170, in __init__
self.pollute_forever()
File "isp_data_pollution.py", line 279, in pollute_forever
self.clear_session()
File "isp_data_pollution.py", line 219, in clear_session
self.session.execute_script('window.localStorage.clear();')
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 484, in execute_script
'args': converted_args})['value']
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
self.error_handler.check_response(response)
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/errorhandler.py", line 164, in check_response
raise exception_class(value)
selenium.common.exceptions.WebDriverException: Message: Error - Unable to load Atom 'execute_script' from file ':/ghostdriver/./third_party/webdriver-atoms/execute_script.js'
It appears that you've installed the libraries in a nonstandard directory, and that ghostdriver isn't installed in this directory.
Here's the test SSCCE code. Use this to test your install directories.
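(The SSCCE itself isn't reproduced in this thread; reconstructed from the find_elements traceback that appears later, it looks roughly like the following sketch. The search URL and the div.g selector are inferred from that traceback, not copied from the repo's sscce.py.)

# Rough reconstruction of the SSCCE; the URL and selector are assumptions.
from selenium import webdriver

driver = webdriver.PhantomJS()
print("phantomjs version is {}, ghostdriver version is {}".format(
    driver.capabilities["version"], driver.capabilities["driverVersion"]))

# A search page with div.g result blocks exercises the ghostdriver
# webdriver-atoms (find_elements, execute_script) that fail elsewhere in this thread.
driver.get('https://www.google.com/search?q=test&safe=active')
for link in [div.find_element_by_tag_name('a').get_attribute('href')
             for div in driver.find_elements_by_css_selector('div.g')
             if div.find_element_by_tag_name('a').get_attribute('href') is not None]:
    print(link)
driver.quit()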
P.S. Do another pull as well. I put the memory clearing code in a try block.
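The shape of that change is roughly this (a sketch, not the exact commit; clear_local_storage is a hypothetical stand-in for the repo's clear_session):

# Sketch of the idea: tolerate ghostdriver failures when clearing storage
# instead of letting them crash the crawl loop.
from selenium.common.exceptions import WebDriverException

def clear_local_storage(driver):
    try:
        driver.execute_script('window.localStorage.clear();')
    except WebDriverException as e:
        print('.clear_session() exception:\n{}'.format(e))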
Something you did seemed to fix it?
It's running now, though I think phantomjs is still periodically hanging... just not spawning a whole lot of processes to take over.
Great! I refactored all phantomjs-related calls to handle hangs robustly.
I also see phantomjs hangs occasionally, but that's okay so long as the script correctly detects the zombie state, kills the process cleanly, and relaunches a new instance.
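With psutil in the stack, that zombie check is roughly this kind of helper (the name is hypothetical; not the repo's exact code):

# Sketch: replace the driver if its phantomjs child has gone defunct.
import psutil
from selenium import webdriver

def relaunch_if_zombie(driver):
    pid = driver.service.process.pid                  # phantomjs owned by this driver
    if psutil.Process(pid).status() == psutil.STATUS_ZOMBIE:
        try:
            driver.quit()                             # reaps the defunct child
        except Exception:
            pass
        driver = webdriver.PhantomJS()                # fresh instance
    return driver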
Would you please confirm if the script runs successfully for a few days on a Raspberry Pi?
I'd like to add a daemon wrapper and a bandwidth tracking controller once the core functionality is solid on a few platforms.
I'm actually not using a Raspberry Pi. I have a NAS server (8-core, 16 GB RAM, ~5 TB).
But I'm still getting processes like this:
[phantomjs]
Are you using Linux? I wish I had more time to debug this.
Edit: one of the phantomjs processes has been defunct for 4 hours, and during that time there has been no further output from the isp_data_pollution.py script.
Would you please confirm if the script runs successfully for a few days on a Raspberry Pi?
I fired up a Raspberry Pi also, and unfortunately it looks like I won't be able to test on it. The only version of phantomjs in the Raspbian repository is 1.4.1, and for some reason jessie-backports doesn't work with Raspbian.
I'll see if I can find a way to compile a newer version of phantomjs on Raspbian.
But I'm still getting processes like this: [phantomjs]
I'm using BSD. My runs go on for days and crawl up to the maximum link limit.
Would you please run in debug mode and try to isolate the block where phantomjs hangs?
python isp_data_pollution.py -g
I was going to show you the whole log when I woke up this morning... but while I was figuring out how to copy/paste in tmux, it overflowed past the tmux buffer.
Here's the end of it: https://pastebin.com/8EEgCA7W
A lot of this:
<urlopen error [Errno 111] Connection refused>
Seeding with search for 'squashberry proponent barnstorm Hodgkin'…
.pollute() exception:
I'm not sure if there's any message about why it crashed in an earlier part of the error message that was cleared. I'll re-run the script to get the whole message.
Here's the "full" log. Actually, the full log is 175 MB, because I didn't stop it once it disconnected, so the last set of lines just repeats for another 4,640,711 lines...
There's some issue with your phantomjs install:
.find_element_by_tag_name() exception:
Message: Error - Unable to load Atom 'find_elements' from file ':/ghostdriver/./third_party/webdriver-atoms/find_elements.js'
Have you successfully run the SSCCE?
Also, please use the latest commit; I wrapped all phantomjs calls within signal timeouts to prevent any hangs.
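The timeout wrapper is roughly of this shape (a sketch; Unix-only SIGALRM, and the repo's version may differ):

# Sketch of a SIGALRM-based time limit around blocking webdriver calls.
import signal
from contextlib import contextmanager

@contextmanager
def time_limit(seconds):
    def handler(signum, frame):
        raise TimeoutError('phantomjs call timed out after {}s'.format(seconds))
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)                      # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)

# e.g. abandon a hung page load instead of blocking forever:
# with time_limit(30):
#     driver.get(url)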
If you put the SSCCE in the repo, I can add the 'display' parts to it.
I'll test it.
(isp-pollute)isaac@tesla:~/src/isp-data-pollution$ python sscce.py
phantomjs version is 2.1.1, ghostdriver version is 1.2.0
Traceback (most recent call last):
File "sscce.py", line 37, in <module>
for link in [ div.find_element_by_tag_name('a').get_attribute('href') for div in driver.find_elements_by_css_selector('div.g') if div.find_element_by_tag_name('a').get_attribute('href') is not None ]:
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 462, in find_elements_by_css_selector
return self.find_elements(by=By.CSS_SELECTOR, value=css_selector)
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 810, in find_elements
'value': value})['value']
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
self.error_handler.check_response(response)
File "/home/isaac/.virtualenvs/isp-pollute/lib/python3.4/site-packages/selenium/webdriver/remote/errorhandler.py", line 164, in check_response
raise exception_class(value)
selenium.common.exceptions.WebDriverException: Message: Error - Unable to load Atom 'find_elements' from file ':/ghostdriver/./third_party/webdriver-atoms/find_elements.js'
You'll need to install a working phantomjs and ghostdriver that are able to parse links from a web page.
Is this still an issue?
I still haven't managed to get this to work on that headless box since the phantomjs rewrite you did.
The rewrite seemed to handle culling stale jobs, but now the program doesn't run at all.
I've been too busy to find a compiled version that works.
But anyway, just so you know, this still doesn't work for my setup, which might be abnormal.
(isp-pollute)isaac@tesla:~/src/isp-data-pollution$ python isp_data_pollution.py
This is ISP Data Pollution 🐙💨, Version 1.1
Downloading the blacklist… done.
Display format:
Downloading: website.com; NNNNN links [in library], H(domain)= B bits [entropy]
Downloaded: website.com: +LLL/NNNNN links [added], H(domain)= B bits [entropy]
Traceback (most recent call last):
File "isp_data_pollution.py", line 901, in <module>
ISPDataPollution()
File "isp_data_pollution.py", line 194, in __init__
self.pollute_forever()
File "isp_data_pollution.py", line 377, in pollute_forever
self.seed_links()
File "isp_data_pollution.py", line 446, in seed_links
self.get_websearch(word)
File "isp_data_pollution.py", line 661, in get_websearch
if self.link_count() < self.max_links_cached: self.add_url_links(new_links,url)
File "isp_data_pollution.py", line 765, in add_url_links
self.print_progress(current_url,num_links=k)
File "isp_data_pollution.py", line 776, in print_progress
self.print_truncated_line(url,text_suffix)
File "isp_data_pollution.py", line 793, in print_truncated_line
if len(url) + chars_used > terminal_width:
TypeError: object of type 'NoneType' has no len()
(I installed pyopenssl to meet the additional requirements)
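For what it's worth, the traceback points at a None url reaching print_truncated_line(); a minimal defensive guard might look like this sketch (parameter names are taken from the traceback, but the truncation and printing details are assumptions):

# Sketch of a guard against the TypeError above; not the repo's exact code.
def print_truncated_line(url, text_suffix, terminal_width=80, chars_used=0):
    if url is None:                         # a missing URL triggered len(None)
        url = ''
    if len(url) + chars_used > terminal_width:
        url = url[:max(terminal_width - chars_used, 0)]
    print('{}{}'.format(url, text_suffix))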
This is a continuation from https://github.com/essandess/isp-data-pollution/issues/15#issuecomment-292563634