Closed vinnytroia closed 4 years ago
Seeing this since last week:
DMArchiver 0.2.5
Running on Python 3.7.4 (default, Jul 9 2019, 18:14:44) [Clang 9.0.0 (clang-900.0.39.2)]
Traceback (most recent call last):
File "dmarchiver/cmdline.py", line 134, in
Using macOS 10.15.3
Same. It’s not just Mac though. I get the same on Linux. Twitter may have changed something.
Same on Windows 10, I hope we'll get a fix of this :(
This fixed it for me. Twitter now shows a splash page that asks whether you want to be taken to legacy Twitter if your scraper doesn't have JS enabled. We could either run a headless Selenium webdriver (huge headache), or do this. It's not the cleanest solution, but it works, and the session saving still works, so... shrug emoji. I can open a pull request, but I imagine @Mincka may want to refactor this a bit instead.
In core.py, under class Crawler(object):, add _login_headers beneath _http_headers:
_http_headers = {
    'User-Agent': _user_agent}
_login_headers = {
    'User-Agent': _user_agent,
    'Referer': 'https://mobile.twitter.com/login'}
Add force_nojs at the start of def authenticate:
def authenticate(self, username, password, save_session, raw_output):
    force_nojs = 'https://mobile.twitter.com/i/nojs_router?path=%2Flogin'
    login_url = self._twitter_base_url + '/login'
And this is the meat of the changes, just a bit further down:
response = self._session.post(
    force_nojs,
    headers=self._login_headers)
[...]
document = lxml.html.document_fromstring(response.content)
Sending that POST request (instead of a GET, as before) to the nojs redirect is enough to get what lxml needs to parse. For whatever reason, lxml.html now needs response.content instead of response.text, too.
I only noticed this issue because I had to change my password overnight so all of my sessions were invalidated. Good reminder to use the session saving feature! (Which I tested to still work, and requires that _http_headers stays static, i.e. no referer in the header.)
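For anyone adapting the change above, here is a minimal sketch of how the force_nojs URL and the login-only headers fit together. The helper names build_nojs_url and build_login_headers are hypothetical and not part of DMArchiver; only the URL, the header keys, and the Referer value come from the snippets above. It builds the values without sending any request.

```python
from urllib.parse import quote

# Base URL taken from the snippets above.
MOBILE_BASE_URL = 'https://mobile.twitter.com'


def build_nojs_url(path, base_url=MOBILE_BASE_URL):
    """Return the nojs_router URL for `path` (hypothetical helper)."""
    # safe='' makes quote() percent-encode '/' too, so '/login' -> '%2Flogin'.
    return base_url + '/i/nojs_router?path=' + quote(path, safe='')


def build_login_headers(user_agent, base_url=MOBILE_BASE_URL):
    """Return headers for the login request only (hypothetical helper).

    The Referer is kept out of the general-purpose headers, since the
    comment above notes that session saving needs those to stay static.
    """
    return {
        'User-Agent': user_agent,
        'Referer': base_url + '/login',
    }


force_nojs = build_nojs_url('/login')
print(force_nojs)
```

The printed URL matches the hard-coded force_nojs value from the fix, so the two can be swapped if a different target path is ever needed.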
Thank you so much @cajuncooks for taking the time to investigate and find a workaround. I've implemented your fix as described and published the version 0.2.6 for all platforms! 🎉
Unfortunately, this issue does not seem to be fixed on my side. No logs are created, just the same error, now on 0.2.6:
DMArchiver 0.2.6
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 134, in <module>
File "dmarchiver\cmdline.py", line 97, in main
File "dmarchiver\core.py", line 316, in authenticate
IndexError: list index out of range
Failed to execute script cmdline
Weirdly enough, it is running Python 3.4.4, while I only have 3.9.4 installed.
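A note on the IndexError itself: the thread does not show what core.py does at that line, but "list index out of range" is what you get when code takes element [0] of a lookup that found nothing, for example when the login page returned is the splash page rather than the form. The sketch below illustrates a guarded lookup under that assumption. It uses the standard library's html.parser instead of lxml so it runs without extra packages, and the field name authenticity_token, the class InputValueCollector, and the function first_input_value are all assumptions for illustration, not DMArchiver code.

```python
from html.parser import HTMLParser


class InputValueCollector(HTMLParser):
    """Collect the value attribute of every <input> with a given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        if attributes.get('name') == self.field_name:
            # A bare attribute without a value is reported as None.
            self.values.append(attributes.get('value') or '')


def first_input_value(page_html, field_name):
    """Return the first matching input value, or raise a descriptive error."""
    collector = InputValueCollector(field_name)
    collector.feed(page_html)
    if not collector.values:
        # Without this check, collector.values[0] would raise a bare
        # IndexError like the one in the traceback above.
        raise RuntimeError(
            'No <input name="%s"> found; the page may not be the login form'
            % field_name)
    return collector.values[0]


# Assumed field name, for illustration only.
login_form = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
print(first_input_value(login_form, 'authenticity_token'))
```

A check like this would not fix the login, but it would turn the bare IndexError into a message that says which element was missing from the page.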
DMArchiver 0.2.5
Running on Python 3.6.8 (default, Aug 7 2019, 17:28:10) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Traceback (most recent call last):
File "/usr/local/bin/dmarchiver", line 11, in
load_entry_point('dmarchiver==0.2.5', 'console_scripts', 'dmarchiver')()
File "/usr/local/lib/python3.6/site-packages/dmarchiver-0.2.5-py3.6.egg/dmarchiver/cmdline.py", line 97, in main
File "/usr/local/lib/python3.6/site-packages/dmarchiver-0.2.5-py3.6.egg/dmarchiver/core.py", line 312, in authenticate
IndexError: list index out of range
Any idea what might be causing this?
I have tried to circumvent this by specifying the id as a parameter, but it does not help.
Thanks