firecat53 / urlscan

Mutt and terminal url selector (similar to urlview)
GNU General Public License v2.0
214 stars 38 forks

IndexError: list index out of range #37

Closed tels7ar closed 5 years ago

tels7ar commented 7 years ago

I get this fairly frequently when running urlscan from mutt:

  File "/usr/local/bin/urlscan", line 4, in <module>
    __import__('pkg_resources').run_script('urlscan==0.8.3', 'urlscan')
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/site-packages/urlscan-0.8.3-py3.6.egg/EGG-INFO/scripts/urlscan", line 161, in <module>
    compact_mode=args.compact)
  File "/usr/local/lib/python3.6/site-packages/urlscan-0.8.3-py3.6.egg/urlscan/urlchoose.py", line 132, in __init__
    compact_mode)
  File "/usr/local/lib/python3.6/site-packages/urlscan-0.8.3-py3.6.egg/urlscan/urlchoose.py", line 57, in process_urls
    for group, usedfirst, usedlast in extractedurls:
  File "/usr/local/lib/python3.6/site-packages/urlscan-0.8.3-py3.6.egg/urlscan/urlscan.py", line 430, in msgurls
    for chunk in msgurls(part, urlidx):
  File "/usr/local/lib/python3.6/site-packages/urlscan-0.8.3-py3.6.egg/urlscan/urlscan.py", line 440, in msgurls
    for chunk in extracthtmlurls(msg):
  File "/usr/local/lib/python3.6/site-packages/urlscan-0.8.3-py3.6.egg/urlscan/urlscan.py", line 408, in extracthtmlurls
    c.feed(s)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/html/parser.py", line 163, in goahead
    self.handle_data(unescape(rawdata[i:j]))
  File "/usr/local/lib/python3.6/site-packages/urlscan-0.8.3-py3.6.egg/urlscan/urlscan.py", line 200, in handle_data
    if self.anchor_stack[-1] is None:
IndexError: list index out of range

firecat53 commented 7 years ago

Hey, can you give me some info about how/when it crashes?

Thanks!

tels7ar commented 7 years ago

I've seen it happen with multiple emails, and it's 100% reproducible - if it crashes on a certain email it always crashes on that email.

Unfortunately the one email I have right now exhibiting the behavior is company confidential.

I'm not doing anything with language or encoding, and my terminal locale is en_US.UTF-8. I set my terminal type to screen-256color because I'm using screen. This happens in both iTerm2 and Terminal on my Mac.

Can you give me any debugging tips or is there anything I can turn on to get more info? I'm unfortunately not much of a python expert.

firecat53 commented 7 years ago

Unfortunately, I've had situations like this before where a particular character in certain emails triggers the crash. If you can find one of the other emails that trigger it, or strip all the confidential information out of this one, please email me a tarred version. It has to be tarred up, because the process of forwarding the email usually 'fixes' the problem on my end so I can't see it. That's the only way I'll be able to troubleshoot.

Urlscan is still a bit fragile when handling foreign characters/encodings.

Thanks, Scott

dilawar commented 7 years ago

I can confirm; I got this error today as well. It often happens when an email is sent in HTML format and my pager has to convert it (I use lynx or elinks to view it, and the error occurs with both).

I don't think there is any foreign character in the email.

firecat53 commented 7 years ago

Again, if someone can tar up the offending email and send it to me, that's the only way I can troubleshoot this. You can copy the actual email file (if you're using maildir) and edit out any private information first. Just make sure before you tar it up that running urlscan on that file directly from the command line (urlscan <path/to/email>) still causes the error.
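
A minimal sketch of the "tar it up" step, assuming the sanitized email is a single file on disk (as in a maildir). The function name `tar_email` and the example paths are hypothetical and not part of urlscan; the point is only that the archive stores the file byte for byte, so the offending characters are not rewritten in transit.

```python
import tarfile
from pathlib import Path


def tar_email(email_path, archive_path):
    """Pack one email file into a gzip-compressed tar archive, unchanged."""
    email_path = Path(email_path)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps only the file name, so local directory names
        # do not end up inside the archive.
        archive.add(str(email_path), arcname=email_path.name)
    return archive_path


# Hypothetical usage; first confirm that `urlscan <path/to/email>`
# still crashes on the sanitized copy:
# tar_email("sanitized.eml", "sanitized.tar.gz")
```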

firecat53 commented 6 years ago

Anyone able to send me a tarred up email that reproduces this error? I use urlscan daily with mutt and haven't seen this at all. Thanks!

firecat53 commented 6 years ago

Closing. If you see this again, please try to tar a sanitized email and send it to me for troubleshooting. I haven't seen the error and I can't reproduce it. Thanks!

sebastianschauenburg commented 5 years ago

@firecat53 Just came across this error. Attaching a somewhat sanitized email and the error itself.

Traceback (most recent call last):
  File "/usr/bin/urlscan", line 134, in <module>
    compact_mode=args.compact)
  File "/usr/lib/python3/dist-packages/urlscan/urlchoose.py", line 133, in __init__
    compact_mode)
  File "/usr/lib/python3/dist-packages/urlscan/urlchoose.py", line 58, in process_urls
    for group, usedfirst, usedlast in extractedurls:
  File "/usr/lib/python3/dist-packages/urlscan/urlscan.py", line 424, in msgurls
    for chunk in extracthtmlurls(msg):
  File "/usr/lib/python3/dist-packages/urlscan/urlscan.py", line 392, in extracthtmlurls
    c.feed(s)
  File "/usr/lib/python3.7/html/parser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib/python3.7/html/parser.py", line 173, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python3.7/html/parser.py", line 421, in parse_endtag
    self.handle_endtag(elem)
  File "/usr/lib/python3/dist-packages/urlscan/urlscan.py", line 186, in handle_endtag
    del self.list_stack[-1]
IndexError: list assignment index out of range

2019-07-25_overheid.nl_email6.txt
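
Both tracebacks in this thread end in the kind of error Python raises when code indexes an empty list: `stack[-1]` gives "list index out of range" and `del stack[-1]` gives "list assignment index out of range". The sketch below shows one way an `html.parser.HTMLParser` subclass might guard against that. The class `AnchorCollector` and its `links` attribute are hypothetical; this is not urlscan's code and not necessarily how the project handled it.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical parser (not urlscan's): collects (href, text) pairs."""

    def __init__(self):
        super().__init__()
        self.anchor_stack = []  # hrefs of the <a> tags currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        # Guard: a stray </a> with no matching <a> must not pop an empty list.
        if tag == "a" and self.anchor_stack:
            self.anchor_stack.pop()

    def handle_data(self, data):
        # Guard: text outside any <a> arrives while the stack is empty.
        if self.anchor_stack and self.anchor_stack[-1] is not None:
            self.links.append((self.anchor_stack[-1], data))
```

Feeding such a parser malformed HTML, for example text before any tag or an unmatched closing tag, then skips the bad spot instead of raising.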

firecat53 commented 5 years ago

Thanks! I'll reopen this. Would it be possible to edit the original email to sanitize it, make sure the sanitized copy still reproduces the error, and then put that into a tar archive and send it or post it? I can't reproduce the error with the file you attached, but I'm not surprised because just the act of uploading it typically cleans the offending character. Putting it into a tar archive should preserve the error.

Thanks!

sebastianschauenburg commented 5 years ago

Thanks for reopening :-)

I've had a deeper look into this. The problem is not with the e-mail (I was sure of that). Ubuntu (and Debian) ship an outdated package of urlscan (0.8.2). That's the issue and that's why you can't reproduce it.

Updating to urlscan 0.9.3 (with pip) and manually installing the corresponding 0.9.3 bin from git almost works, but I get this error message:

ImportError: cannot import name DEVNULL

Which python package should be installed to fix this? P.S.: My workaround currently is modifying urlscan.py to remove that included module.

Edit: this error only applies when python2 is being used. When forcing bin/urlscan to use python3 (instead of using env python), this is not an issue. Might be useful to force python3 usage?
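
For context, `subprocess.DEVNULL` was added in Python 3.3, which is why the import fails under Python 2.7. A common compatibility idiom is sketched below; it is an illustration of the general technique, not the change made in urlscan.

```python
import os

try:
    from subprocess import DEVNULL  # available on Python 3.3 and later
except ImportError:
    # Python 2.7 has no subprocess.DEVNULL, so open the null device directly.
    DEVNULL = open(os.devnull, "wb")
```

Either object can then be passed as `stdout=DEVNULL` or `stderr=DEVNULL` to `subprocess.call` or `subprocess.Popen`.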

firecat53 commented 5 years ago

I'm going to deprecate python 2 here shortly. I missed that DEVNULL isn't available with 2.7. Glad you figured it out though!

doa379 commented 4 years ago

Not fixed yet.

firecat53 commented 4 years ago

Not fixed yet.

Please provide information about urlscan version, python version, actual error and a sanitized and tarred copy of the email that caused the error.

pheuberger commented 4 years ago

@firecat53: Can you recommend the best way to run urlscan with Python 3? I installed it via Homebrew, and the hashbang specifically calls Python 2.7, which crashes urlscan for me every time.

firecat53 commented 4 years ago

@pheuberger For consistency with my other projects, I'm going to change the script shebang to specifically call python 3. I'll put out a new release soon.

Also, please open a new issue when you have questions instead of adding to an existing one :smile: