dhellmann / link_scrubber

Bookmark cleaner

UnicodeDecodeErrors on some bookmarks #3

Open Mekk opened 10 years ago

Mekk commented 10 years ago

While processing some of my bookmarks I get the following error.

Exception in thread check-bookmarks-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/linkscrubber/cmd/redirects.py", line 169, in _check_bookmarks_worker
    (bm['href'], bm['description'], err))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 186: ordinal not in range(128)

I am not sure about the reason: it happens on a minority of bookmarks, and those mentioned in the „changing ... to ...” messages above the errors seem rather innocent and no different from the ones processed successfully. Initially I suspected that Polish characters in descriptions might matter, but some of the erroneous bookmarks are English, and some Polish ones are processed properly. That is, assuming each error is related to the bookmark mentioned directly before it.
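
For context, this is the message Python 2 produces when a byte string containing non-ASCII bytes is combined with a unicode string: the byte string is first decoded implicitly as ASCII. The traceback does not show which of the values formatted at line 169 is the byte string. The sketch below only reproduces the decode step explicitly, so it behaves the same on Python 2 and 3; the `raw` value is made up for illustration and is not data from linkscrubber.

```python
# -*- coding: utf-8 -*-
# Sketch only: `raw` is a made-up value, not data taken from linkscrubber.
raw = u"O płaceniu za informacje".encode("utf-8")  # UTF-8 bytes; "ł" is 0xc5 0x82

try:
    # Python 2 performs this decode implicitly when bytes are mixed with unicode.
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The printed message has the same shape as the one in the traceback; only the reported position depends on the string being decoded.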

dhellmann commented 10 years ago

I wonder if the error is in the processing of the return value. Could you send me one bookmark that shows this problem, with the description and tags you use, so I can add it to my test account to play around with it?

Mekk commented 10 years ago

I have problems reproducing this case because I now get a different error, which I will report separately (or perhaps because your script fixed some troublesome bookmark before failing; I can't say).

I will retry on the same machine as before; perhaps that will change something.

Mekk commented 10 years ago

OK, it has reappeared. I am not sure which bookmark is causing the problem, but here is the transcript of a non-debug run:

processing http://feeds.dashes.com/~r/AnilDash/~3/Nt55jYUwfIk/if-your-websites-full-of-assholes-its-your-fault.html (If your website's full of assholes, it's your fault)
processing http://feeds.technologyreview.com/click.phdo?i=9fda9da68286d3c70caa47b7fe70d78f (Blog - Pioneer Anomaly Solved By 1970s Computer Graphics Technique)
processing http://feeds.bradkellett.com/~r/bradkellett/~3/DpaZxsCbbOs/ (PlaceWidget and Cross-Domain iFrame Javascript)
processing http://feedproxy.google.com/~r/PawelTkaczyk/~3/yhJ_rZPUmJ8/ (O płaceniu za informacje w internecie)
processing http://feedproxy.google.com/~r/PawelTkaczyk/~3/QbR8CiMIclE/ (Marketing case study: Rebranding IGAZ)
processing http://feedproxy.google.com/~r/PawelTkaczyk/~3/QbR8CiMIclE (Marketing case study: Rebranding IGAZ)
processing http://feedproxy.google.com/~r/PawelTkaczyk/~3/uTzBb9NMreA/ (Podcast „Mała Wielka Firma”, odc. 22: Dzielenie się wiedzą)
processing http://feedproxy.google.com/~r/catonmat/~3/nPBz088G5cQ/ (ldd arbitrary code execution)
processing http://feeds.webaudit.pl/~r/WebauditBlog/~3/xvRI0ZwOA3c/ (ABC marketingu wirusowego: Łącznicy, znawcy rynku i sprzedawcy)
processing http://feeds.harvardbusiness.org/~r/harvardbusiness/~3/h1UJ5CwB5VI/when-a-colleagues-mistakes-aff.html (When a Colleague's Mistakes Affect You)
processing http://feeds.sfgate.com/click.phdo?i=277f7d373a1c526994eeef01a0c7bc5c (Crap Detection 101)
processing http://feeds.bradkellett.com/~r/bradkellett/~3/W7rwwsRkFZw/ (Review of the Fever RSS Reader)
processing http://feedproxy.google.com/~r/jayfields/mjKQ/~3/CIKOGOXYrb8/programmer-confidence-and-arrogance.html (Programmer Confidence and Arrogance)
processing http://feeds.mekk.waw.pl/NotatnikZapisywanyWieczorami (Notatnik zapisywany wieczorami)
processing http://feeds.mekk.waw.pl/MekksBlog?format=xml (Mekk's blog)
processing http://feeds.feedburner.com/~u/Mekkk (FeedBulletin for: Mekkk)
processing http://feeds.feedburner.com/support/uwe-allposts (FeedBurner Support)
processing http://feeds.feedburner.com/JungleDisk (Jungle Disk)
processing http://feeds.feedburner.com/CssTricks (CSS-Tricks)
processing http://feeds.feedburner.com/Kiwitobescom (kiwitobes.com)
found 20 posts to process
Exception in thread check-bookmarks-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/linkscrubber/cmd/redirects.py", line 169, in _check_bookmarks_worker
    (bm['href'], bm['description'], err))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 186: ordinal not in range(128)

changing http://feeds.dashes.com/~r/AnilDash/~3/Nt55jYUwfIk/if-your-websites-full-of-assholes-its-your-fault.html to http://dashes.com/anil/2011/07/if-your-websites-full-of-assholes-its-your-fault.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+AnilDash+%28Anil+Dash%29
Exception in thread check-bookmarks-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/linkscrubber/cmd/redirects.py", line 169, in _check_bookmarks_worker
    (bm['href'], bm['description'], err))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 186: ordinal not in range(128)

(Ctrl-C)

Mekk commented 10 years ago

With --debug:

changing http://feeds.webaudit.pl/~r/WebauditBlog/~3/xvRI0ZwOA3c/ to http://www.webaudit.pl/blog/2009/abc-marketingu-wirusowego-lacznicy-znawcy-rynku-i-sprzedawcy/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+WebauditBlog+%28WebAudit+Blog%29
Exception in thread check-bookmarks-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/linkscrubber/cmd/redirects.py", line 169, in _check_bookmarks_worker
    (bm['href'], bm['description'], err))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 186: ordinal not in range(128)

changing http://feeds.harvardbusiness.org/~r/harvardbusiness/~3/h1UJ5CwB5VI/when-a-colleagues-mistakes-aff.html to http://blogs.harvardbusiness.org/hmu/2009/10/when-a-colleagues-mistakes-aff.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+harvardbusiness+%28HBR.org%29
Exception in thread check-bookmarks-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/linkscrubber/cmd/redirects.py", line 169, in _check_bookmarks_worker
    (bm['href'], bm['description'], err))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 186: ordinal not in range(128)

Mekk commented 10 years ago

The first of those links is likely visible to you: https://pinboard.in/search/u:Mekk?query=abc-marketingu-wirusowego-lacznicy-znawcy-rynku-i-sprzedawcy (if not: it has an empty description, ASCII tags "liderzy-opinii marketing rdrozd rodzaje-konsumentow" and the title "ABC marketingu wirusowego: Łącznicy, znawcy rynku i sprzedawcy : WebAudit Blog", which contains some non-ASCII characters).

Mekk commented 10 years ago

The second link is also public: https://pinboard.in/search/u:Mekk?query=when-a-colleagues-mistakes-aff

Here the title is ASCII: "When a Colleague's Mistakes Affect You - Management Essentials - HarvardBusiness.org", and some tags are non-ASCII: "błędy colleague cooperation mistakes napominanie reaction współpraca". The description is empty.
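
One quick consistency check on the two bookmarks above is to look at the first non-ASCII byte of their UTF-8 encoded title and tags. The sketch below does that with a hypothetical helper, `first_non_ascii_byte`, which is not part of linkscrubber. Position 186 in the traceback refers to a longer string that is not shown in this thread, so this can only show that the reported byte value is plausible, not pinpoint the failing field.

```python
# -*- coding: utf-8 -*-
def first_non_ascii_byte(text):
    """Return (index, byte) of the first non-ASCII byte in the UTF-8 form of text.

    Hypothetical helper for illustration; returns None if the text is pure ASCII.
    """
    for index, byte in enumerate(bytearray(text.encode("utf-8"))):
        if byte > 0x7F:
            return index, byte
    return None


# Values copied from the two bookmark descriptions in this thread.
title = u"ABC marketingu wirusowego: Łącznicy, znawcy rynku i sprzedawcy : WebAudit Blog"
tags = u"błędy colleague cooperation mistakes napominanie reaction współpraca"

for label, value in (("title", title), ("tags", tags)):
    index, byte = first_non_ascii_byte(value)
    print("%s: byte 0x%02x at offset %d" % (label, byte, index))
```

In both strings the first non-ASCII byte is 0xc5, the UTF-8 lead byte shared by "Ł" and "ł", which matches the byte named in the error.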

(Judging by the dates, both links were imported from Delicious, but that likely does not matter.)
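
A common way to avoid this class of error is to decode byte strings explicitly before formatting them into a message. The following is a minimal sketch under the assumption that the bookmark fields arrive as UTF-8 encoded bytes; `to_text`, the sample `bm` dictionary and the example.com URL are hypothetical and do not reflect how linkscrubber actually handles the data.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings explicitly, pass text through."""
    if isinstance(value, bytes):
        # "replace" keeps the message printable even if the bytes are not valid UTF-8.
        return value.decode(encoding, "replace")
    return value


# Made-up bookmark record, assumed to hold UTF-8 encoded bytes.
bm = {
    "href": b"http://example.com/",
    "description": u"Łącznicy".encode("utf-8"),
}

# Every argument is text before formatting, so no implicit ASCII decode is needed.
line = u"%s (%s)" % (to_text(bm["href"]), to_text(bm["description"]))
print(line)
```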