annegerben / oversight

Automatically exported from code.google.com/p/oversight

Russian scraper bug adds the Russian letter "В" (not the English "B") before em-dashes (—) and ellipses (…) #554

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
The Russian Kinopoisk scraper has a bug: it adds the Russian letter "В" (not
the English "B") before em-dash (—) and ellipsis (…) characters.
For example, the plot scraped from
http://www.kinopoisk.ru/level/1/film/43869/ looks like the attached
screenshot. The logs show that the text is already scraped this way, so it is
not a display bug. You can try the filename Служебный.роман.1977.avi
yourself.
Russian plots for TV shows from thetvdb.com don't seem to have this problem.

Original issue reported on code.google.com by lordylo...@gmail.com on 12 Jan 2011 at 10:15

GoogleCodeExporter commented 8 years ago
Might be related to Issue 526. Check the codepage: the page is probably not UTF-8.

Original comment by lordylo...@gmail.com on 12 Jan 2011 at 10:21

GoogleCodeExporter commented 8 years ago
With the latest version:

[—] (decimal 8212, hex 2014) is scraped as [—] (decimal 151, hex 0097)

Original comment by a...@lordy.org.uk on 11 Jun 2011 at 11:10
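
The two numbers reported above can be checked directly. The following is a small Python sketch, used purely for illustration (the scraper itself is not written in Python, and the variable names are made up for the example). It shows that 8212 is the Unicode code point of the em-dash, while 151 is the single byte that stands for the same character in windows-1251. Read as a code point, 151 is a C1 control character rather than a dash.

```python
import unicodedata

# U+2014 is the Unicode em-dash: decimal 8212, hex 2014.
em_dash = "\u2014"
code_point = ord(em_dash)

# In windows-1251 the em-dash is stored as the single byte 0x97 (decimal 151).
cp1251_bytes = em_dash.encode("cp1251")

# Decoding byte 151 with the right charset gives the em-dash back.
round_trip = bytes([151]).decode("cp1251")

# Treating 151 as a Unicode code point instead gives U+0097,
# whose general category is "Cc" (control character).
category_of_151 = unicodedata.category(chr(151))

print(code_point, hex(code_point), cp1251_bytes, round_trip == em_dash, category_of_151)
```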

GoogleCodeExporter commented 8 years ago
The additional [В] has gone, probably after the fix to use the UTF-8 composed form?

Original comment by a...@lordy.org.uk on 11 Jun 2011 at 11:11

GoogleCodeExporter commented 8 years ago
The source uses an HTML escape for the em-dash (rendered here as —), e.g.
управления, — человек

but the page also has Russian characters, with the following charset:

<meta http-equiv="content-type" content="text/html; charset=windows-1251" />

Original comment by a...@lordy.org.uk on 11 Jun 2011 at 11:16
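
Given the windows-1251 charset, one plausible explanation for the stray "В" is a double conversion. This is an assumption and is not confirmed anywhere in this thread. If code point 151 is first written out as UTF-8 (bytes C2 97) and those bytes are then read as windows-1251, byte C2 becomes the Cyrillic "В" and byte 97 becomes the em-dash. The Python sketch below reproduces that effect; `mis_decode` is a hypothetical helper, not part of the scraper.

```python
def mis_decode(code_point: int) -> str:
    """Encode a code point as UTF-8, then wrongly decode the bytes as windows-1251."""
    utf8_bytes = chr(code_point).encode("utf-8")  # e.g. 0x97 -> b"\xc2\x97"
    return utf8_bytes.decode("cp1251")            # 0xC2 -> "В", 0x97 -> em-dash


# Code point 151 (0x97) turns into "В" followed by an em-dash.
dash_result = mis_decode(0x97)

# Code point 133 (0x85) turns into "В" followed by an ellipsis.
ellipsis_result = mis_decode(0x85)

print(dash_result, ellipsis_result)
```

Both affected characters sit in the 0x80 to 0x9F range, which would explain why only the em-dash and the ellipsis pick up the extra letter.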

GoogleCodeExporter commented 8 years ago
This issue was closed by revision r1978.

Original comment by lordylo...@gmail.com on 12 Jun 2011 at 12:39

GoogleCodeExporter commented 8 years ago
Fixed in r1978.

Removed all awk-based encoding. Use iconv or fail.
Replace all HTML numeric escapes before the iconv conversion, e.g. —
Changed HTML numeric escapes to convert to 8-bit and not UTF-8 (to avoid
double conversion).

Original comment by a...@lordy.org.uk on 12 Jun 2011 at 12:39
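
The order of operations described in the fix can be sketched as follows. This is a minimal Python illustration, not the code from r1978: Python's codecs stand in for iconv, and the function names, the regular expression and the `cp1251` default are assumptions made for the example. Numeric escapes are first replaced by raw 8-bit bytes in the page charset, and only then is the whole text converted to UTF-8, once.

```python
import re

# Matches decimal HTML numeric escapes such as &#151; in raw (undecoded) page bytes.
NUMERIC_ESCAPE = re.compile(rb"&#(\d+);")


def expand_escapes_8bit(raw: bytes, charset: str = "cp1251") -> bytes:
    """Replace numeric escapes with bytes in the page charset, not with UTF-8."""
    def to_bytes(match):
        code = int(match.group(1))
        if code < 256:
            # 8-bit value: emit it as a single raw byte (151 -> 0x97).
            return bytes([code])
        try:
            # Larger code points: use the page charset's byte if it has one.
            return chr(code).encode(charset)
        except (ValueError, OverflowError):
            # Not representable in the charset: leave the escape untouched.
            return match.group(0)

    return NUMERIC_ESCAPE.sub(to_bytes, raw)


def scrape_to_utf8(raw: bytes, charset: str = "cp1251") -> bytes:
    """Expand escapes first, then convert the whole text to UTF-8 exactly once."""
    expanded = expand_escapes_8bit(raw, charset)
    return expanded.decode(charset, errors="replace").encode("utf-8")


# Sample modelled on the plot text quoted earlier in this thread.
raw_page = "управления, ".encode("cp1251") + b"&#151; " + "человек".encode("cp1251")
print(scrape_to_utf8(raw_page).decode("utf-8"))
```

Because the escape becomes byte 0x97 before the charset conversion, the em-dash goes through the same single windows-1251 to UTF-8 step as the surrounding Cyrillic text.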