Closed GoogleCodeExporter closed 8 years ago
Might be related to Issue 526 - check codepage - probably not utf-8
Original comment by lordylo...@gmail.com
on 12 Jan 2011 at 10:21
With the latest version,
[—] (decimal 8212, hex 2014) is scraped as [] (decimal 151, hex 0097)
Original comment by a...@lordy.org.uk
on 11 Jun 2011 at 11:10
The additional [В] has gone, probably after the fix to use the UTF* composed form?
Original comment by a...@lordy.org.uk
on 11 Jun 2011 at 11:11
The source uses an HTML escape for the em dash (—), e.g.
управления, — человек
but the page also has Russian characters, with the following charset:
<meta http-equiv="content-type" content="text/html; charset=windows-1251" />
Original comment by a...@lordy.org.uk
on 11 Jun 2011 at 11:16
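A possible reading of the numbers reported above, offered as a guess rather than a confirmed diagnosis: in windows-1251 the em dash (U+2014, decimal 8212) is the single byte 0x97, so any step that takes that byte value as a code point would produce 151 (hex 0097). A minimal Python sketch of that effect, using Python's codecs as a stand-in for the project's own tooling:

```python
# Illustrative only: reproduces the reported numbers, not the scraper's code.
EM_DASH = "\u2014"                 # decimal 8212, hex 2014

raw = EM_DASH.encode("cp1251")     # windows-1251 stores the em dash in one byte
wrong = raw.decode("latin-1")      # byte value taken directly as a code point
right = raw.decode("cp1251")       # decoded with the charset the page declares

print(raw)                         # b'\x97'
print(ord(wrong), hex(ord(wrong))) # 151 0x97
print(right == EM_DASH)            # True
```

U+0097 is a non-printing control character, which would also explain why the scraped result shows up as empty brackets.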
This issue was closed by revision r1978.
Original comment by lordylo...@gmail.com
on 12 Jun 2011 at 12:39
Fixed in r1978.
Removed all awk-based encoding. Use iconv or fail.
Replace all HTML numeric escapes before the iconv conversion, e.g. the numeric escape for —.
Changed HTML numeric escapes to convert to 8-bit rather than UTF-8 (avoids double conversion).
Original comment by a...@lordy.org.uk
on 12 Jun 2011 at 12:39
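The order of operations in that fix can be sketched roughly as follows. This is not the code from r1978: it is a Python approximation in which the codec machinery stands in for iconv, and the names `unescape_to_8bit` and `page_to_utf8` are hypothetical. Numeric escapes are first turned into bytes in the page's own 8-bit charset, and the whole page is then converted to UTF-8 exactly once.

```python
import re

# Matches decimal (&#8212;) and hexadecimal (&#x2014;) numeric escapes.
NUMERIC_ESCAPE = re.compile(rb"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def unescape_to_8bit(raw: bytes, charset: str) -> bytes:
    """Replace numeric escapes with their byte form in the page's charset."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code).encode(charset)
        except (ValueError, OverflowError):
            # Out of range, or not representable in this charset:
            # leave the escape untouched.
            return match.group(0)
    return NUMERIC_ESCAPE.sub(repl, raw)


def page_to_utf8(raw: bytes, charset: str) -> bytes:
    """Single conversion to UTF-8; raises if the bytes are invalid."""
    return unescape_to_8bit(raw, charset).decode(charset).encode("utf-8")


# Sample line from the report, with the dash written as a numeric escape.
sample = "управления, &#8212; человек".encode("cp1251")
result = page_to_utf8(sample, "cp1251")
print(result.decode("utf-8"))
```

Had the escape been expanded straight to UTF-8 bytes inside a page that is still windows-1251, the later charset conversion would re-encode those bytes a second time, which is the double conversion the fix avoids.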
Original issue reported on code.google.com by
lordylo...@gmail.com
on 12 Jan 2011 at 10:15