LWP::UserAgent uses IO::HTML to determine an HTML page's encoding and then calls the mime_name() method of the returned Encode::Encoding object.
This fails for Chinese web pages encoded in GB2312. IO::HTML determines the encoding of such pages to be "euc-cn", but that name is not mapped to "EUC-CN" in Encode::MIME::Name, so mime_name() returns undef and the subsequent content decoding fails.
To reproduce:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use LWP::UserAgent;
my $url = 'https://sandbox.pypt.lt/gb2312.html';
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
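if ($response->is_success) {
    # content_charset() returns undef here; it should return "EUC-CN".
    # The "//" fallback avoids an "uninitialized value" warning when printing.
    say "Charset: " . ($response->content_charset() // 'undef');
    say $response->decoded_content(); # garbled content; should be decoded Chinese text
} else {
    die $response->status_line;
}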
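The lookup can also be checked without network access. The following is only a minimal sketch: it assumes that the label "gb2312" (what such pages typically declare) resolves to Encode's "euc-cn" encoding object, and it prints whatever mime_name() returns on the installed Encode version rather than assuming a particular result.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use Encode ();

# Resolve the charset label to an Encode::Encoding object.
my $enc = Encode::find_encoding('gb2312')
    or die "Encode does not recognise the label 'gb2312'";

# Encode's internal name for the encoding.
say "Encode name: " . $enc->name;

# mime_name() looks the internal name up in Encode::MIME::Name;
# it returns undef when no mapping exists.
say "MIME name:   " . ( $enc->mime_name // 'undef' );
```

If the second line prints "undef", the installed Encode lacks the mapping described in this report.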
Adding a simple 'euc-cn' => 'EUC-CN' mapping to Encode::MIME::Name fixes the problem.
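Until such a mapping is available in the installed Encode, a caller could work around the gap on its own side. The sketch below is an assumption-laden illustration, not the proposed patch: %FALLBACK_MIME_NAME and mime_name_for() are hypothetical names introduced here, and the helper simply falls back to a local table when mime_name() returns undef.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use Encode ();

# Hypothetical local table for Encode names that Encode::MIME::Name
# may not map on older installations.
my %FALLBACK_MIME_NAME = ( 'euc-cn' => 'EUC-CN' );

# Hypothetical helper: return a MIME charset name for a charset label,
# or undef if the label is unknown or no name can be found.
sub mime_name_for {
    my ($label) = @_;
    my $enc = Encode::find_encoding($label) or return;
    return $enc->mime_name // $FALLBACK_MIME_NAME{ $enc->name };
}

say mime_name_for('gb2312') // 'undef';
```

The returned name can then be passed to Encode::decode() to decode the raw response content manually.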