dart-lang / tools

This repository is home to tooling related Dart packages.
BSD 3-Clause "New" or "Revised" License

Unable to parse with non-UTF-8 charset #1037

Open phofman opened 4 years ago

phofman commented 4 years ago

A localized web page whose head contains the following tag is not decoded correctly:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-2" />

There are actually several problems:

  1. To trigger any content-conversion logic, HtmlParser::parse() must be called with the input given as List<int> or Uint8List. When the input is given as a string, it is always assumed to be UTF-8 encoded, which yields wrong text.
  2. The tag above is currently ignored by HtmlParser even when the input is passed as List<int>. Internally, ContentAttrParser::parse() reads the unquoted charset content as an empty string.
  3. Encoding detection assumes the tag is located within the first 512 bytes, and this limit can't be changed via any parameter, so the meta tag is still skipped in some cases.
  4. Even if the buggy behavior is fixed, the code crashes later in the _decodeBytes() method of html_input_stream.dart, as currently only UTF-8 and ASCII encodings are supported. I understand that those are the only two supported by Dart for now, but there is no way to inject a custom decoder to handle this encoding, and the code ends up with an ArgumentError.
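
For illustration, here is a minimal sketch of a possible workaround for points 2 to 4: detect the charset and decode the bytes outside the parser. It is not a fix to the package itself. `sniffCharset` and `decodeLatin2` are hypothetical helper names, and the mapping table is an assumption that covers only the Polish letters of ISO-8859-2; a real decoder would need the full table for bytes 0xA0 to 0xFF.

```dart
import 'dart:convert';

// Partial ISO-8859-2 table (Polish letters only), for illustration.
const Map<int, int> _latin2 = {
  0xA1: 0x0104, 0xB1: 0x0105, // Ą ą
  0xC6: 0x0106, 0xE6: 0x0107, // Ć ć
  0xCA: 0x0118, 0xEA: 0x0119, // Ę ę
  0xA3: 0x0141, 0xB3: 0x0142, // Ł ł
  0xD1: 0x0143, 0xF1: 0x0144, // Ń ń
  0xD3: 0x00D3, 0xF3: 0x00F3, // Ó ó
  0xA6: 0x015A, 0xB6: 0x015B, // Ś ś
  0xAC: 0x0179, 0xBC: 0x017A, // Ź ź
  0xAF: 0x017B, 0xBF: 0x017C, // Ż ż
};

/// Hypothetical helper: returns the first charset label found anywhere in
/// [bytes], quoted or unquoted, without a 512-byte limit.
String? sniffCharset(List<int> bytes) {
  // latin1 maps every byte to one code unit, so ASCII markup survives intact.
  final text = latin1.decode(bytes);
  final match = RegExp(r'''charset\s*=\s*["']?([\w.:-]+)''',
          caseSensitive: false)
      .firstMatch(text);
  return match?.group(1)?.toLowerCase();
}

/// Hypothetical helper: decodes [bytes] as ISO-8859-2 using the partial
/// table above. Unmapped non-ASCII bytes become U+FFFD.
String decodeLatin2(List<int> bytes) {
  final buffer = StringBuffer();
  for (final byte in bytes) {
    buffer.writeCharCode(byte < 0x80 ? byte : (_latin2[byte] ?? 0xFFFD));
  }
  return buffer.toString();
}

void main() {
  const tag = '<meta http-equiv="content-type" '
      'content="text/html; charset=iso-8859-2" />';
  // The ASCII tag followed by "Łódź" encoded as ISO-8859-2.
  final bytes = [...ascii.encode(tag), 0xA3, 0xF3, 0x64, 0xBC];
  print(sniffCharset(bytes)); // iso-8859-2
  print(decodeLatin2(bytes.sublist(tag.length))); // Łódź
}
```

The decoded string could then be handed to the parser as a string. That bypasses the parser's own byte decoding, assuming string input skips the content-conversion logic as point 1 describes.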
kainy commented 3 years ago

me too