MachinePublishers / jBrowserDriver

A programmable, embeddable web browser driver compatible with the Selenium WebDriver spec -- headless, WebKit-based, pure Java

Fix wrong encoding for getPageSource #233

Open andy-betin opened 7 years ago

andy-betin commented 7 years ago

I would like to propose a possible fix for incorrect encoding detection on some sites.

With the dependency

com.googlecode.juniversalchardet:juniversalchardet:1.0.3

you can fix com.machinepublishers.jbrowserdriver.Util with the following code, which auto-detects the encoding:

static String charset(URLConnection conn) {
  String charset = conn.getContentType();
  if (charset != null) {
    Matcher matcher = charsetPattern.matcher(charset);
    if (matcher.find()) {
      charset = matcher.group(1);
      if (Charset.isSupported(charset)) {
        return charset;
      }
    }
  }
  try {
    return checkCharset(conn.getInputStream());
  } catch (Exception e) {
    return "utf-8";
  }
}

public static String checkCharset(InputStream is) {
  try {
    UniversalDetector detector = new UniversalDetector(null);
    byte[] buf = new byte[4096];
    int nread;
    // Copy the bytes as they are read, because the stream does not support reset().
    ByteArrayOutputStream copyNext = new ByteArrayOutputStream();
    while ((nread = is.read(buf)) > 0) {
      copyNext.write(buf, 0, nread);
      if (!detector.isDone()) {
        detector.handleData(buf, 0, nread);
      } else {
        break;
      }
    }
    detector.dataEnd();
    is.close();
    copyNext.close();
    // getDetectedCharset() returns null when no encoding could be detected.
    String detected = detector.getDetectedCharset();
    return detected != null ? detected : "utf-8";
  } catch (Exception e) {
    return "utf-8";
  }
}
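
The charsetPattern field used by charset() is not shown in this post. As a rough sketch of the header-parsing step only, the following assumes a hypothetical pattern that captures the charset parameter of a Content-Type value; the class name, method name, and regular expression are illustrative and are not taken from jBrowserDriver.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderCharset {
  // Hypothetical pattern; the real charsetPattern in Util is not shown in this thread.
  private static final Pattern CHARSET_PATTERN =
      Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

  /** Returns the supported charset named in a Content-Type value, or null if there is none. */
  static String fromContentType(String contentType) {
    if (contentType == null) {
      return null;
    }
    Matcher matcher = CHARSET_PATTERN.matcher(contentType);
    if (!matcher.find()) {
      return null;
    }
    String name = matcher.group(1);
    try {
      return Charset.isSupported(name) ? name : null;
    } catch (IllegalArgumentException e) {
      // Charset.isSupported throws IllegalCharsetNameException for syntactically illegal names.
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(fromContentType("text/html; charset=ISO-8859-1"));
    System.out.println(fromContentType("text/html"));
  }
}
```

A null result here corresponds to the case where charset() falls through to checkCharset().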

hollingsworthd commented 7 years ago

Noted, and thanks for the example! I'm not keen on adding more dependencies (I would like to remove some if possible), but this will be a good starting point to look into the issue and see if we can get comparable functionality.
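
For illustration only, one dependency-free approximation might look like the sketch below. It is not the project's fix and covers far fewer encodings than juniversalchardet: it checks for a byte-order mark, then tests whether the bytes decode as strict UTF-8, and otherwise returns a caller-supplied fallback. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SimpleCharsetGuess {
  /**
   * Guesses a charset name for a response body. Pass the whole body: a prefix
   * that cuts a multi-byte character in half fails the strict UTF-8 check.
   * UTF-32 byte-order marks are not handled.
   */
  static String guess(byte[] data, String fallback) {
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
    if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
    if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
    try {
      // REPORT makes the decoder throw instead of silently substituting characters.
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return "UTF-8";
    } catch (CharacterCodingException e) {
      return fallback;
    }
  }

  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) return false;
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
    byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(guess(utf8, "ISO-8859-1"));
    System.out.println(guess(latin1, "ISO-8859-1"));
  }
}
```

The trade-off is that any single-byte legacy encoding is reported as the fallback, so the result is only as good as that default.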