HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0
863 stars 171 forks

What's the philosophy of HtmlUnit when a response contains a header "Content-Type: application/octet-stream" #611

Closed qurikuduo closed 1 year ago

qurikuduo commented 1 year ago

Hi there,

Some URLs return a response with the header "Content-Type: application/octet-stream". Should I process it as an attachment? After some digging, I found that the attachment handling only covers the specific responses defined in RFC 2183: attachmentHandler_.isAttachment(webResponse) returns false for "application/octet-stream". Instead, org.htmlunit.HttpWebConnection.downloadContent() is called:

public static DownloadedContent downloadContent(final InputStream is, final int maxInMemory)

It downloads the response content. If I DON'T want HtmlUnit to download big content (e.g. https://dg.10000gd.tech:12348/shmfile/100), what should I do? I want to block the download if a resource is larger than 20MB, to save bandwidth.

Thanks a lot.

rbri commented 1 year ago

Maybe a simple solution is to set up your own WebConnectionWrapper and intercept the request URLs. For the large ones, don't call super and simply return a static response.

See https://www.htmlunit.org/faq.html#HowToModifyRequestOrResponse as a starting point.
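
A minimal sketch of the wrap-and-short-circuit idea described above, written against plain JDK types so that it runs without HtmlUnit on the classpath. `SimpleConnection` and `BlockingConnection` are hypothetical stand-ins and not HtmlUnit classes; in HtmlUnit the same shape would presumably live in a WebConnectionWrapper subclass whose getResponse() either calls super or returns a canned response.

```java
import java.net.URI;
import java.util.function.Predicate;

/** Hypothetical stand-in for a connection; NOT HtmlUnit's WebConnection. */
interface SimpleConnection {
    String fetch(URI url);
}

/** Decorator that answers blocked URLs itself and delegates everything else. */
final class BlockingConnection implements SimpleConnection {
    private final SimpleConnection delegate;
    private final Predicate<URI> isBlocked;
    private final String placeholder;

    BlockingConnection(SimpleConnection delegate, Predicate<URI> isBlocked, String placeholder) {
        this.delegate = delegate;
        this.isBlocked = isBlocked;
        this.placeholder = placeholder;
    }

    @Override
    public String fetch(URI url) {
        if (isBlocked.test(url)) {
            // Blocked: return the static response and never call the delegate.
            return placeholder;
        }
        return delegate.fetch(url);
    }
}
```

This only works when the large resources can be recognised from the URL alone (here, by a caller-supplied predicate); if the size is only known from the response headers, the check has to happen after the headers arrive.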

rbri commented 1 year ago

I will try to write a more detailed description ...

qurikuduo commented 1 year ago

Sounds like an option.

  1. Specify my own WebConnectionWrapper.
  2. Try to get the Content-Length defined in the response headers.
  3. If Content-Length is not defined, write my own HttpWebConnection (implementing the WebConnection interface) and decide inside public static DownloadedContent downloadContent() whether the response body is too large and should be blocked, in the read loop: while ((readCount = is.read(buffer)) != -1) { //... }

Is that a solution? Thanks.
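
The read loop from step 3 can be sketched with JDK classes only. This is an illustration under assumptions, not HtmlUnit code: `CappedDownload`, `readAtMost` and `BodyTooLargeException` are made-up names, and the limit is a plain byte count chosen by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class CappedDownload {

    /** Signals that the body grew past the configured limit. */
    static final class BodyTooLargeException extends IOException {
        BodyTooLargeException(long maxBytes) {
            super("Response body exceeds " + maxBytes + " bytes");
        }
    }

    /**
     * Reads the stream fully, but gives up as soon as more than
     * maxBytes have been read, so the rest is never transferred.
     */
    static byte[] readAtMost(InputStream in, long maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        long total = 0;
        int readCount;
        // read() returns -1 at end of stream, so compare against -1, not 0
        while ((readCount = in.read(buffer)) != -1) {
            total += readCount;
            if (total > maxBytes) {
                throw new BodyTooLargeException(maxBytes);
            }
            out.write(buffer, 0, readCount);
        }
        return out.toByteArray();
    }
}
```

Counting bytes while reading also covers responses that carry no Content-Length header, at the cost of having already transferred up to the limit before the download is abandoned.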

qurikuduo commented 1 year ago

After trying a few small tricks, I achieved the functionality I wanted. Here is what I did:

  1. Specify my own web connection class, copied from HttpWebConnection and placed in package org.htmlunit:

    public class MyxxHttpWebConnection extends HttpWebConnection

    Override public WebResponse getResponse(final WebRequest webRequest), read the content length via httpResponse.getFirstHeader(ContentLength).getValue(), and determine whether it is too large:

    if (contentLengthLong > maxContentLength) {
        System.out.println("Content is too big. url=" + webRequest.getUrl().toString()
                + " contentLength = " + contentLengthLong
                + ", maxContentLength = " + maxContentLength);
        httpMethod.abort();
        httpResponse.setEntity(null);
    }

  2. Specify my own AttachmentHandler:

    public class MyxxAttachmentHandler implements AttachmentHandler

    @Override
    public void handleAttachment(final Page page) {
        // do not download attachments larger than 100KB
        if (page.getWebResponse().getContentLength() > maxAttachmentSize) {
            System.out.println("Attachment is too big. url=" + page.getUrl()
                    + " contentLength = " + page.getWebResponse().getContentLength()
                    + ", maxAttachmentSize = " + maxAttachmentSize);
            try {
                page.getEnclosingWindow().getWebClient().getWebConnection().close();
            } catch (Exception e) {
                logger.error("Error when close attachment download.", e);
            } finally {
                try {
                    page.getWebResponse().cleanUp();
                    // new AbstractPage(page.getWebResponse(), page.getEnclosingWindow());
                    page.getEnclosingWindow().setEnclosedPage(new HtmlPage(
                            createWebResponse(
                                    new WebRequest(page.getUrl(), page.getWebResponse().getWebRequest().getHttpMethod()),
                                    "",
                                    page.getWebResponse().getContentType(),
                                    page.getWebResponse().getStatusCode(),
                                    page.getWebResponse().getStatusMessage()),
                            page.getEnclosingWindow()));
                } catch (Exception e) {
                    logger.error("Error when close attachment download.", e);
                }
                return;
            }
        } else {
            // if not response
            collectedAttachments_.add(new Attachment(page));
        }
    }

  3. Create the new instances before calling getPage(url):

    webClient.setAttachmentHandler(new MyxxAttachmentHandler(attachmentList));
    new WebConnectionWrapper(webClient) {
        @Override
        public WebResponse getResponse(WebRequest request) throws IOException {
            MyxxHttpWebConnection webConnection = new MyxxHttpWebConnection(webClient);
            return webConnection.getResponse(request);
        }
    };
    page = webClient.getPage(url);
    // download the collected attachments
    for (Attachment attachment : attachmentList) {
        long contentLength = attachment.getPage().getWebResponse().getContentLength();
        if (contentLength == 0 || contentLength > MyxxAttachmentHandler.maxAttachmentSize) {
            System.out.println("attachment too large, will not save to disk. contentLength = " + contentLength);
            continue;
        }
        // save attachment to file.
    }

It works for me now.
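
The Content-Length check in step 1 can be isolated into a small helper. The following is a sketch with JDK classes only; `ContentLengthGuard`, `parse` and `shouldBlock` are illustrative names, and the policy of letting responses through when the header is missing or malformed is an assumption, not something the steps above prescribe. It matters because Apache HttpClient's getFirstHeader() returns null when the header is absent, so calling getValue() on the result directly would fail for such responses.

```java
import java.util.OptionalLong;

final class ContentLengthGuard {

    /** Parses a Content-Length header value; empty if missing, malformed or negative. */
    static OptionalLong parse(String headerValue) {
        if (headerValue == null) {
            return OptionalLong.empty();
        }
        try {
            long value = Long.parseLong(headerValue.trim());
            return value >= 0 ? OptionalLong.of(value) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** True only when the declared length is known and larger than maxBytes. */
    static boolean shouldBlock(String headerValue, long maxBytes) {
        OptionalLong declared = parse(headerValue);
        return declared.isPresent() && declared.getAsLong() > maxBytes;
    }
}
```

A call such as `ContentLengthGuard.shouldBlock(headerValue, 20L * 1024 * 1024)` would then replace the inline comparison; responses without a usable header still need a limit enforced while reading the body.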

rbri commented 1 year ago

Hi @qurikuduo,

I am slowly getting an idea of what you would like to do. I made some small changes, and now I can do something like this:

@Test
public void contentBlocking() throws Exception {
    final byte[] content = new byte[] {77, 44};
    final List<NameValuePair> headers = new ArrayList<>();
    headers.add(new NameValuePair("Content-Encoding", "gzip"));
    headers.add(new NameValuePair(HttpHeader.CONTENT_LENGTH, String.valueOf(content.length)));

    final MockWebConnection conn = getMockWebConnection();
    conn.setResponse(URL_FIRST, content, 200, "OK", MimeType.APPLICATION_JSON, headers);

    startWebServer(getMockWebConnection());

    final WebClient client = getWebClient();
    client.setWebConnection(new HttpWebConnection(client) {
        @Override
        protected WebResponse downloadResponse(final HttpUriRequest httpMethod,
                final WebRequest webRequest, final HttpResponse httpResponse,
                final long startTime) {

            // check the header here if you like
            // call return super.downloadResponse() in case you are happy with the headers

            httpMethod.abort();

            // create empty response and mark as blocked for later
            final DownloadedContent downloaded = new DownloadedContent.InMemory(null);
            final long endTime = System.currentTimeMillis();
            final WebResponse response = makeWebResponse(httpResponse, webRequest, downloaded, endTime - startTime);
            response.markAsBlocked("test blocking");
            return response;
        }
    });

    final UnexpectedPage page = client.getPage(URL_FIRST);
    assertTrue(page.getWebResponse().wasBlocked());
    assertEquals("test blocking", page.getWebResponse().getBlockReason());
}

Will this help to simplify your code? Do you need any other changes for your case?

rbri commented 1 year ago

@qurikuduo I just made a new snapshot build, please try it:

3.4.0-SNAPSHOT

rbri commented 1 year ago

I have updated the documentation a bit: https://www.htmlunit.org/details.html I hope that helps.

rbri commented 1 year ago

I will close this; I hope the changes and the documentation are sufficient.

qurikuduo commented 1 year ago

Thank you very much.