ariya / phantomjs

Scriptable Headless Browser
http://phantomjs.org
BSD 3-Clause "New" or "Revised" License

Custom HTTP header fields order #11859

Closed iddl closed 8 years ago

iddl commented 10 years ago

According to RFC2616 (http://www.w3.org/Protocols/rfc2616/rfc2616.html)

the order in which header fields with differing field names are received is not significant

however, some implementations may pay particular attention to the order of these fields.

Are there any plans to support custom orderings?

E.g., have "Connection: Keep-Alive" come before "Accept-Encoding: gzip".

A webpage property similar to this would probably work: page.headerFieldsOrder = ["Accept", "Accept-Language", "Host", "Connection"...];

Thanks

JamesMGreene commented 10 years ago

What benefit does it have?

iddl commented 10 years ago

It may not be a critical enhancement but it would likely improve the flexibility of the tool.

I have been fiddling around with PhantomJS and CDNs and found that some services, such as Incapsula, may be looking at the order of HTTP request headers, and not only at their values, to determine the type of browser.

Here are the images of two GET requests, the first made by Firefox 26 and the second by phantomjs using customHeaders to mimic Firefox.

Firefox:

[image: firefox]

Phantomjs with customHeaders:

[image: phantomjs]

Below is the code I used to set the headers. Some of the field values may not be compatible; however, my goal was to get two identical HTTP responses from the server.

page.customHeaders = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive"
};

Given the same values for the headers, I would expect identical responses, but for some reason this doesn't happen. The order of the fields may make a difference.
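To make the suspicion concrete, here is a minimal sketch of how a server could tell two requests apart by field order alone, even when the names and values are identical. It is only an assumption about how such a check might work, not Incapsula's actual logic. The helper names (headerOrder, sameHeaderSet, sameHeaderOrder), the host example.com, and both orderings are hypothetical; they are not captured traffic.

```javascript
// Sketch only: the two orderings below are made up for illustration.

// Return the header names, lower-cased, in the order they were received.
function headerOrder(headerPairs) {
    return headerPairs.map(function (pair) {
        return pair[0].toLowerCase();
    });
}

// True when both requests carry the same name/value pairs, ignoring order.
function sameHeaderSet(a, b) {
    function normalize(pairs) {
        return pairs.map(function (pair) {
            return pair[0].toLowerCase() + ": " + pair[1];
        }).sort().join("\n");
    }
    return normalize(a) === normalize(b);
}

// True when the header names also appear in the same sequence.
function sameHeaderOrder(a, b) {
    return headerOrder(a).join(",") === headerOrder(b).join(",");
}

// Hypothetical request A: values taken from the customHeaders object above.
var requestA = [
    ["Host", "example.com"],
    ["User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"],
    ["Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"],
    ["Accept-Language", "en-US,en;q=0.5"],
    ["Accept-Encoding", "gzip, deflate"],
    ["Connection", "keep-alive"]
];

// Hypothetical request B: exactly the same pairs, shuffled.
var requestB = [
    requestA[1], requestA[2], requestA[5],
    requestA[4], requestA[3], requestA[0]
];

console.log(sameHeaderSet(requestA, requestB));   // true
console.log(sameHeaderOrder(requestA, requestB)); // false
```

A server that keys on the second check would treat the two requests differently even though customHeaders reproduces every value.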

Thoughts?

ghost commented 10 years ago

I also find this feature useful. Any plans to add it soon?

RobinDev commented 9 years ago

+1

Kirzilla commented 9 years ago

+1 Without the ability to modify the order of HTTP headers, it is impossible to fetch sites protected with Incapsula.

xckon commented 9 years ago

+1

dufoli commented 8 years ago

+1

GunsAkimbo commented 8 years ago

I have successfully made PhantomJS fetch a page from an Incapsula-protected site (www.enjin.com) by modifying two files:

  1. qhttpnetworkrequest.cpp (fixed the HTTP header field ordering)
  2. webpage.cpp (renamed the phantom-callback object so that the Incapsula JS test does not find it)

gist (including example script): https://gist.github.com/GunsAkimbo/aa6ac81bd55dd1802637

It's not pretty, just a proof of concept of the changes needed to avoid being stopped by Incapsula.

zackw commented 8 years ago

So in principle we are interested in making changes like these. However:

GunsAkimbo commented 8 years ago

  1. Nope, rejected with reason "Out of scope": https://bugreports.qt.io/browse/QTBUG-49659
  2. I know, it was not meant as a real patch, just something quick and dirty if someone really wanted to compile something that "works".

yegors commented 8 years ago

I'm using PhantomJS for a website screenshotting service, and I'm unable to do much on Incapsula-protected sites. This is clearly an issue for multiple people; I don't fully understand why it's not being addressed.

GunsAkimbo commented 8 years ago

@yegors Well, PJS uses an external library for network communication (Qt), and I think the problem lies partly there, which means we cannot fix that in this repo. The other part is to hide the global object with the fixed name "_phantom", or to have the ability to rename it at runtime before loading a page.
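For illustration, a page-side check for those fixed names could look roughly like the sketch below. This is a guess at the general shape of such a test, not Incapsula's actual script. The function name looksLikePhantom is hypothetical, and the check takes the global object as a parameter only so that it can be tried outside a browser.

```javascript
// Hypothetical detection check; NOT Incapsula's actual code.
// "_phantom" is the fixed global name mentioned above; "callPhantom" is the
// page-side function that PhantomJS pairs with page.onCallback.
function looksLikePhantom(globalObject) {
    return typeof globalObject._phantom !== "undefined" ||
           typeof globalObject.callPhantom !== "undefined";
}

// Stand-ins for the page's global object, for demonstration only.
var phantomLikeGlobal = { _phantom: {}, callPhantom: function () {} };
var ordinaryGlobal = {};

console.log(looksLikePhantom(phantomLikeGlobal)); // true
console.log(looksLikePhantom(ordinaryGlobal));    // false
```

In a real page the argument would be window, which is why renaming or hiding the object before the page loads would defeat a check of this kind.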

You could try to compile a build yourself, with the changes mentioned in the gist I linked to a few posts back. Those changes made it possible to pass through the protection, but I'm not sure if both changes were necessary.

As an alternative, you could look into https://phantomjscloud.com/site/index.html (I have had good results using this service for Incapsula-protected sites).

yegors commented 8 years ago

@GunsAkimbo It's pretty unfortunate that Qt refuses to fix it on their end. A custom compile seems like the best (only) option at this point, as a third-party service is out of the question for our applications. Will try your patch and see what happens.

maximilianh commented 8 years ago

Whew, thanks guys for documenting this! I would have wasted hours trying to support Incapsula. Will move my scripts to Firefox/Selenium now.

opahopa commented 8 years ago

@GunsAkimbo I'm trying to implement your fix. Where is qhttpnetworkrequest.cpp located? I can't find it in this repo.

GunsAkimbo commented 8 years ago

@opahopa That file belongs to the Qt repo. It used to be referenced in the .gitmodules file; judging by the history, I guess that has been changed.

opahopa commented 8 years ago

@GunsAkimbo any idea how to change the headers order now?

vitallium commented 8 years ago

Since this problem was marked as out of scope by Qt, I believe we can close it too, because future versions of PhantomJS will use the system-installed (or original) version of Qt.

Also, the RFC states that the order of HTTP headers doesn't matter.

Thanks!

maximilianh commented 8 years ago

The order does actually matter. Some anti-bot systems use it to identify phantom and block requests with a particular order.

On Sep 9, 2016 8:02 AM, "Vitaly Slobodin" notifications@github.com wrote:

Since this problem was marked as out of scope by Qt, I believe we can close it too, because future versions of PhantomJS will use the system-installed (or original) version of Qt.

Also, the RFC states that the order of HTTP headers doesn't matter.

Thanks!


vitallium commented 8 years ago

Yes, I know that. But the problem is that implementing this feature requires a custom version of Qt. We want to move away from the custom (patched) version to the original version.

annulen commented 8 years ago

@maximilianh You can put PhantomJS behind a proxy which reorders headers as you want. Such a proxy could be implemented in any language without much trouble, or maybe there is an existing solution.
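The core of such a proxy, for plain HTTP only, might be sketched as follows: take the header pairs the client sent, move the ones you care about into a chosen sequence, and serialize the request head yourself before writing it upstream. This is a sketch under assumptions, not a tested proxy. The names reorderHeaders and buildRequestHead, the sample headers, and the desired order are all hypothetical.

```javascript
// Move the headers named in `order` to the front, in that sequence. Headers
// not listed keep their original relative order. Matching is case-insensitive.
// `pairs` is an array of [name, value] arrays.
function reorderHeaders(pairs, order) {
    var rank = Object.create(null);
    order.forEach(function (name, i) {
        rank[name.toLowerCase()] = i;
    });
    function rankOf(pair) {
        var key = pair[0].toLowerCase();
        return key in rank ? rank[key] : order.length;
    }
    // Carry the original index along so ties keep their incoming order,
    // even on engines whose Array.prototype.sort is not stable.
    return pairs
        .map(function (pair, i) { return { pair: pair, index: i }; })
        .sort(function (a, b) {
            return (rankOf(a.pair) - rankOf(b.pair)) || (a.index - b.index);
        })
        .map(function (entry) { return entry.pair; });
}

// Serialize a request line plus ordered headers into the text a proxy
// would write to the upstream socket.
function buildRequestHead(method, path, pairs) {
    var lines = [method + " " + path + " HTTP/1.1"];
    pairs.forEach(function (pair) {
        lines.push(pair[0] + ": " + pair[1]);
    });
    return lines.join("\r\n") + "\r\n\r\n";
}

// Hypothetical headers as received from the client.
var incoming = [
    ["User-Agent", "ExampleAgent/1.0"],
    ["Accept", "*/*"],
    ["Accept-Encoding", "gzip"],
    ["Connection", "keep-alive"],
    ["Host", "example.com"]
];

// Hypothetical target order: Connection before Accept-Encoding, Host first.
var reordered = reorderHeaders(incoming, ["Host", "Connection", "Accept-Encoding"]);
var head = buildRequestHead("GET", "/", reordered);
console.log(head);
```

In Node.js, the incoming pairs could be built from req.rawHeaders (a flat array of alternating names and values), and the resulting head written to a socket opened with the net module. Writing the bytes directly avoids depending on how a higher-level HTTP client orders headers.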

palnabarun commented 7 years ago

@annulen Do you know of any such proxy service providers or proxy libraries in Python/Ruby/NodeJS which can reorder the headers? I have tried many libraries which can modify the headers but they cannot reorder them. Any help is appreciated. Thanks.

annulen commented 7 years ago

Is header reordering required for HTTPS, or is plain HTTP enough?

palnabarun commented 7 years ago

@annulen: It would be better to have it for both, if possible. Otherwise plain HTTP is also OK.

annulen commented 7 years ago

For HTTPS it would require "bumping" SSL connections, which would significantly complicate the proxy's code, even if we don't consider things like using client certificates or validating the server certificate on the client side. If HTTPS is needed, it is indeed much easier to fix the order on the client side, i.e. patch Qt.

If you are only concerned with the position of the Host header, it would be better to write a patch for https://bugreports.qt.io/browse/QTBUG-51557 (it will be accepted).

mikeevstropov commented 6 years ago

Do we have any solution other than a proxy?

annulen commented 6 years ago

Fix the code, seriously.

There are 2 independent issues here:

annulen commented 6 years ago

Update: the patch for QTBUG-51557 will be included in Qt 5.10.1, see https://codereview.qt-project.org/#/c/216980/.

timng commented 6 years ago

Hello, I implemented @GunsAkimbo's concept to bypass Incapsula. You can download PhantomJS at https://drive.google.com/drive/folders/1Y0XqQ89hQUhDj9_EPW-kja8V1vX4Catf?usp=sharing

There are 2 files: 1 for Windows and 1 for Linux.