bixo / bixo

Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications.
http://openbixo.org

Measure performance during large-scale crawl using minimal HttpClient and lockless connection manager #54

Open kkrugler opened 11 years ago

kkrugler commented 11 years ago

Oleg had said:

You might want to have a look at the latest code in SVN trunk (to be released as 4.3). Several classes such as the scheme registry that previously had to be synchronized in order to ensure thread safety have been replaced with immutable equivalents. There is also now a way to create HttpClient in a minimal configuration without authentication, state management (cookies), proxy support and other non-essential functions.

The new API is not yet final and not properly documented. Presently this can be done with HttpClients#createMinimal.

He also said:

I experimented with the idea of lock-less (unlimited) connection manager

This was in response to an issue I'd run into, where the single global lock on the connection pool was causing a lot of contention when many threads (hundreds) were all fetching at the same time. He's provided source code, which unfortunately I can't attach here, but it's on my disk.
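
To make the lock-less (unlimited) idea concrete, here is a minimal sketch using only the JDK. It is not Oleg's code and not an HttpClient class: `LockFreePool`, `lease`, `release` and `createdCount` are hypothetical names, and the pooled type is generic rather than a real connection. It only illustrates the approach, assuming idle connections sit in a `ConcurrentLinkedQueue` so that no thread ever waits on a single global lock.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Hypothetical unbounded pool with no global lock.
 * Illustration only; this is not the HttpClient connection manager.
 */
final class LockFreePool<T> {
    // ConcurrentLinkedQueue uses compare-and-swap internally, so poll() and
    // offer() never block the calling thread on a monitor.
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger();

    LockFreePool(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Reuses an idle item if one exists, otherwise creates a new one. Never waits. */
    T lease() {
        T item = idle.poll();
        if (item == null) {
            // "Unlimited": there is no maximum, so a miss always creates a new item.
            created.incrementAndGet();
            item = factory.get();
        }
        return item;
    }

    /** Returns an item to the idle queue so a later lease() can reuse it. */
    void release(T item) {
        idle.offer(item);
    }

    /** Number of items created so far, useful when measuring a crawl. */
    int createdCount() {
        return created.get();
    }
}
```

The trade-off in this sketch is that nothing caps the number of open connections, so any per-host or total limit would have to be enforced elsewhere, for example by the fetcher's own thread and politeness settings.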