Closed GoogleCodeExporter closed 8 years ago
See attachments; I did two commits - so, two patches... I don't know how to
create single patch...
I also implemented multithreading & Cache for Cookie Store (cookies were not
cached before!!!)
Mistakenly patches named "issue6" instead of "issue5"; apply "part1", then
"part2". Thanks,
Fuad Efendi
Original comment by fuad.efe...@tokenizer.ca
on 6 Oct 2012 at 6:23
Attachments:
[deleted comment]
[deleted comment]
[deleted comment]
1. LocalCookieStore was initially from
HttpComponents#org.apache.http.impl.client.BasicCookieStore
2. Removed "synchronized" modifiers
3. LocalCookieStore is used as ThreadLocal. I hope HttpClient calls sometimes
"clearExpired(Date)"; need to verify/implement, otherwise it is candidate for
memory leaks
Main reason for that: I want to make multithreaded access to (DefaultHttpClient
_httpClient) as fast as possible; from my experience, 1000 threads in average
per JVM, accessing this single instance, single thread per single internet host
-see "tokenizer" project at https://github.com/FuadEfendi -it uses distributed
"executor" framework managed by ZooKeeper which can run for instance web
crawler (backed by HBase)
I/O bound, so that big number of threads per Crawler is very natural.
Another reason: support for authenticated sessions.
Original comment by fuad.efe...@tokenizer.ca
on 7 Oct 2012 at 1:44
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
MyRedirectStrategy contains this logic:
========================================
// HACK - some sites return a redirect with an explicit port number
// that's the same as
// the default port (e.g. 80 for http), and then when you use this
// to make the next
// request, the presence of the port in the domain triggers another
// redirect, so you
// fail with a circular redirect error. Avoid that by converting the
// port number to
// -1 in that case.
However, in this case http://www.test.com -> ... -> http://www.test.com and
again we have nothing but CircularRedirectException
A. http://www.test.com -> http://www.test.com:80 is proper behavior
B http://www.test.com:80 -> http://www.test.com should throw
CircularRedirectException if previous step was A
So that I suggest to remove this logic...
Original comment by fuad.efe...@tokenizer.ca
on 7 Oct 2012 at 3:41
Another use case:
http://www.test.com/MyPage -> http://www.test.com:80/MyRedirectedPage ->
http://www.test.com/MyRedirectedPage
We can save some bandwidth if we keep this "hack"...
Original comment by fuad.efe...@tokenizer.ca
on 7 Oct 2012 at 3:44
This code will throw exception for RedirectMode.FOLLOW_TEMP &&
HttpStatus.SC_MOVED_TEMPORARILY; this is a bug
boolean isPermRedirect = statusCode == HttpStatus.SC_MOVED_PERMANENTLY;
if ((_redirectMode == RedirectMode.FOLLOW_NONE) || ((_redirectMode == RedirectMode.FOLLOW_TEMP) && isPermRedirect)) {
RedirectExceptionReason reason = isPermRedirect ? RedirectExceptionReason.PERM_REDIRECT_DISALLOWED : RedirectExceptionReason.TEMP_REDIRECT_DISALLOWED;
throw new MyRedirectException("RedirectMode disallowed redirect: " + _redirectMode, result, reason);
}
Original comment by fuad.efe...@tokenizer.ca
on 7 Oct 2012 at 3:58
Comment13: patch attached
Original comment by fuad.efe...@tokenizer.ca
on 7 Oct 2012 at 4:29
Attachments:
sorry forgot to remove few lines of code...
Original comment by fuad.efe...@tokenizer.ca
on 7 Oct 2012 at 4:35
Attachments:
Can this issue be closed? It looks like there's another issue about upgrading
to HttpComponents 4.2.2, and I believe there was discussion on the list about
filing separate issues for each of the other issues mentioned here.
Original comment by kkrugler...@transpac.com
on 6 Nov 2012 at 4:36
yes it can be closed; I have total 3 open issues which I tried to provide
patches:
1. SiteMap defect, http://code.google.com/p/crawler-commons/issues/detail?id=3
2. HTTP status code, http://code.google.com/p/crawler-commons/issues/detail?id=6
3. HttpComponents Upgrade (huge one; includes some functionality improvements
and even bug fixes.) http://code.google.com/p/crawler-commons/issues/detail?id=8
So I can do 3 separate patches as per "corporate standards" ;) but first two
are really trivial...
Original comment by fuad.efe...@tokenizer.ca
on 6 Nov 2012 at 2:58
Separate issues created as per standard open source project approach.
Original comment by kkrugler...@transpac.com
on 6 Nov 2012 at 3:01
Original issue reported on code.google.com by
fuad.efe...@tokenizer.ca
on 6 Oct 2012 at 3:25