amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

HttpComponents Upgrade: 4.1.1 -> 4.2.1 #5

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Version 4.2.1 has better multithreading support; using a connection pool, we can 
better manage Keep-Alive and the cookie store.
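A minimal sketch of the 4.2-style setup being proposed, assuming the HttpClient 4.2 jar is on the classpath (the pool limits and class wiring here are illustrative, not values from the patch):

```java
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

public class PooledClientSetup {
    public static DefaultHttpClient newClient() {
        // PoolingClientConnectionManager is new in HttpClient 4.2 and
        // replaces ThreadSafeClientConnManager.
        PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
        cm.setMaxTotal(200);         // illustrative: total pooled connections
        cm.setDefaultMaxPerRoute(1); // illustrative: one connection per host

        DefaultHttpClient client = new DefaultHttpClient(cm);
        client.setCookieStore(new BasicCookieStore()); // shared cookie store
        return client;
    }
}
```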

This is the first step toward implementing full cookie support (which currently 
does not work; you can tell from the presence of session IDs in URLs).

Original issue reported on code.google.com by fuad.efe...@tokenizer.ca on 6 Oct 2012 at 3:25

GoogleCodeExporter commented 8 years ago
See attachments; I made two commits, so there are two patches... I don't know 
how to create a single patch...

I also implemented multithreading and a cache for the cookie store (cookies 
were not cached before!)

The patches are mistakenly named "issue6" instead of "issue5"; apply "part1", 
then "part2". Thanks,

Fuad Efendi

Original comment by fuad.efe...@tokenizer.ca on 6 Oct 2012 at 6:23

Attachments:

GoogleCodeExporter commented 8 years ago
1. LocalCookieStore was initially copied from HttpComponents' 
org.apache.http.impl.client.BasicCookieStore

2. Removed the "synchronized" modifiers

3. LocalCookieStore is used as a ThreadLocal. I hope HttpClient occasionally 
calls clearExpired(Date); this needs to be verified/implemented, otherwise it 
is a candidate for memory leaks

The main reason for this: I want multithreaded access to the single 
DefaultHttpClient instance (_httpClient) to be as fast as possible. In my 
experience that means on average 1000 threads per JVM, all accessing this 
single instance, with a single thread per internet host; see the "tokenizer" 
project at https://github.com/FuadEfendi, which uses a distributed "executor" 
framework managed by ZooKeeper that can run, for instance, a web crawler 
(backed by HBase).

The workload is I/O bound, so a large number of threads per crawler is very natural.

Another reason: support for authenticated sessions.
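The ThreadLocal pattern described above can be sketched without any HttpComponents dependency; `Cookie` below is a stand-in for `org.apache.http.cookie.Cookie`, and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.List;

public class LocalCookieStoreSketch {

    /** Stand-in for org.apache.http.cookie.Cookie; only expiry matters here. */
    static class Cookie {
        final String name;
        final Date expiry; // null = session cookie, never expires
        Cookie(String name, Date expiry) { this.name = name; this.expiry = expiry; }
        boolean isExpired(Date now) { return expiry != null && expiry.before(now); }
    }

    // One list per thread, so the hot path needs no "synchronized".
    private static final ThreadLocal<List<Cookie>> STORE =
            ThreadLocal.withInitial(ArrayList::new);

    static void addCookie(Cookie c) { STORE.get().add(c); }

    static int size() { return STORE.get().size(); }

    // Must run periodically; otherwise the per-thread stores grow without
    // bound -- the memory-leak risk mentioned in the comment.
    static boolean clearExpired(Date now) {
        boolean removed = false;
        for (Iterator<Cookie> it = STORE.get().iterator(); it.hasNext(); ) {
            if (it.next().isExpired(now)) { it.remove(); removed = true; }
        }
        return removed;
    }

    public static void main(String[] args) {
        addCookie(new Cookie("live", new Date(System.currentTimeMillis() + 60_000)));
        addCookie(new Cookie("stale", new Date(System.currentTimeMillis() - 60_000)));
        clearExpired(new Date());
        System.out.println(size()); // 1: only the unexpired cookie remains
    }
}
```

Dropping `synchronized` is safe only because each thread sees its own list; the trade-off is exactly the cleanup obligation flagged above.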

Original comment by fuad.efe...@tokenizer.ca on 7 Oct 2012 at 1:44

GoogleCodeExporter commented 8 years ago
MyRedirectStrategy contains this logic:
========================================
            // HACK - some sites return a redirect with an explicit port number
            // that's the same as the default port (e.g. 80 for http), and then
            // when you use this to make the next request, the presence of the
            // port in the domain triggers another redirect, so you fail with a
            // circular redirect error. Avoid that by converting the port
            // number to -1 in that case.

However, in this case http://www.test.com -> ... -> http://www.test.com, and 
again we end up with nothing but a CircularRedirectException:

A. http://www.test.com -> http://www.test.com:80 is proper behavior
B. http://www.test.com:80 -> http://www.test.com should throw a 
CircularRedirectException if the previous step was A

So I suggest removing this logic...
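For illustration, the normalization that the quoted comment block describes could look like this dependency-free sketch (stripDefaultPort is a hypothetical helper, not the project's actual code):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PortNormalizationSketch {

    // When a redirect target carries an explicit port equal to the scheme's
    // default (80 for http, 443 for https), rebuild the URI with port -1,
    // which java.net.URI treats as "no port".
    static URI stripDefaultPort(URI uri) throws URISyntaxException {
        int defaultPort = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
        if (uri.getPort() != defaultPort) {
            return uri; // no explicit port, or a non-default one: leave as-is
        }
        return new URI(uri.getScheme(), uri.getUserInfo(), uri.getHost(), -1,
                uri.getPath(), uri.getQuery(), uri.getFragment());
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(stripDefaultPort(new URI("http://www.test.com:80/page")));
        System.out.println(stripDefaultPort(new URI("http://www.test.com:8080/page")));
    }
}
```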

Original comment by fuad.efe...@tokenizer.ca on 7 Oct 2012 at 3:41

GoogleCodeExporter commented 8 years ago
Another use case:

http://www.test.com/MyPage -> http://www.test.com:80/MyRedirectedPage -> 
http://www.test.com/MyRedirectedPage

We can save some bandwidth if we keep this "hack"...

Original comment by fuad.efe...@tokenizer.ca on 7 Oct 2012 at 3:44

GoogleCodeExporter commented 8 years ago
This code will throw an exception for RedirectMode.FOLLOW_TEMP && 
HttpStatus.SC_MOVED_TEMPORARILY; this is a bug:

            boolean isPermRedirect = statusCode == HttpStatus.SC_MOVED_PERMANENTLY;
            if ((_redirectMode == RedirectMode.FOLLOW_NONE) || ((_redirectMode == RedirectMode.FOLLOW_TEMP) && isPermRedirect)) {
                RedirectExceptionReason reason = isPermRedirect ? RedirectExceptionReason.PERM_REDIRECT_DISALLOWED : RedirectExceptionReason.TEMP_REDIRECT_DISALLOWED;
                throw new MyRedirectException("RedirectMode disallowed redirect: " + _redirectMode, result, reason);
            }
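To make the guard easy to check per mode/status combination, here is a dependency-free re-statement of just the quoted condition (the enum and numeric status values are stand-ins for the project's types):

```java
public class RedirectGuardSketch {
    enum RedirectMode { FOLLOW_ALL, FOLLOW_TEMP, FOLLOW_NONE }

    // Numeric values of HttpStatus.SC_MOVED_PERMANENTLY / SC_MOVED_TEMPORARILY.
    static final int SC_MOVED_PERMANENTLY = 301;
    static final int SC_MOVED_TEMPORARILY = 302;

    // Re-implements only the quoted guard: true means the redirect is
    // disallowed and MyRedirectException would be thrown.
    static boolean disallowed(RedirectMode mode, int statusCode) {
        boolean isPermRedirect = statusCode == SC_MOVED_PERMANENTLY;
        return mode == RedirectMode.FOLLOW_NONE
                || (mode == RedirectMode.FOLLOW_TEMP && isPermRedirect);
    }

    public static void main(String[] args) {
        System.out.println(disallowed(RedirectMode.FOLLOW_TEMP, SC_MOVED_TEMPORARILY));
        System.out.println(disallowed(RedirectMode.FOLLOW_TEMP, SC_MOVED_PERMANENTLY));
        System.out.println(disallowed(RedirectMode.FOLLOW_NONE, SC_MOVED_TEMPORARILY));
    }
}
```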

Original comment by fuad.efe...@tokenizer.ca on 7 Oct 2012 at 3:58

GoogleCodeExporter commented 8 years ago
Comment 13: patch attached

Original comment by fuad.efe...@tokenizer.ca on 7 Oct 2012 at 4:29

Attachments:

GoogleCodeExporter commented 8 years ago
Sorry, I forgot to remove a few lines of code...

Original comment by fuad.efe...@tokenizer.ca on 7 Oct 2012 at 4:35

Attachments:

GoogleCodeExporter commented 8 years ago
Can this issue be closed? It looks like there's another issue about upgrading 
to HttpComponents 4.2.2, and I believe there was discussion on the list about 
filing separate issues for each of the other issues mentioned here.

Original comment by kkrugler...@transpac.com on 6 Nov 2012 at 4:36

GoogleCodeExporter commented 8 years ago
Yes, it can be closed; I have a total of 3 open issues for which I tried to 
provide patches:
1. SiteMap defect, http://code.google.com/p/crawler-commons/issues/detail?id=3
2. HTTP status code, http://code.google.com/p/crawler-commons/issues/detail?id=6
3. HttpComponents upgrade (a huge one; it includes some functionality 
improvements and even bug fixes), http://code.google.com/p/crawler-commons/issues/detail?id=8

So I can do 3 separate patches as per "corporate standards" ;) but the first 
two are really trivial...

Original comment by fuad.efe...@tokenizer.ca on 6 Nov 2012 at 2:58

GoogleCodeExporter commented 8 years ago
Separate issues created, as per the standard open-source project approach.

Original comment by kkrugler...@transpac.com on 6 Nov 2012 at 3:01