Letractively / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler
0 stars 0 forks source link

URL construction error for base URLs containing ? in path #24

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
For base URLs containing a "?" in path, HarvestMan does not construct
child URLs correctly.

Example:

Web page: http://razor.occams.info/code/repo/?/govtrack/sec

Child URL: ?/govtrack/sec/create_rdf.c

The full URL of child should be
http://razor.occams.info/code/repo/?/govtrack/sec/create_rdf.c, but
HarvestMan constructs it as,
http://razor.occams.info/code/repo/?/govtrack/?/govtrack/sec/create_rdf.c

>>> import urlparser
>>> h =
urlparser.HarvestManUrl('http://razor.occams.info/code/repo/?/govtrack/sec')
>>> h2 = urlparser.HarvestManUrl('?/govtrack/sec/create_rdf.c', baseurl=h)
>>> print h2
http://razor.occams.info/code/repo/?/govtrack/?/govtrack/sec/create_rdf.c

The fix is to ignore all parts of the URL after and including "?" if a base
URL contains "?" inside the URL. I have a fix ready for this and I am doing
testing to make sure we don't have any regressions.

Original issue reported on code.google.com by abpil...@gmail.com on 29 Sep 2008 at 7:15

GoogleCodeExporter commented 8 years ago
Changes checked into trunk and unit-tests added to test_urlparser.py. Closing 
the bug.

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:19