When a base URL contains a "?" in its path, HarvestMan does not construct
child URLs correctly.
Example:
Web page: http://razor.occams.info/code/repo/?/govtrack/sec
Child URL: ?/govtrack/sec/create_rdf.c
The full URL of the child should be
http://razor.occams.info/code/repo/?/govtrack/sec/create_rdf.c, but
HarvestMan constructs it as
http://razor.occams.info/code/repo/?/govtrack/?/govtrack/sec/create_rdf.c
>>> import urlparser
>>> h = urlparser.HarvestManUrl('http://razor.occams.info/code/repo/?/govtrack/sec')
>>> h2 = urlparser.HarvestManUrl('?/govtrack/sec/create_rdf.c', baseurl=h)
>>> print h2
http://razor.occams.info/code/repo/?/govtrack/?/govtrack/sec/create_rdf.c
The fix is to ignore the part of the base URL from the "?" onward (the "?"
included) whenever the base URL contains one, and to resolve the child URL
against what remains. I have a fix ready and am testing it to make sure it
does not introduce any regressions.
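
The idea behind the fix can be sketched outside HarvestMan as follows. This is only an illustration, not the actual patch: it uses the standard library's urljoin instead of urlparser.HarvestManUrl, and the helper names strip_from_question_mark and resolve_child are made up for this example.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2


def strip_from_question_mark(url):
    """Return url with everything from the first "?" onward removed.

    Hypothetical helper; a URL without "?" is returned unchanged.
    """
    return url.split('?', 1)[0]


def resolve_child(base_url, child_url):
    """Resolve child_url against base_url, ignoring the base's "?" part.

    Hypothetical helper standing in for the child-URL construction
    that HarvestManUrl performs when given a baseurl.
    """
    return urljoin(strip_from_question_mark(base_url), child_url)


base = 'http://razor.occams.info/code/repo/?/govtrack/sec'
child = '?/govtrack/sec/create_rdf.c'
print(resolve_child(base, child))
```

With the base and child URL from this report, the base is first cut down to http://razor.occams.info/code/repo/ and the child is then appended to that, so the "/govtrack/" segment is no longer duplicated.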
Original issue reported on code.google.com by abpil...@gmail.com on 29 Sep 2008 at 7:15