Closed: barryhunter closed this issue 2 years ago
That's strange. Will that page be available for a while? I'll try to debug this when I get a chance (hopefully this weekend)
On Fri, Apr 8, 2022 at 12:39 PM barryhunter @.***> wrote:
I've got a strange issue with Siege fetching a URL that is not in the source of the page.
It can be reproduced with a single '--print' request...
$ siege -p https://www.geograph.org.uk/photo/9 | grep Lane
It shows a fetch to /photo/Lane, which doesn't exist...
GET /photo/Lane, HTTP/1.0
Transactions: 2 hits
Availability: 100.00 %
Elapsed time: 0.05 secs
A normal run (without -p) shows it as a 404:
HTTP/1.1 404 0.08 secs: 2322 bytes ==> GET /photo/Lane,
The word 'Lane' does appear in the page in lots of places, but nowhere in a URL (and not in any css/js etc. reference, which is what the parser should be extracting). Using --no-parser shows the bogus request isn't made, so it must be coming from the parser somewhere. I just can't figure out where.
The only place the word Lane is followed by a comma is in the meta description:
$ siege -p --no-parser https://www.geograph.org.uk/photo/9 2>&1 | grep Lane,
Not sure why 'Lane,' would be singled out in that text as being worthy of fetching.
View it on GitHub: https://github.com/JoeDog/siege/issues/208
-- Jeff Fulmer 1-717-799-8226 https://www.joedog.org/ He codes
Actually, I think I've figured it out. I went digging in the source...
It tries to extract URLs from 'meta refresh' links https://github.com/JoeDog/siege/blob/f69b44511d61db6fd3a0cd6f8684e2eef2406516/src/parser.c#L156
/* <meta http-equiv="refresh" content="0; url=http://example.com/" /> */
It seems to just look for the token 'url' inside the 'content' attribute, assuming it is a 'refresh' tag:
if (__strcasestr(ptr, "url") != NULL) {
And my description has the token "url" in there, inside the word "Burleigh"! So it then seems to just use the next word as a relative link.
I'm not sure I'm up to recompiling the code, but it seems like it would be better changed to something like:
if (__strcasestr(ptr, "; url=") != NULL || _strcasestr(ptr, ";url=") != NULL) {
Not sure if that will work in C or not. (my C is very rusty!)
Another example with url in the description to confirm...
$ siege -p https://www.geograph.org.uk/photo/27592 2>&1 | grep '(GET|description)' -P
<meta name="description" content="SU5016 :: Durley Church, near to..." />
GET /photo/Church, HTTP/1.0
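To make the false positive concrete, here is a minimal sketch comparing the loose check with the tighter one proposed above. It is not siege code: `ci_strstr` is a hypothetical stand-in for siege's `__strcasestr`, and `loose_match` / `strict_match` are illustrative names. The sample strings are the refresh example from the parser comment and the description quoted above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for siege's __strcasestr: a case-insensitive strstr. */
const char *ci_strstr(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    size_t n = strlen(needle);
    if (n == 0) {
        return haystack;
    }
    for (; *haystack != '\0'; haystack++) {
        size_t i = 0;
        /* Stops at the end of the haystack, so it never reads past the NUL. */
        while (i < n && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i])) {
            i++;
        }
        if (i == n) {
            return haystack;
        }
    }
    return NULL;
}

/* The current behaviour as described in this thread: any "url" substring matches. */
int loose_match(const char *content)
{
    return ci_strstr(content, "url") != NULL;
}

/* The proposed check: only "; url=" or ";url=" counts as a refresh target. */
int strict_match(const char *content)
{
    return ci_strstr(content, "; url=") != NULL ||
           ci_strstr(content, ";url=") != NULL;
}
```

With these helpers, `loose_match("SU5016 :: Durley Church, near to...")` is true because "Durley" contains "url", while `strict_match` on the same string is false. Both checks accept `"0; url=http://example.com/"`.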
Oh, didn't see your reply. Thanks!
Yes, that page should remain online long term :) Feel free to make requests to the domain for testing, although not large numbers of concurrent requests ;p
You want this:
if (__strcasestr(ptr, "; url=") != NULL || __strcasestr(ptr, ";url=") != NULL) {
(you missed an underscore in the second function call)
I'll test it out, but if you put that line in, you'll be on the code base if this tests out.
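The substring test would still accept any content value that happens to contain "; url=" somewhere in its text. If a stricter check were ever wanted, one option is to parse the value from the start as `<delay>; url=<target>`. The following is only a sketch under that assumption: `refresh_target` is a hypothetical helper, not part of siege, and it handles only the `;` separator and unquoted targets.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not siege code): returns a pointer to the start of the
 * target inside a meta refresh content value such as
 * "0; url=http://example.com/", or NULL when the value does not have the
 * shape "<delay>; url=<target>".
 */
const char *refresh_target(const char *content)
{
    assert(content != NULL);
    const char *p = content;

    /* Leading whitespace, then a numeric delay such as "0" or "2.5". */
    while (isspace((unsigned char)*p)) p++;
    if (!isdigit((unsigned char)*p)) return NULL;
    while (isdigit((unsigned char)*p) || *p == '.') p++;

    /* Optional whitespace, then the ';' separator. */
    while (isspace((unsigned char)*p)) p++;
    if (*p != ';') return NULL;
    p++;
    while (isspace((unsigned char)*p)) p++;

    /* The literal "url", case-insensitive; short-circuiting stops at the NUL. */
    if (tolower((unsigned char)p[0]) != 'u' ||
        tolower((unsigned char)p[1]) != 'r' ||
        tolower((unsigned char)p[2]) != 'l') {
        return NULL;
    }
    p += 3;

    /* Optional whitespace around '=', then the target itself. */
    while (isspace((unsigned char)*p)) p++;
    if (*p != '=') return NULL;
    p++;
    while (isspace((unsigned char)*p)) p++;

    return (*p != '\0') ? p : NULL;
}
```

Because the value must begin with a delay, a description such as "SU5016 :: Durley Church, near to..." is rejected at the first character. The trade-off is that refresh values written in other forms would need extra handling.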