Siege fetching a non-existent URL not in source

barryhunter commented 2 years ago

I've got a strange issue with Siege fetching a URL that's not in the source of the page.

Can be reproduced with a single '--print' request:

$ siege -p https://www.geograph.org.uk/photo/9 | grep Lane

Shows a fetch to /Lane, which doesn't exist...

GET /photo/Lane, HTTP/1.0

Transactions:                      2 hits
Availability:                 100.00 %
Elapsed time:                   0.05 secs

In a normal run (without -p) it shows up as a 404:

HTTP/1.1 404 0.08 secs: 2322 bytes ==> GET /photo/Lane,

The word 'Lane' does appear in the page in lots of places, but nowhere in a URL (and not in any css/js etc. reference, which is what the parser should be extracting). Using --no-parser shows the bogus request isn't made, so it is coming from the parser somewhere. Just can't figure out where.

The only place the word Lane is followed by a comma is in the meta description:

$ siege -p --no-parser https://www.geograph.org.uk/photo/9 2>&1 | grep Lane,
        <meta name="description" content="SO8601 :: Burleigh Lane, near to Minchinhampton, Gloucestershire, Great Britain by Helena Downton" />

Not sure why Lane, would be singled out in that text as being worthy of fetching.

JoeDog commented 2 years ago

That's strange. Will that page be available for a while? I'll try to debug this when I get a chance (hopefully this weekend)


barryhunter commented 2 years ago

Actually, I think I've figured it out. Went digging in the source...

It tries to extract URLs from 'meta refresh' links https://github.com/JoeDog/siege/blob/f69b44511d61db6fd3a0cd6f8684e2eef2406516/src/parser.c#L156

  /* <meta http-equiv="refresh" content="0; url=http://example.com/" /> */

It seems to just look for the token 'url' inside the 'content' attribute, assuming it's a 'refresh' tag:

        if (__strcasestr(ptr, "url") != NULL) {

And my description has the token "url" in there! Burleigh - so it then seems to just use the next word as a relative link.
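
To illustrate why the loose check fires, here is a minimal sketch. It is not siege's code: `find_ci` is a hypothetical stand-in for a case-insensitive substring search like `__strcasestr`, and `looks_like_refresh_loose` is a name made up for this example. It only shows that any content value containing the letters "url" passes the test, whether or not it is a refresh.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a case-insensitive strstr (not siege's implementation). */
static const char *find_ci(const char *haystack, const char *needle)
{
    if (*needle == '\0') return haystack;
    for (; *haystack != '\0'; haystack++) {
        size_t i = 0;
        /* Compare character by character, ignoring case. */
        while (needle[i] != '\0' && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i])) {
            i++;
        }
        if (needle[i] == '\0') return haystack; /* whole needle matched */
    }
    return NULL;
}

/* Assumed shape of the loose check: "url" anywhere in the content value. */
static int looks_like_refresh_loose(const char *content)
{
    return find_ci(content, "url") != NULL;
}
```

With this check, a real refresh value such as "0; url=http://example.com/" and a description containing "Burleigh" are indistinguishable: both contain "url".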

Not sure if I'm up to recompiling the code, but it seems it would be better changed to something like:

   if (__strcasestr(ptr, "; url=") != NULL || _strcasestr(ptr, ";url=") != NULL) {

Not sure if that will work in C or not. (my C is very rusty!)

Another example with url in the description to confirm...

$ siege -p https://www.geograph.org.uk/photo/27592 2>&1 | grep '(GET|description)' -P
 <meta name="description" content="SU5016 :: Durley Church, near to..." />
GET /photo/Church, HTTP/1.0
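
A rough model of the observed behaviour, for anyone wanting to reason about it without reading parser.c. This is a guess that happens to be consistent with both examples, not the actual siege logic: the delimiter set `DELIMS` and the function names are assumptions. It splits the content value into tokens, looks for the first token containing "url" (case-insensitively), and returns the token that follows it.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed delimiter set; siege's real tokenizer may differ. */
#define DELIMS " \t\r\n='\""

/* True if the token of the given length contains "url", ignoring case. */
static int token_has_url(const char *tok, size_t len)
{
    for (size_t i = 0; i + 3 <= len; i++) {
        if (tolower((unsigned char)tok[i]) == 'u' &&
            tolower((unsigned char)tok[i + 1]) == 'r' &&
            tolower((unsigned char)tok[i + 2]) == 'l') {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical model: copy the token after the first "url"-bearing token
 * into out (truncating if needed). Returns 1 on success, 0 if none found. */
static int guess_refresh_target(const char *content, char *out, size_t outlen)
{
    const char *p = content;
    int seen_url = 0;

    if (outlen == 0) return 0;
    for (;;) {
        p += strspn(p, DELIMS);            /* skip delimiters */
        if (*p == '\0') return 0;          /* ran out of tokens */
        size_t len = strcspn(p, DELIMS);   /* length of this token */
        if (seen_url) {
            if (len >= outlen) len = outlen - 1;
            memcpy(out, p, len);
            out[len] = '\0';
            return 1;
        }
        seen_url = token_has_url(p, len);
        p += len;
    }
}
```

Under these assumptions, a genuine refresh value yields the target URL, while the two descriptions yield "Lane," and "Church,", which would then be treated as relative links under /photo/.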

barryhunter commented 2 years ago

Oh, didn't see your reply. Thanks!

Yes, that page should remain online long term :) Feel free to make requests to the domain for testing, although not large numbers of concurrent requests ;p

JoeDog commented 2 years ago

You want this:

if (__strcasestr(ptr, "; url=") != NULL || __strcasestr(ptr, ";url=") != NULL) {

(you missed an underscore in the second function call)
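
A small sketch of how the stricter condition behaves on the strings from this thread. It is standalone and does not use siege's internals: `strstr_nocase` is a hypothetical replacement for `__strcasestr`, and `looks_like_refresh_strict` is a name invented here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical case-insensitive substring search (stand-in for __strcasestr). */
static const char *strstr_nocase(const char *haystack, const char *needle)
{
    if (*needle == '\0') return haystack;
    for (; *haystack != '\0'; haystack++) {
        size_t i = 0;
        while (needle[i] != '\0' && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i])) {
            i++;
        }
        if (needle[i] == '\0') return haystack; /* full match */
    }
    return NULL;
}

/* The proposed condition: require "; url=" or ";url=" rather than bare "url". */
static int looks_like_refresh_strict(const char *content)
{
    return strstr_nocase(content, "; url=") != NULL ||
           strstr_nocase(content, ";url=") != NULL;
}
```

This accepts "0; url=http://example.com/" and "0;URL=/next" and rejects both meta descriptions. One caveat: a refresh written with other spacing (for example two spaces after the semicolon) would not match either pattern, so this is a narrowing of the check, not a full parse of the refresh syntax.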

I'll test it out, but if you put that line in, you'll be on the code base if this tests out.


JoeDog / siege

Siege fetching a non-existent URL not in source #208