ComplexHTTPRequestHandler is doing excessive amount of work for non-complex cases

mosra commented 6 years ago

Spinning up a Pelican server after #2324 was merged, every static request prints the following in the log:

-> Tried to find `/static/ship.jpg.html`, but it doesn't exist.
-> Tried to find `/static/ship.jpg/index.html`, but it doesn't exist.
-> Tried to find `/static/ship.jpg/`, but it doesn't exist.
-> Found `/static/ship.jpg`.
127.0.0.1 - - [16/Nov/2018 19:16:29] "GET /static/ship.jpg HTTP/1.1" 200 -

And in case of pages it's like this:

-> Tried to find `/css/components/test.html`, but it doesn't exist.
-> Found `/css/components/test/index.html`.
127.0.0.1 - - [16/Nov/2018 19:16:32] "GET /css/components/test/ HTTP/1.1" 200 -

To me that seems quite excessive, since the GET request is already containing the correct path and so trying that first (as it was before #2324) makes the most sense (to me at least). In other words, I have the layout done in a way that doesn't require any complex handling -- the files are just where a simple GET expects them to be.

Seeing the amount of PRs submitted just for this alone I realize this is a mine-field and every fix breaks someone's else workflow. Since I didn't find a clear reason in the PR descirption, my question is, @oulenz (since you submitted the aforementioned PR), what exactly was the reason to not try the '' suffix first (the ideal case, basically) and what would break if it did try that first?

(Looks like this is my first issue for this project ever, interesting.)

oulenz commented 6 years ago

To answer your last question: if you try '' first, then looking for '/index.html' later is pointless because if 'foo/index.html' exists, then so does 'foo'.

I am not convinced the 'unnecessary' attempts to get a resource are problematic in terms of performance, especially since this is just a test server used during development.

However, one thing we could do is to simplify to just checking '' if the requested path already contains an extension. That would only negatively affect directory names with internal ., but that's an edge case I would be happy to ignore.

mosra commented 6 years ago

if you try '' first, then looking for '/index.html' later is pointless

Thinking outside of the box: do we actually need to look for /index.html? As far as my experience goes with plain python -m http.server, it was never a problem. Assuming I have a file at css/index.html:

if I ask for http://localhost:8000/css, the log of python -m http.server shows

127.0.0.1 - - [16/Nov/2018 20:21:02] "GET /css HTTP/1.1" 301 -
127.0.0.1 - - [16/Nov/2018 20:21:02] "GET /css/ HTTP/1.1" 200 -

and serves me the index.html correctly.

if I ask for http://localhost:8000/css/, the log of python -m http.server shows
```
127.0.0.1 - - [16/Nov/2018 20:25:36] "GET /css/ HTTP/1.1" 200 -
```
and serves me the index.html correctly.
if I ask for http://localhost:8000/css/index.html, the log of python -m http.server shows
```
127.0.0.1 - - [16/Nov/2018 20:26:55] "GET /css/index.html HTTP/1.1" 200 -
```
and again serves me the index.html correctly.

So, why exactly do we want to explicitly prioritize serving dir/index.html over dir/ when serving a directory "just works"?

Is that to avoid displaying the builtin directory listing when index.html doesn't exist?
Or is that to disambiguate the case when there's both a dir/ (with no index.html inside) and dir.html, in which case we want to prioritize serving dir.html over the builtin directory listing for dir/?

If it's the latter, I would propose -- when Pelican receives GET path, it:

first checks if path is a file or a directory (or it doesn't exist)
if it's a file (that exists), doesn't do anything crazy and serves it as-is (which solves my first case in that it would directly serve /static/ship.jpg)
if it's a dir and path/index.html exists, it again just serves it as-is (which solves my second case in that it would directly serve /css/components/test/)
if all else fails, it tries serving path.html (if it exists)
otherwise, it again serves path as-is, resulting in either a 404 or a builtin directory listing

Any holes in my logic? I think this could handle all corner cases, doesn't need any complex / stripping, follows the behavior of the builtin http.server as much as possible and doesn't do any unnecessary work for sites that already have the expected filesystem layout.

(edit: added step 5)

oulenz commented 6 years ago

So, why exactly do we want to explicitly prioritize serving dir/index.html over dir/ when serving a directory "just works"?

The reason it "just works" is because if path is a directory, the builtin SimpleHTTPRequestHandler.send_head will also first try /index.html and /index.htm. So in this case the Pelican server is not doing any extra work by trying /index.html. The advantage of doing it in the Pelican server is to have the entire logic in one place.

A second reason why we shouldn't leave this to the builtin server is that it will issue a hard redirect to path/ if you send in path and path is a directory. A hard redirect that your browser will remember forever (until you manually reset its cache) so that from now on, it will always request path/, even if you tell it to request path, which can really bite you if you are still playing around with the structure of your site.

Is that to avoid displaying the builtin directory listing when index.html doesn't exist?

Or is that to disambiguate the case when there's both a dir/ (with no index.html inside) and dir.html, in which case we want to prioritize serving dir.html over the builtin directory listing for dir/?

Yes, the latter. If path is a directory, but path.html also exists, the Pelican server should serve path.html. If neither path.html nor path/index.html exist we can safely assume the user wants to link to the directory.

[...] doesn't need any complex / stripping [...]

We do need / stripping. The use case (e.g. Wordpress) is linking to foo.html or foo/index.html with foo/.

I still don't quite understand why this is such a problem. The maximum 'cost' is three extra calls to os.path.exists, plus logging. A test whether path is a directory would amount to the same as one of these calls.

If you do want to change this, I would propose the following:

if os.path.isdir(path):
    suffixes = ['.html', '/index.html', '/']
else:
    suffixes = ['', '.html']

avaris commented 6 years ago

Question: Does this need optimization? Do you have any performance problems?

A valid issue can be raised about excessive logging. Said log can be reduced to debug, I think.

mosra commented 6 years ago

it will issue a hard redirect

Oh. Fair point. We definitely don't want that.

I still don't quite understand why this is such a problem.

Do you have any performance problems?

Not a cost or perf problem, no. Yes, excessive logging is a problem. If I reload my site and it looks weird, I'm usually going to the logs to see what it served for what request. If the output is spammed with this info, I'll have a hard time figuring out the root cause because what I'm looking for is drowned among all these. Besides that, the current order where it does the obvious thing as last makes for an impression like there's something fundamentally broken inside and it all works just by an accident.

If you do want to change this, I would propose the following:

Yes, please, fixing it at least for the files, because that's majority of requests and appending / or /index.html to a file just doesn't make any sense.

I guess I can live with it trying dir.html first (and failing and logging that) and then serving correct dir/index.html (having suffixes = ['/index.html', '.html', '/'] would make it :sparkles: Ideal™ :sparkles: for me).

avaris commented 6 years ago

Just lower the logging level :)

PS: ~If my memory serves right, this was added because some folks wanted URLs without .html. So, in case of GET foo, it should fetch foo.html not foo/index.html and GET foo/ would fetch foo/index.html. And looking at the original PR, I'm guessing this is now broken, right @oulenz? GET foo/ will again return foo.html?~ Yeah, my memory is bad. They just wanted to be able to get foo.html even for GET foo/.

mosra commented 6 years ago

Just lower the logging level :)

I need the debug logging level for other things.

I'll submit a PR that doesn't attempt to append / or /index.html to a file path. Because that doesn't make any sense. Is that okay? :)

GET foo/ would fetch foo/index.html

Yes, that's the exact behavior I'm after :heart_eyes:

bryanbrattlof commented 6 years ago

What about adding a check for the existence of an extension before checking for existing paths? If the server is looking for /favicon.ico then do we really need to check for /favicon.ico.html, /favicon.ico/index.html, and finally /favicon.ico?

avaris commented 6 years ago

OK, I'll bite.

    SUFFIXES = ['', '.html']

    # ...

    def get_path_that_exists(self, original_path):
        # Try to strip trailing slash
        original_path = original_path.rstrip('/')
        # Try to detect file by applying various suffixes
        for suffix in self.SUFFIXES:
            path = original_path + suffix
            if os.path.isfile(self.translate_path(path)):
                return path
            logging.info("Tried to find `%s`, but it doesn't exist.", path)
        if os.path.isdir(self.translate_path(original_path)):
            return original_path + '/'
        return None

We don't need to look for foo/index.html explicitly, since base class does that.

bryanbrattlof commented 6 years ago

I'm pretty new to the code, but would something like this work?

    def get_path_that_exists(self, original_path):                                                                                                                                                     
        # Try to strip trailing slash                                                                                                                                                                  
        original_path = original_path.rstrip('/')                                                                                                                                                      

        # Avoid looking as suffixes if we need a file                                                                                                                                                  
        if '.' in original_path:                                                                                                                                                                       
            if os.path.exists(self.translate_path(original_path)):                                                                                                                                     
                return original_path                                                                                                                                                                   
            return None                                                                                                                                                                                

        # Try to detect file by applying various suffixes                                                                                                                                              
        for suffix in self.SUFFIXES:                                                                                                                                                                   
            path = original_path + suffix                                                                                                                                                              
            if os.path.exists(self.translate_path(path)):                                                                                                                                              
                return path                                                                                                                                                                            
            logging.info("Tried to find `%s`, but it doesn't exist.", path)                                                                                                                            
        return None

I guess I'm not seeing how SUFFIXES would help when looking for a specific file?

(edit: I had it right the first time)

avaris commented 6 years ago

@bryanbrattlof . in path is not necessarily an indicator of extension, or file even.

oulenz commented 6 years ago

I've submitted a PR, but the version by @avaris would work for me too.

bryanbrattlof commented 6 years ago

You're right @avaris, that's my bad. I'm new at developing stuff for the web, and assumed (wrongly) that . was a reserved character.

(Edit: a word)

oulenz commented 6 years ago

@bryanbrattlof this isn't strictly web-related. These are just file paths on your own operating system. On unix systems, file and directory names can basically contain any unicode character other than / and NULL.

mosra commented 5 years ago

Commented on the PR by @oulenz (thank you!), but coming back to this after a few days I think the best solution here would be making the suffixes (and their order) configurable, bcause there's no single solution that fits all. In particular, there are two conflicting cases in case of GET path:

having path/ served before path.html
having path.html served before path/

Both cases above are equally valid, but at the moment Pelican prefers Case 2 and giving no possibility to do Case 1.

So I propose doing no change in the logging but instead introducing a new SERVER_PATH_SUFFIXES setting (name suggestions welcome) that would replace pelican.server.ComplexHTTPRequestHandler.SUFFIXES and default to the following to avoid breaking backwards compatibility:

SERVER_PATH_SUFFIXES = ['.html', '/index.html', '/', '']

Users then could replace this with e.g. ['', '.html'] to enable the Case 1 from above (and I'll set it myself to just [''] because that's all I need).

What do you think? (I can submit a PR once we agree on this.)

avaris commented 5 years ago

Well, now I have an issue with logging (blaming @mosra for bringing this to my attention :)). I think logging after each failure until a match found is quite noisy. My proposal: store all the failed tries and log if and only if nothing matches. Successful GETs will be pretty clean and in case of 404 you'll see what was tried.

I don't know about the setting though. This will be a fairly obscure config.

FWIW, I didn't test it but if you are using pelican -l, (theoretically) you should be able to do this in the settings file:

from pelican.server import ComplexHTTPRequestHandler
ComplexHTTPRequestHandler.SUFFIXES = ['what', 'ever', 'order', 'you', 'want']

mosra commented 5 years ago

blaming @mosra for bringing this to my attention :)

:laughing:

Yes, I think logging only on total match failure is a good idea. :+1: In addition I think the Found message is quite superflous too, since the GET message from http.server will print that anyway:

-> Found `/static/m-dark.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-dark.css HTTP/1.1" 200 -
-> Found `/static/m-debug.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-debug.css HTTP/1.1" 200 -
-> Found `/static/m-theme-dark.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-theme-dark.css HTTP/1.1" 200 -
-> Found `/static/m-grid.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-grid.css HTTP/1.1" 200 -
-> Found `/static/m-components.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-components.css HTTP/1.1" 200 -
-> Found `/static/m-layout.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-layout.css HTTP/1.1" 200 -

EDIT: oh, and the monkey-patching from within settings is great too (but your proposed cleanup of the logs is a better solution neverthleess). The public setting can wait until someone will really need to have path.html being served after path/, until then it's only solving a theoretical issue.

oulenz commented 5 years ago

I'm fine with further reducing/removing logging.

@mosra do you have a situation where you have both path.html and path/, and want to use path to refer to path/? Because if this is only a theoretical issue, I'm not sure it's worth introducing a new setting for. FWIW, github pages will redirect you to path.html in that case. (But of course there may be users who configure their server differently.)

mosra commented 5 years ago

Because if this is only a theoretical issue, I'm not sure it's worth introducing a new setting for.

@oulenz no, I personally don't have. It was just an alternative idea that could also lead to reduced log output. As I said above:

The public setting can wait until someone will really need to have path.html being served after path/, until then it's only solving a theoretical issue.

In other words, I'm all for reducing logging and no we don't need to introduce the setting because reduced logging will solve this particular issue as well.

getpelican / pelican

ComplexHTTPRequestHandler is doing excessive amount of work for non-complex cases #2456