Closed mosra closed 5 years ago
To answer your last question: if you try ''
first, then looking for '/index.html'
later is pointless because if 'foo/index.html'
exists, then so does 'foo'
.
I am not convinced the 'unnecessary' attempts to get a resource are problematic in terms of performance, especially since this is just a test server used during development.
However, one thing we could do is to simplify to just checking ''
if the requested path already contains an extension. That would only negatively affect directory names with internal .
, but that's an edge case I would be happy to ignore.
if you try '' first, then looking for '/index.html' later is pointless
Thinking outside of the box: do we actually need to look for /index.html
? As far as my experience goes with plain python -m http.server
, it was never a problem. Assuming I have a file at css/index.html
:
if I ask for http://localhost:8000/css
, the log of python -m http.server
shows
127.0.0.1 - - [16/Nov/2018 20:21:02] "GET /css HTTP/1.1" 301 -
127.0.0.1 - - [16/Nov/2018 20:21:02] "GET /css/ HTTP/1.1" 200 -
and serves me the index.html
correctly.
if I ask for http://localhost:8000/css/
, the log of python -m http.server
shows
127.0.0.1 - - [16/Nov/2018 20:25:36] "GET /css/ HTTP/1.1" 200 -
and serves me the index.html
correctly.
if I ask for http://localhost:8000/css/index.html
, the log of python -m http.server
shows
127.0.0.1 - - [16/Nov/2018 20:26:55] "GET /css/index.html HTTP/1.1" 200 -
and again serves me the index.html
correctly.
So, why exactly do we want to explicitly prioritize serving dir/index.html
over dir/
when serving a directory "just works"?
index.html
doesn't exist?dir/
(with no index.html
inside) and dir.html
, in which case we want to prioritize serving dir.html
over the builtin directory listing for dir/
?If it's the latter, I would propose -- when Pelican receives GET path
, it:
path
is a file or a directory (or it doesn't exist)/static/ship.jpg
)path/index.html
exists, it again just serves it as-is (which solves my second case in that it would directly serve /css/components/test/
)path.html
(if it exists)path
as-is, resulting in either a 404 or a builtin directory listingAny holes in my logic? I think this could handle all corner cases, doesn't need any complex /
stripping, follows the behavior of the builtin http.server
as much as possible and doesn't do any unnecessary work for sites that already have the expected filesystem layout.
(edit: added step 5)
So, why exactly do we want to explicitly prioritize serving
dir/index.html
overdir/
when serving a directory "just works"?
The reason it "just works" is because if path
is a directory, the builtin SimpleHTTPRequestHandler.send_head
will also first try /index.html
and /index.htm
. So in this case the Pelican server is not doing any extra work by trying /index.html
. The advantage of doing it in the Pelican server is to have the entire logic in one place.
A second reason why we shouldn't leave this to the builtin server is that it will issue a hard redirect to path/
if you send in path
and path
is a directory. A hard redirect that your browser will remember forever (until you manually reset its cache) so that from now on, it will always request path/
, even if you tell it to request path
, which can really bite you if you are still playing around with the structure of your site.
- Is that to avoid displaying the builtin directory listing when
index.html
doesn't exist?- Or is that to disambiguate the case when there's both a
dir/
(with noindex.html
inside) anddir.html
, in which case we want to prioritize servingdir.html
over the builtin directory listing fordir/
?
Yes, the latter. If path
is a directory, but path.html
also exists, the Pelican server should serve path.html
. If neither path.html
nor path/index.html
exist we can safely assume the user wants to link to the directory.
[...] doesn't need any complex
/
stripping [...]
We do need /
stripping. The use case (e.g. Wordpress) is linking to foo.html
or foo/index.html
with foo/
.
I still don't quite understand why this is such a problem. The maximum 'cost' is three extra calls to os.path.exists
, plus logging. A test whether path
is a directory would amount to the same as one of these calls.
If you do want to change this, I would propose the following:
if os.path.isdir(path):
suffixes = ['.html', '/index.html', '/']
else:
suffixes = ['', '.html']
Question: Does this need optimization? Do you have any performance problems?
A valid issue can be raised about excessive logging. Said log can be reduced to debug
, I think.
it will issue a hard redirect
Oh. Fair point. We definitely don't want that.
I still don't quite understand why this is such a problem.
Do you have any performance problems?
Not a cost or perf problem, no. Yes, excessive logging is a problem. If I reload my site and it looks weird, I'm usually going to the logs to see what it served for what request. If the output is spammed with this info, I'll have a hard time figuring out the root cause because what I'm looking for is drowned among all these. Besides that, the current order where it does the obvious thing as last makes for an impression like there's something fundamentally broken inside and it all works just by an accident.
If you do want to change this, I would propose the following:
Yes, please, fixing it at least for the files, because that's majority of requests and appending /
or /index.html
to a file just doesn't make any sense.
I guess I can live with it trying dir.html
first (and failing and logging that) and then serving correct dir/index.html
(having suffixes = ['/index.html', '.html', '/']
would make it :sparkles: Ideal™ :sparkles: for me).
Just lower the logging level :)
PS: ~If my memory serves right, this was added because some folks wanted URLs without .html
. So, in case of GET foo
, it should fetch foo.html
not foo/index.html
and GET foo/
would fetch foo/index.html
. And looking at the original PR, I'm guessing this is now broken, right @oulenz? GET foo/
will again return foo.html
?~
Yeah, my memory is bad. They just wanted to be able to get foo.html
even for GET foo/
.
Just lower the logging level :)
I need the debug logging level for other things.
I'll submit a PR that doesn't attempt to append /
or /index.html
to a file path. Because that doesn't make any sense. Is that okay? :)
GET foo/
would fetchfoo/index.html
Yes, that's the exact behavior I'm after :heart_eyes:
What about adding a check for the existence of an extension before checking for existing paths? If the server is looking for /favicon.ico
then do we really need to check for /favicon.ico.html
, /favicon.ico/index.html
, and finally /favicon.ico
?
OK, I'll bite.
SUFFIXES = ['', '.html']
# ...
def get_path_that_exists(self, original_path):
# Try to strip trailing slash
original_path = original_path.rstrip('/')
# Try to detect file by applying various suffixes
for suffix in self.SUFFIXES:
path = original_path + suffix
if os.path.isfile(self.translate_path(path)):
return path
logging.info("Tried to find `%s`, but it doesn't exist.", path)
if os.path.isdir(self.translate_path(original_path)):
return original_path + '/'
return None
We don't need to look for foo/index.html
explicitly, since base class does that.
I'm pretty new to the code, but would something like this work?
def get_path_that_exists(self, original_path):
# Try to strip trailing slash
original_path = original_path.rstrip('/')
# Avoid looking as suffixes if we need a file
if '.' in original_path:
if os.path.exists(self.translate_path(original_path)):
return original_path
return None
# Try to detect file by applying various suffixes
for suffix in self.SUFFIXES:
path = original_path + suffix
if os.path.exists(self.translate_path(path)):
return path
logging.info("Tried to find `%s`, but it doesn't exist.", path)
return None
I guess I'm not seeing how SUFFIXES
would help when looking for a specific file?
(edit: I had it right the first time)
@bryanbrattlof .
in path is not necessarily an indicator of extension, or file even.
I've submitted a PR, but the version by @avaris would work for me too.
You're right @avaris, that's my bad. I'm new at developing stuff for the web, and assumed (wrongly) that .
was a reserved character.
(Edit: a word)
@bryanbrattlof this isn't strictly web-related. These are just file paths on your own operating system. On unix systems, file and directory names can basically contain any unicode character other than /
and NULL
.
Commented on the PR by @oulenz (thank you!), but coming back to this after a few days I think the best solution here would be making the suffixes (and their order) configurable, bcause there's no single solution that fits all. In particular, there are two conflicting cases in case of GET path
:
path/
served before path.html
path.html
served before path/
Both cases above are equally valid, but at the moment Pelican prefers Case 2 and giving no possibility to do Case 1.
So I propose doing no change in the logging but instead introducing a new SERVER_PATH_SUFFIXES
setting (name suggestions welcome) that would replace pelican.server.ComplexHTTPRequestHandler.SUFFIXES
and default to the following to avoid breaking backwards compatibility:
SERVER_PATH_SUFFIXES = ['.html', '/index.html', '/', '']
Users then could replace this with e.g. ['', '.html']
to enable the Case 1 from above (and I'll set it myself to just ['']
because that's all I need).
What do you think? (I can submit a PR once we agree on this.)
Well, now I have an issue with logging (blaming @mosra for bringing this to my attention :)). I think logging after each failure until a match found is quite noisy. My proposal: store all the failed tries and log if and only if nothing matches. Successful GET
s will be pretty clean and in case of 404
you'll see what was tried.
I don't know about the setting though. This will be a fairly obscure config.
FWIW, I didn't test it but if you are using pelican -l
, (theoretically) you should be able to do this in the settings file:
from pelican.server import ComplexHTTPRequestHandler
ComplexHTTPRequestHandler.SUFFIXES = ['what', 'ever', 'order', 'you', 'want']
blaming @mosra for bringing this to my attention :)
:laughing:
Yes, I think logging only on total match failure is a good idea. :+1: In addition I think the Found
message is quite superflous too, since the GET
message from http.server
will print that anyway:
-> Found `/static/m-dark.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-dark.css HTTP/1.1" 200 -
-> Found `/static/m-debug.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-debug.css HTTP/1.1" 200 -
-> Found `/static/m-theme-dark.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-theme-dark.css HTTP/1.1" 200 -
-> Found `/static/m-grid.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-grid.css HTTP/1.1" 200 -
-> Found `/static/m-components.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-components.css HTTP/1.1" 200 -
-> Found `/static/m-layout.css`.
127.0.0.1 - - [26/Nov/2018 10:10:38] "GET /static/m-layout.css HTTP/1.1" 200 -
EDIT: oh, and the monkey-patching from within settings is great too (but your proposed cleanup of the logs is a better solution neverthleess). The public setting can wait until someone will really need to have path.html
being served after path/
, until then it's only solving a theoretical issue.
I'm fine with further reducing/removing logging.
@mosra do you have a situation where you have both path.html
and path/
, and want to use path
to refer to path/
? Because if this is only a theoretical issue, I'm not sure it's worth introducing a new setting for. FWIW, github pages will redirect you to path.html
in that case. (But of course there may be users who configure their server differently.)
Because if this is only a theoretical issue, I'm not sure it's worth introducing a new setting for.
@oulenz no, I personally don't have. It was just an alternative idea that could also lead to reduced log output. As I said above:
The public setting can wait until someone will really need to have
path.html
being served afterpath/
, until then it's only solving a theoretical issue.
In other words, I'm all for reducing logging and no we don't need to introduce the setting because reduced logging will solve this particular issue as well.
Spinning up a Pelican server after #2324 was merged, every static request prints the following in the log:
And in case of pages it's like this:
To me that seems quite excessive, since the
GET
request is already containing the correct path and so trying that first (as it was before #2324) makes the most sense (to me at least). In other words, I have the layout done in a way that doesn't require any complex handling -- the files are just where a simpleGET
expects them to be.Seeing the amount of PRs submitted just for this alone I realize this is a mine-field and every fix breaks someone's else workflow. Since I didn't find a clear reason in the PR descirption, my question is, @oulenz (since you submitted the aforementioned PR), what exactly was the reason to not try the
''
suffix first (the ideal case, basically) and what would break if it did try that first?(Looks like this is my first issue for this project ever, interesting.)