nwtn closed this 10 years ago
The current dir structure is an improvement over storing all the HTML in a single dir, since Linux (and probably other OSes as well) can choke when there are too many entries in a single dir (the hard limit is on subdirectories: roughly 32K on ext3 and 64K on ext4 by default, and very large flat dirs get slow well before any limit is hit). Splitting the dirs further won't hurt anything.
I'd go with the main page's slugified URL instead of a domain name, since CSS/JS/other can be served from a different domain.
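For illustration, here is a minimal sketch of what a slug-based layout could look like. The names `slugify_url` and `page_dir`, the `data` root, and the two-character hash shard are all assumptions made up for this example, not the project's actual code. The shard level is only there to keep any single dir from collecting too many entries.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def slugify_url(url: str) -> str:
    """Turn a page URL into a filesystem-safe slug (hypothetical helper)."""
    parts = urlsplit(url)
    # Use host + path only; the scheme, query string and fragment are dropped here.
    raw = f"{parts.netloc}{parts.path}".lower()
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    return slug or "index"


def page_dir(url: str, root: str = "data") -> PurePosixPath:
    """Directory for one page and its CSS/JS, sharded by a hash prefix."""
    slug = slugify_url(url)
    # Two hex characters give at most 256 shard dirs under the root.
    shard = hashlib.md5(slug.encode("utf-8")).hexdigest()[:2]
    return PurePosixPath(root) / shard / slug
```

With this layout, resources fetched from a CDN or any other domain would sit next to the page that referenced them, which is the point of keying on the page URL rather than the domain. Pages that differ only in their query string would collide under this sketch, so a real version would need to fold the query in somehow.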
Right, that makes sense.
I was initially thinking of only grabbing local (same-domain) JS and CSS resources, as a compromise to try to avoid grabbing, e.g., 750K copies of jQuery.
If we're grabbing JS, we probably should also grab 750K copies of jQuery, otherwise the results might be biased.
Hmm...you're right. I mean, it should be relatively easy to determine common libraries from just the filenames, but dropping all externally hosted resources would definitely be a problem. I will abandon that idea.
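As a rough sketch of the filename idea, something like the following could tag common libraries after the fact. The `KNOWN_LIBRARIES` list and the `guess_library` name are hypothetical, and matching on filenames alone will miss renamed or bundled copies.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical, deliberately short list; a real one would be much longer.
KNOWN_LIBRARIES = ("jquery", "bootstrap", "modernizr")


def guess_library(resource_url: str):
    """Guess a library name from the resource's filename, or return None."""
    filename = PurePosixPath(urlsplit(resource_url).path).name.lower()
    for lib in KNOWN_LIBRARIES:
        # Match "jquery.js", "jquery.min.js", "jquery-1.11.1.min.js", etc.,
        # but not longer names such as "jqueryui.js".
        if re.match(rf"{re.escape(lib)}([.\-_]|$)", filename):
            return lib
    return None
```

This only labels files that have already been downloaded; it does not skip them, so externally hosted copies would still be part of the data set.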
Based on the Twitter discussion re: grabbing more pages plus CSS and JS resources for each domain, I was thinking of making a separate dir for each domain. Before I make a change, though, I was wondering what the reasoning was behind the current dir structure and whether changing it would mess anything up.