Webdevdata / fetcher

Tool to download website data.
The Unlicense

dir = hash.hexdigest()[:2] -- why? #4

Closed · nwtn closed this issue 10 years ago

nwtn commented 10 years ago

Based on Twitter discussion re: grabbing more pages + css and js resources for each domain, I was thinking of making a separate dir for each domain. Before I made a change, though, I was wondering what the reasoning was behind the current dir structure and if changing it would mess anything up.

yoavweiss commented 10 years ago

The current dir structure is an improvement over storing all the HTML in a single dir, since Linux (and probably other OSes as well) chokes when there are too many files in a single dir (for Linux's ext4 the limit is 32K). Splitting the dirs further won't hurt anything.
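For reference, here is a minimal sketch of the hash-prefix bucketing the issue title refers to; the choice of MD5, the `data` root, and the `shard_path` name are illustrative assumptions, not necessarily what fetcher actually does:

```python
import hashlib
import os

def shard_path(url, root="data"):
    # Hash the URL and use the first two hex chars as a bucket dir,
    # mirroring the `dir = hash.hexdigest()[:2]` scheme asked about above.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # hash choice is an assumption
    bucket = digest[:2]  # 256 possible buckets (00..ff)
    path = os.path.join(root, bucket)
    os.makedirs(path, exist_ok=True)
    return os.path.join(path, digest + ".html")
```

With two hex characters there are at most 256 buckets, so each one stays far below the per-directory limit mentioned above.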

I'd go with the main page's slugified URL instead of a domain name, since CSS/JS/other can be served from a different domain.
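A rough sketch of what a slugified-URL directory name could look like (the exact slug rules and the `slugify_url` name are assumptions for illustration):

```python
import re
from urllib.parse import urlsplit

def slugify_url(url):
    # Turn the full page URL (not just the domain) into a filesystem-safe
    # dir name, so resources served from other domains still end up
    # grouped with the page that referenced them.
    parts = urlsplit(url)
    raw = parts.netloc + parts.path
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return slug or "root"
```

For example, `slugify_url("http://example.com/blog/post.html")` would give `example-com-blog-post-html`.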

nwtn commented 10 years ago

Right, that makes sense.

I was initially thinking of only grabbing local (same-domain) JS and CSS resources, as a compromise to avoid grabbing, e.g., 750k copies of jQuery.
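For illustration, a same-domain check along those lines might look like this (the `is_same_domain` name and the exact comparison are hypothetical):

```python
from urllib.parse import urljoin, urlsplit

def is_same_domain(page_url, resource_url):
    # True only when the JS/CSS resource resolves to the same host as the
    # page itself, i.e. the "local resources only" compromise described above.
    page_host = urlsplit(page_url).netloc.lower()
    resource_host = urlsplit(urljoin(page_url, resource_url)).netloc.lower()
    return page_host == resource_host
```

`urljoin` resolves relative resource URLs against the page URL, so links like `/static/site.js` count as same-domain.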

yoavweiss commented 10 years ago

If we're grabbing JS, we probably should also grab the 750K copies of jQuery; otherwise the results might be biased.

nwtn commented 10 years ago

Hmm...you're right. I mean, it should be relatively easy to determine common libraries from just the filenames, but dropping all externally hosted resources would definitely be a problem. I will abandon that idea.