Open nwtn opened 11 years ago
@nwtn agreed. Do you want to volunteer to do them? They are just a few simple greps from the command line.
sure
Here is a somewhat crappy one:
find ./ -print | xargs grep -l picturefill
Better one, thanks to @yoavweiss. It finds the HTML files that contain "apple-touch-icon" and prints how many there are:
find ./ -name "*ml.txt" | xargs grep -l apple-touch-icon | wc -l
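If any of the dumped filenames contain spaces, the plain pipe above can split them into several arguments. Here is a rough sketch of a NUL-delimited variant, assuming a find and xargs that support -print0 and -0 (GNU and BSD both do). The scratch directory and file names are made up for the demo and are not part of the dataset:

```shell
# Hypothetical sample data: a scratch directory standing in for a dataset dump.
demo=$(mktemp -d)
printf '<link rel="apple-touch-icon" href="a.png">\n' > "$demo/site one.html.txt"
printf '<p>no icon here</p>\n' > "$demo/site two.html.txt"
printf '<link rel="apple-touch-icon" href="b.png">\n' > "$demo/site3.html.txt"

# -print0 / -0 pass NUL-delimited names, so spaces in filenames are safe.
# grep -l prints each matching file once; wc -l counts those files.
# tr strips the padding some wc implementations add.
count=$(find "$demo" -type f -name "*ml.txt" -print0 \
  | xargs -0 grep -l apple-touch-icon | wc -l | tr -d ' ')
echo "pages with apple-touch-icon: $count"

rm -rf "$demo"
```

On the real data you would point find at the dataset directory instead of the scratch one.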
If you want to use the included tools, check out:
These tools produce comma-separated output, with one line per matched page.
wdd_select [-atrs=attr1,attr2...] [CSS selector] [file]
wdd_meta_names [file]
wdd_html_manifest [file]
wdd_tag_count [file]
These produce semicolon-separated summaries. For webdevdata-query.sh this includes:
Refer to the Wiki page for details and examples.
webdevdata-query.sh
webdevdata-stats-HTML-attributes.sh
webdevdata-stats-HTML-tags.sh
webdevdata-query.sh took about 40 minutes per pass on the 2013-10 dataset for me, regardless of the number of CSS-like query terms, so if you're querying multiple things (e.g. all the sectioning elements), list them all in one query:
./webdevdata-query.sh webdevdata.org-2013-10-30-231036 body article section nav h1 h2 h3 h4 h5 h6 hgroup main
HTH
@oli, so I think what we are going to do is allow each repo to provide examples of its own usage. The front page of webdevdata.org is currently very poorly maintained :(
@marcoscaceres here are some more to get you going then:
Count the files containing <html (-i: case-insensitive, -l: stop on first match (faster), via xargs to grep content not filenames):
find ./ -type f | xargs grep -il "<html" | wc -l
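One thing to keep in mind: with -l the pipeline counts files, not occurrences. A small sketch of the difference, using grep -o (one output line per match) for the occurrence count. The sample files are invented for the demo, and -print0 / -0 are assumed to be available:

```shell
# Hypothetical sample data in a scratch directory.
demo=$(mktemp -d)
printf '<HTML>\n<body></body>\n</HTML>\n' > "$demo/a.html.txt"
printf '<html lang="en">\n<!-- <html in a comment -->\n' > "$demo/b.html.txt"
printf 'plain text\n' > "$demo/c.hdr.txt"

# Files with at least one case-insensitive match (what -l | wc -l reports).
files=$(find "$demo" -type f -print0 | xargs -0 grep -il "<html" | wc -l | tr -d ' ')

# Individual matches: -o prints every match on its own line.
hits=$(find "$demo" -type f -print0 | xargs -0 grep -io "<html" | wc -l | tr -d ' ')

echo "files: $files, occurrences: $hits"
rm -rf "$demo"
```

Here two files match but there are three occurrences, because the second file also mentions <html inside a comment.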
Find files by name (-name), then execute head on each one ('{}' +) to display the first two lines (-n 2):
find ./ -type f -name "*.assembler" -exec head -n 2 '{}' +
Count the files, other than *.hdr.txt ones, that are smaller than 100 bytes (-size -100c):
find ./ -type f -not -name "*.hdr.txt" -size -100c | wc -l
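To sanity-check the size filter before running it over the whole dataset, you can try it on a few throwaway files. This is only a sketch: the file names and sizes are made up, and it assumes a find that accepts -not (GNU and BSD do; strict POSIX spells it !):

```shell
# Hypothetical sample data in a scratch directory.
demo=$(mktemp -d)
printf 'tiny\n' > "$demo/tiny.html.txt"      # 5 bytes: counted
printf 'tiny\n' > "$demo/tiny.hdr.txt"       # 5 bytes, but excluded by -not -name
printf '%200s\n' '' > "$demo/big.html.txt"   # 201 bytes: too large for -size -100c

# -size -100c matches files smaller than 100 bytes (c = bytes).
small=$(find "$demo" -type f -not -name "*.hdr.txt" -size -100c | wc -l | tr -d ' ')
echo "small non-header files: $small"

rm -rf "$demo"
```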
HTH!
@oli super helpful! Thanks so much for all these! Ok, we now have a pretty good set to show how this all works.
Will probably just start by collating all these and adding them to the README.
It would be awesome to see some typical examples of how to search through these data for specific tags, etc. It could eliminate a barrier to use.