usage examples

nwtn commented 11 years ago

It would be awesome to see some typical examples of how to search through these data for specific tags, etc. It could eliminate a barrier to use.

marcoscaceres commented 11 years ago

@nwtn agreed. Do you want to volunteer to do them? They are just a few simple greps from the command line.

nwtn commented 11 years ago

sure

marcoscaceres commented 11 years ago

Here is a somewhat crappy one:

find ./ -print | xargs grep -l picturefill

marcoscaceres commented 11 years ago

Better one, thanks to @yoavweiss. Finds the HTML files that contain "apple-touch-icon" and prints how many files matched:

find ./ -name "*ml.txt" | xargs grep -l apple-touch-icon | wc -l
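For anyone adapting the one-liner above, here is a rough sketch of the difference between counting pages and counting matching lines. It runs against a throwaway directory with made-up file names rather than the real corpus, and it assumes a find/xargs pair that supports -print0 and -0 (GNU and BSD both do), which keeps file names with spaces intact.

```shell
# Hypothetical sandbox standing in for the corpus; the file names are made up.
corpus=$(mktemp -d)
printf '<link rel="apple-touch-icon" href="a.png">\n<link rel="apple-touch-icon" href="b.png">\n' > "$corpus/site one.html.txt"
printf '<p>no icon here</p>\n' > "$corpus/site2.html.txt"
printf 'apple-touch-icon\n' > "$corpus/site3.hdr.txt"

# Pages that mention the string at least once.
# -print0 / -0 pass names NUL-separated, so "site one.html.txt" is not split in two.
pages=$(find "$corpus" -type f -name "*ml.txt" -print0 | xargs -0 grep -l apple-touch-icon | wc -l | tr -d ' ')

# Matching lines across those pages (-h drops the file name prefix).
lines=$(find "$corpus" -type f -name "*ml.txt" -print0 | xargs -0 grep -h apple-touch-icon | wc -l | tr -d ' ')

echo "pages=$pages lines=$lines"
rm -rf "$corpus"
```

With grep -l the count is per file, so a page with two icons still counts once; dropping -l counts matching lines instead.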

oli commented 10 years ago

If you want to use the included tools, check out:

Webdevdata/webdevdata-tools

These tools produce comma-separated output, with one line per matched page.

wdd_select [-atrs=attr1,attr2...] [CSS selector] [file]
wdd_meta_names [file]
wdd_html_manifest [file]
wdd_tag_count [file]

baptistelebail/webdevdata.org

These produce semicolon-separated summaries. For webdevdata-query.sh this includes:

CSS Query used
Total number of instances
Total number of pages with feature
Max number of instances per page

Refer to the Wiki page for details and examples

webdevdata-query.sh
webdevdata-stats-HTML-attributes.sh
webdevdata-stats-HTML-tags.sh

webdevdata-query.sh took about 40 minutes per pass on the 2013-10 dataset for me, regardless of the number of CSS-like query terms, so if you’re querying multiple things (e.g. all the sectioning elements) list them all in one query, e.g.:

./webdevdata-query.sh webdevdata.org-2013-10-30-231036 body article section nav h1 h2 h3 h4 h5 h6 hgroup main

HTH

marcoscaceres commented 10 years ago

@oli, so I think what we are going to do is allow each repo to provide examples of its own usage. The front page of webdevdata.org is currently very poorly maintained :(

oli commented 10 years ago

@marcoscaceres here are some more to get you going then:

Count the number of files containing <html (-i: case-insensitive, -l: print only the names of matching files and stop at the first match in each (faster), via xargs so grep searches file contents, not filenames):

find ./ -type f | xargs grep -il "<html" | wc -l

Find files with the extension ".assembler" (-name), then execute head on each one ('{}' +) to display the first two lines (-n 2):

find ./ -type f -name "*.assembler" -exec head -n 2 '{}' +
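A small sketch of what the '{}' + form does compared with '{}' \; in case the output looks surprising. The directory and the two .assembler files are hypothetical stand-ins, not part of the dataset.

```shell
# Hypothetical sandbox; the file names and contents are made up.
corpus=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$corpus/one.assembler"
printf 'lineA\nlineB\nlineC\n' > "$corpus/two.assembler"

# '+' batches the files into a single head call, so head prints a
# "==> name <==" banner before each file's first two lines.
batched=$(find "$corpus" -type f -name "*.assembler" -exec head -n 2 '{}' +)

# '\;' runs head once per file, so no banners appear.
single=$(find "$corpus" -type f -name "*.assembler" -exec head -n 2 '{}' \;)

printf '%s\n' "$batched"
printf '%s\n' "$single"
rm -rf "$corpus"
```

The + form starts fewer processes, which matters on a corpus this size; the \; form is handy when the banners get in the way of further piping.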

Count the number of non-header files in the corpus that are smaller than 100 bytes (-size -100c):

find ./ -type f -not -name "*.hdr.txt" -size -100c | wc -l
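To check how the size filter behaves before running it on the whole corpus, something like the following sketch may help. The files are hypothetical, and -not is a GNU/BSD spelling of the POSIX ! operator.

```shell
# Hypothetical sandbox; the file names and sizes are made up.
corpus=$(mktemp -d)
printf 'tiny' > "$corpus/a.html.txt"                    # 4 bytes
printf '%0200d' 0 > "$corpus/b.html.txt"                # 200 bytes of "0"
printf 'HTTP/1.1 200 OK\n' > "$corpus/a.html.hdr.txt"   # small, but a header file

# -size -100c keeps files under 100 bytes; the "c" suffix means bytes.
# The header file is small too, but -not -name "*.hdr.txt" filters it out.
small=$(find "$corpus" -type f -not -name "*.hdr.txt" -size -100c | wc -l | tr -d ' ')

echo "small non-header files: $small"
rm -rf "$corpus"
```

Only a.html.txt qualifies here: b.html.txt is too large and the header file is excluded by name.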

HTH!

marcoscaceres commented 10 years ago

@oli super helpful! Thanks so much for all these! Ok, we now have a pretty good set to show how this all works.

Will probably just start by collating all these and adding them to the README.

Webdevdata / webdevdata.org

usage examples #3
