Add functionality to sampler to take random samples within a datetime range

@scgordon got in touch by email with a great idea that shouldn't take much time to implement: what if we could use the sampler to ask, "How has average metadata quality within a repository changed over time?" For example, is LTER producing higher-quality metadata, on average, than it used to?

Modifications to make:

[x] Add a command line switch or set of switches so a begin and end datetime can be specified

e.g.,

python2 sample-metadata.py --from 20120101 --to 20130101
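As a rough illustration of the switches above, here is a minimal argparse sketch. The helper names (`parse_day`, `build_parser`, `parse_args`) and the YYYYMMDD format check are assumptions made for this example, not the sampler's actual code.

```python
import argparse
from datetime import datetime


def parse_day(text):
    """Parse a YYYYMMDD string such as 20120101 into a datetime (assumed format)."""
    try:
        return datetime.strptime(text, "%Y%m%d")
    except ValueError:
        raise argparse.ArgumentTypeError("expected YYYYMMDD, got %r" % text)


def build_parser():
    parser = argparse.ArgumentParser(description="Sample metadata documents.")
    # "from" is a Python keyword, so the values are stored under explicit names.
    parser.add_argument("--from", dest="from_date", type=parse_day, default=None,
                        help="start of the datetime range")
    parser.add_argument("--to", dest="to_date", type=parse_day, default=None,
                        help="end of the datetime range")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Reject an empty or inverted range early (a design choice for this sketch).
    if args.from_date and args.to_date and args.from_date >= args.to_date:
        parser.error("--from must be earlier than --to")
    return args
```

Leaving both switches off keeps `from_date` and `to_date` as `None`, so the sampler could fall back to its existing behavior.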

[x] Modify the result directory so it looks like this when the above switches are set:

result
  {repo_x}
    {time_range_y}
      {dialect_z}

e.g.,

.
result/
  KNB/
    20120101_20130101/
      EML/
      FGDC/
    20130101_20140101/
      EML/
      ISO/
      DRYAD/
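The layout above could be built with a small path helper along these lines. This is only a sketch: `result_dir` and its parameters are hypothetical names, and it assumes the time-range folder is simply the two dates joined by an underscore, as in the example tree.

```python
import os


def result_dir(repo, dialect, from_date=None, to_date=None, root="result"):
    """Return the output directory for one repository and dialect.

    The time-range level is added only when both dates are given, so runs
    without --from/--to keep the existing result/{repo}/{dialect} layout.
    """
    parts = [root, repo]
    if from_date and to_date:
        parts.append("%s_%s" % (from_date, to_date))
    parts.append(dialect)
    return os.path.join(*parts)
```

For example, `result_dir("KNB", "EML", "20120101", "20130101")` gives the `result/KNB/20120101_20130101/EML` path shown above (with the platform's own path separator).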

Note that this approach will produce a set of random samples, one for each time period y, each independent of the others. Thus, we can only ask how metadata quality changes, on average, over time. In a future revision, we could find a way to look at how the documents we see today have changed over time, which is likely to be a much more interesting question. This would involve following obsoletes/obsoletedBy chains over time.
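To make the chain-following idea concrete, here is a minimal sketch. It assumes we have already built a plain dictionary mapping each identifier to the identifier that replaces it; the function name and the identifiers in the usage note are hypothetical, and nothing here talks to a real repository.

```python
def follow_chain(pid, obsoleted_by):
    """Return the identifiers from pid through to its newest known version.

    obsoleted_by maps an identifier to the identifier that replaced it
    (an assumed, pre-built lookup for this sketch).
    """
    chain = [pid]
    seen = {pid}
    while chain[-1] in obsoleted_by:
        newer = obsoleted_by[chain[-1]]
        if newer in seen:
            # Guard against a malformed, circular chain.
            break
        chain.append(newer)
        seen.add(newer)
    return chain
```

With `{"doc.1": "doc.2", "doc.2": "doc.3"}`, calling `follow_chain("doc.1", ...)` walks every revision in order, so each version of the same document could be scored and compared.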

I was able to get this working. We can still tweak the way things work, so let me know if you'd like anything changed!

Documents can be sampled from within a datetime range via the --from and --to switches.
All EML versions now go in the same sub-folder.
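For readers who want a picture of what range-limited sampling means, here is a minimal sketch. It is an assumption-laden illustration rather than the sampler's real logic: documents are assumed to be dictionaries with a hypothetical "date" key, the range is treated as half-open (start included, end excluded), and the filtering happens in memory.

```python
import random
from datetime import datetime

# Matches timestamps such as 2016-01-01T00:00:00.000Z.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_timestamp(text):
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def sample_in_range(documents, from_text, to_text, sample_size, seed=None):
    """Randomly pick up to sample_size documents dated within [from, to)."""
    start = parse_timestamp(from_text)
    end = parse_timestamp(to_text)
    in_range = [doc for doc in documents
                if start <= parse_timestamp(doc["date"]) < end]
    rng = random.Random(seed)
    # Never ask for more documents than the range contains.
    return rng.sample(in_range, min(sample_size, len(in_range)))
```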

The output still looks like this:

python sample-metadata.py --sample-size 1 --from 2016-01-01T00:00:00.000Z --to 2016-03-01T00:00:00.000Z

result
├── DRYAD
│   ├── Dryad_Metadata_Application_Profile_Version_3.1
│   │   └── xml
│   │       └── 00000-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00000-sysmeta.xml
├── GOA
│   ├── EML
│   │   └── xml
│   │       └── 00008-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00008-sysmeta.xml
├── KNB
│   ├── EML
│   │   └── xml
│   │       └── 00004-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00004-sysmeta.xml
├── LTER
│   ├── EML
│   │   └── xml
│   │       └── 00001-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00001-sysmeta.xml
├── LTER_EUROPE
│   ├── EML
│   │   └── xml
│   │       └── 00003-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00003-sysmeta.xml
├── NRDC
│   ├── Geographic_MetaData_(GMD)_Extensible_Markup_Language
│   │   └── xml
│   │       └── 00005-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00005-sysmeta.xml
├── PPBIO
│   ├── EML
│   │   └── xml
│   │       └── 00007-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00007-sysmeta.xml
├── TERN
│   ├── EML
│   │   └── xml
│   │       └── 00002-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00002-sysmeta.xml
├── TFRI
│   ├── EML
│   │   └── xml
│   │       └── 00006-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00006-sysmeta.xml
├── documents.csv
├── sampled_documents.csv
└── statistics.csv

To get multiple random samples, as you want to do, you'd need to run this command once for each sample, copying the result folder somewhere else after each run, like this:

python sample-metadata.py ...
cp -r result result-period1
python sample-metadata.py ...
cp -r result result-period2
python sample-metadata.py ...
cp -r result result-period3
python sample-metadata.py ...
cp -r result result-period4
python sample-metadata.py ...
cp -r result result-period5

It's not super efficient but it should work.
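If running these by hand gets tedious, the same loop could be scripted. The sketch below is one possible wrapper, not part of the sampler: `plan_runs` and `run_all` are hypothetical helpers, the periods are whatever timestamp pairs you choose, and it assumes the script is started from the directory containing sample-metadata.py with the same interpreter that runs the wrapper.

```python
import shutil
import subprocess
import sys


def plan_runs(periods, sample_size=1, script="sample-metadata.py"):
    """Pair each sampler command with the folder its results get copied to."""
    runs = []
    for index, (start, end) in enumerate(periods, 1):
        command = [sys.executable, script,
                   "--sample-size", str(sample_size),
                   "--from", start, "--to", end]
        runs.append((command, "result-period%d" % index))
    return runs


def run_all(periods, sample_size=1):
    for command, destination in plan_runs(periods, sample_size):
        subprocess.check_call(command)
        # copytree raises an error if the destination already exists,
        # so earlier copies are never silently overwritten.
        shutil.copytree("result", destination)
```

Calling `run_all` with a list of `(start, end)` timestamp pairs would then leave one `result-periodN` folder per period.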

NCEAS / metadig

Add functionality to sampler to take random samples within a datetime range #43