NCEAS / metadig

Approaches and tools for Metadata Improvement and Guidance.
Apache License 2.0

Add functionality to sampler to take random samples within a datetime range #43

Closed (amoeba closed 7 years ago)

amoeba commented 8 years ago

@scgordon got in touch by email and had a great idea that shouldn't take much time at all to implement. What if we could use the sampler to ask the question: "How has average metadata quality within a repository changed over time?" For example, is LTER producing higher-quality metadata, on average, than it used to?

Modifications to make:

Note that this approach draws a separate random sample within each time period, each independent of the others. Thus, we can only ask how metadata quality changes, on average, over time. In a future revision, we could find a way to look at how the documents we see today have changed over time, which is likely to be a much more interesting question. This would involve following obsoletes/obsoletedBy chains over time.

amoeba commented 7 years ago

I was able to get this working. We can still tweak the way things work so let me know!

  1. Documents can be sampled from within a datetime range via the --from and --to switches.
  2. All EML versions now go in the same sub-folder.

The output still looks like this:

python sample-metadata.py --sample-size 1 --from 2016-01-01T00:00:00.000Z --to 2016-03-01T00:00:00.000Z
result
├── DRYAD
│   ├── Dryad_Metadata_Application_Profile_Version_3.1
│   │   └── xml
│   │       └── 00000-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00000-sysmeta.xml
├── GOA
│   ├── EML
│   │   └── xml
│   │       └── 00008-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00008-sysmeta.xml
├── KNB
│   ├── EML
│   │   └── xml
│   │       └── 00004-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00004-sysmeta.xml
├── LTER
│   ├── EML
│   │   └── xml
│   │       └── 00001-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00001-sysmeta.xml
├── LTER_EUROPE
│   ├── EML
│   │   └── xml
│   │       └── 00003-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00003-sysmeta.xml
├── NRDC
│   ├── Geographic_MetaData_(GMD)_Extensible_Markup_Language
│   │   └── xml
│   │       └── 00005-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00005-sysmeta.xml
├── PPBIO
│   ├── EML
│   │   └── xml
│   │       └── 00007-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00007-sysmeta.xml
├── TERN
│   ├── EML
│   │   └── xml
│   │       └── 00002-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00002-sysmeta.xml
├── TFRI
│   ├── EML
│   │   └── xml
│   │       └── 00006-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00006-sysmeta.xml
├── documents.csv
├── sampled_documents.csv
└── statistics.csv

To get multiple random samples, as you want to do, you'd need to run this command once for each sample (one per time period), copying the result folder somewhere else after each run, like...

python sample-metadata.py ...
cp -r result result-period1
python sample-metadata.py ...
cp -r result result-period2
python sample-metadata.py ...
cp -r result result-period3
python sample-metadata.py ...
cp -r result result-period4
python sample-metadata.py ...
cp -r result result-period5

It's not super efficient but it should work.
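
If typing that out by hand gets tedious, the run-then-copy loop could be scripted. This is only a rough sketch, not something that exists in the repo: the helper names (fmt, yearly_periods, build_command, run_periods) are made up for illustration, and the only sampler switches used are the ones shown above (--sample-size, --from, --to). It assumes the script is run from the directory containing sample-metadata.py, that one-year periods are what you want, and that copying result after each run is sufficient. I haven't checked whether --to is inclusive or whether result is cleared between runs.

```python
import shutil
import subprocess
import sys
from datetime import datetime, timezone


def fmt(dt):
    """Format a datetime like the timestamps in the command above."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def yearly_periods(first_year, last_year):
    """Return (start, end) timestamp pairs, one per calendar year.

    Hypothetical helper: any list of (start, end) pairs works just as well.
    """
    return [
        (fmt(datetime(year, 1, 1, tzinfo=timezone.utc)),
         fmt(datetime(year + 1, 1, 1, tzinfo=timezone.utc)))
        for year in range(first_year, last_year + 1)
    ]


def build_command(start, end, sample_size):
    """Build the sampler invocation using only the switches shown in this thread."""
    return [
        sys.executable, "sample-metadata.py",
        "--sample-size", str(sample_size),
        "--from", start,
        "--to", end,
    ]


def run_periods(periods, sample_size, run=subprocess.run, copy=shutil.copytree):
    """Run the sampler once per period, then copy result/ to result-periodN.

    The run and copy arguments exist only so the loop can be tried out
    without actually invoking the sampler.
    """
    for n, (start, end) in enumerate(periods, start=1):
        run(build_command(start, end, sample_size), check=True)
        copy("result", f"result-period{n}")
```

Calling run_periods(yearly_periods(2012, 2016), sample_size=100) would then do five runs and leave result-period1 through result-period5 behind. The years and sample size there are placeholders, so swap in whatever periods you care about.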