Closed by amoeba 7 years ago
I was able to get this working. We can still tweak the way things work so let me know!
The script now takes --from and --to switches. The output still looks like this:
python sample-metadata.py --sample-size 1 --from 2016-01-01T00:00:00.000Z --to 2016-03-01T00:00:00.000Z
result
├── DRYAD
│   ├── Dryad_Metadata_Application_Profile_Version_3.1
│   │   └── xml
│   │       └── 00000-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00000-sysmeta.xml
├── GOA
│   ├── EML
│   │   └── xml
│   │       └── 00008-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00008-sysmeta.xml
├── KNB
│   ├── EML
│   │   └── xml
│   │       └── 00004-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00004-sysmeta.xml
├── LTER
│   ├── EML
│   │   └── xml
│   │       └── 00001-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00001-sysmeta.xml
├── LTER_EUROPE
│   ├── EML
│   │   └── xml
│   │       └── 00003-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00003-sysmeta.xml
├── NRDC
│   ├── Geographic_MetaData_(GMD)_Extensible_Markup_Language
│   │   └── xml
│   │       └── 00005-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00005-sysmeta.xml
├── PPBIO
│   ├── EML
│   │   └── xml
│   │       └── 00007-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00007-sysmeta.xml
├── TERN
│   ├── EML
│   │   └── xml
│   │       └── 00002-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00002-sysmeta.xml
├── TFRI
│   ├── EML
│   │   └── xml
│   │       └── 00006-metadata.xml
│   └── sysmeta
│       └── xml
│           └── 00006-sysmeta.xml
├── documents.csv
├── sampled_documents.csv
└── statistics.csv
To get multiple random samples, as you want to do, you'd need to run this command once for each sample, copying the result folder somewhere else each time, like...
python sample-metadata.py ...
cp -r result result-period1
python sample-metadata.py ...
cp -r result result-period2
python sample-metadata.py ...
cp -r result result-period3
python sample-metadata.py ...
cp -r result result-period4
python sample-metadata.py ...
cp -r result result-period5
It's not super efficient but it should work.
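If running those steps by hand gets tedious, the run-then-copy loop could be scripted. Here is a minimal sketch, not part of the sampler itself: the function names (build_command, sample_periods) and the result-periodN naming are just assumptions that mirror the commands above, and it assumes the sampler always writes to a folder called result in the current directory. The only switches it uses are the ones shown earlier (--sample-size, --from, --to).

```python
import shutil
import subprocess
from pathlib import Path


def build_command(start, end, sample_size=1, script="sample-metadata.py"):
    """Build the sampler command line for one time period."""
    return [
        "python", script,
        "--sample-size", str(sample_size),
        "--from", start,
        "--to", end,
    ]


def sample_periods(periods, sample_size=1, result_dir="result", run=subprocess.run):
    """Run the sampler once per (start, end) period and copy each result aside.

    `run` is injectable so the loop can be exercised without the real script.
    Returns the list of folders the results were copied to.
    """
    saved = []
    for index, (start, end) in enumerate(periods, start=1):
        # Stop immediately if the sampler exits with an error.
        run(build_command(start, end, sample_size), check=True)
        target = Path(f"{result_dir}-period{index}")
        # Replace any copy left over from an earlier batch.
        if target.exists():
            shutil.rmtree(target)
        shutil.copytree(result_dir, target)
        saved.append(target)
    return saved
```

Calling sample_periods with a list of (from, to) datetime pairs, in the same format as the --from and --to values above, should leave one result-periodN folder per pair next to the original result folder.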
@scgordon got in touch by email and had a great idea that shouldn't take much time at all to implement. What if we could use the sampler to ask the question: "How has average metadata quality within a repository changed over time?" For example, is LTER producing higher-quality metadata, on average, than it used to?
Modifications to make:
[x] Add a command line switch or set of switches so a begin and end datetime can be specified, e.g.,
python2 sample-metadata.py --from 20120101 --to 20130101
[x] Modify the result directory so it looks like this when the above switches are set, e.g.,
Note that this approach will produce y random samples, each independent of one another, within each time period y. Thus, we can only ask how metadata quality changes, on average, over time. In a future revision of this, we could find a way to look at how the documents we see today have changed over time, which is likely to be a much more interesting question. This would involve following obsoletion/obsoletesBy chains over time.
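To give a rough idea of what following a chain could look like, here is a small sketch. It assumes the obsoletion links have already been pulled out of the system metadata into a plain dict mapping each identifier to the identifier of the version that replaced it. The name follow_chain and that dict layout are hypothetical; nothing like this exists in the sampler yet.

```python
def follow_chain(identifier, replaced_by):
    """Return the version chain starting at `identifier`, oldest first.

    `replaced_by` maps an identifier to the identifier of the document
    that obsoleted it. Identifiers with no entry are the newest version.
    """
    chain = [identifier]
    seen = {identifier}
    while True:
        newer = replaced_by.get(chain[-1])
        # Stop at the end of the chain, or if the links loop back on themselves.
        if newer is None or newer in seen:
            break
        chain.append(newer)
        seen.add(newer)
    return chain
```

Walking the chain for each sampled document would then give every version of that document in order, which is what we'd need to compare quality across revisions of the same document rather than across independent samples.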