script to pull sample metadata

mbjones commented 9 years ago

Produce a sample from DataONE metadata collections

Which DataONE Member Nodes? All metadata from small nodes, and random sample from the larger nodes
Target: up to 250 random documents from each node
[x] Write script to pull random sample from nodes, with flag for most current versions

amoeba commented 9 years ago

Made some good progress on this today in 52e1fdf4de81d2d2b3499738b7c9fb27e4d116ed. Next steps include:

Add error handling and fall back to CN when MN can't be reached
Add command line arguments to allow flexibility, including specifying a single MN to sample from
Improve the routine that generates the subdirectory names. The current version doesn't handle forward slashes well.
Run the work in parallel. It can be parallelized but currently is not.
Add (more) error handling throughout most of the script
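
For the subdirectory naming item above, here is a minimal sketch of one possible approach, not what the script currently does. It percent-encodes the identifier so that forward slashes can never split a directory name, and the encoding can be reversed later. The function names and the example identifier are hypothetical.

```python
from urllib.parse import quote, unquote


def identifier_to_dirname(identifier):
    """Encode an object identifier so it is safe as a single directory name.

    Hypothetical helper: with safe="" every reserved character, including
    "/" and ":", is percent-encoded, so the result contains no path separator.
    """
    return quote(identifier, safe="")


def dirname_to_identifier(dirname):
    """Reverse identifier_to_dirname to recover the original identifier."""
    return unquote(dirname)


if __name__ == "__main__":
    # Made-up identifier, used only to show the effect on slashes
    print(identifier_to_dirname("doi:10.5063/AA/abc.1"))
```

The trade-off is that the resulting names are less readable than a hand-cleaned name, but no two distinct identifiers can collide.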

lpowers67 commented 9 years ago

Hi Bryce, Looking forward to working with you and thanks for getting a jump on this. Cheers, Lindsay

Lindsay Powers, Deputy Director of Earth Science, The HDF Group, +1.720.635.5740, lpowers@hdfgroup.org

amoeba commented 9 years ago

Thanks Lindsay. I look forward to working with you as well!

amoeba commented 9 years ago

More good progress today. The script now generates samples.

Major additions:

Command line arguments for specifying a single node to sample and a sample size
First-pass of documentation done

The design has changed to a make-like style, where program execution proceeds stepwise:

1) Get all documents of interest from the Solr index on the CN, producing documents.csv
2) Randomly sample the documents by MN, producing sampled_documents.csv
3) For each document in sampled_documents.csv, get the associated system and object metadata from the CN and save it to disk with the structure ./results/mn-identifier/object-identifier/{meta|object}.xml
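
Step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: it uses only the standard library rather than pandas, it assumes each row of documents.csv has been read into a dict, and it assumes the member node is stored under a key called "datasource". The function name is hypothetical.

```python
import random
from collections import defaultdict


def sample_by_node(rows, sample_size, seed=None):
    """Return up to `sample_size` randomly chosen rows for each member node.

    rows: iterable of dicts, each with a "datasource" key (assumed name)
    identifying the member node the document came from.
    """
    rng = random.Random(seed)

    # Group the documents by member node
    by_node = defaultdict(list)
    for row in rows:
        by_node[row["datasource"]].append(row)

    # Small nodes contribute all of their documents; larger nodes are sampled
    sampled = []
    for node in sorted(by_node):
        docs = by_node[node]
        sampled.extend(rng.sample(docs, min(sample_size, len(docs))))
    return sampled
```

Passing a fixed seed makes a sample reproducible, which helps when comparing runs.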

Next steps:

Make script use the dev API instead of stable
Do something about the case where identifier metadata request returns a 404. Is this a bug on my end?

amoeba commented 9 years ago

The initial working release of this script is available at commit bbb0e1c007c6eda8a2040606a58e3de3192df5b3. Please take it for a test run.

Key changes since last time:

Can be run against the development CN instead of production (command line switch --test)
Delay between making requests has been removed. All requests now happen in a serial fashion (one-by-one)
Greatly improved documentation (at the top of the script)
Numerous bug fixes

To run this script:

git clone https://github.com/NCEAS/metadig
cd metadig
pip install pandas
python sample-metadata.py --test

I recommend running with the --test flag as there are substantially fewer documents and member nodes on the development CN. Even with that, the script will still take some time to run in test mode. The sampling results should end up in a subdirectory of this script's folder called results and it should have the following file structure:

results/
    documents.csv # All results from the solr index on the CN
    sampled_documents.csv # Table of all the sampled documents to be retrieved
    statistics.csv # Table of documents per member node
    {node-identifier-1}/
        00001-meta.xml
        00001-object.xml
        ...
    {node-identifier-2}/
    ...

Where {node-identifier-X} refers to the member node identifiers available on the CN (production or development).

Please get in touch if you have questions or suggestions for improving the script. Thanks!

amoeba commented 9 years ago

Found some more bugs with Matt this morning, so if you downloaded the script before 12:30 PST, I'd recommend re-downloading or re-cloning. The fixes are available at commit bbb0e1c007c6eda8a2040606a58e3de3192df5b3

As of now, these two commands work:

python sample-metadata.py -s 5
python sample-metadata.py --test -s 5

tedhabermann commented 9 years ago

Bryce et al.,

Great work - excited to start on our end. The directory structure we use looks like collection/dialect/xml where collection is a collection identifier or name, dialect is the metadata dialect (FGDC, EML, or ISO) and the xml directory is the place where the XML samples live. It would be great if we could match this structure as the samples are created. This is similar to the structure shown at https://geo-ide.noaa.gov/wiki/index.php?title=Category:ISO_Building_Blocks#Organizing_Tools_and_Files but collection = project... Would make it easier for us...

Thanks, Ted

amoeba commented 9 years ago

Hi Ted, I'm happy to change the script for your needs.

I went ahead and updated the script (see 30d924eadb72dd0798add7228eb03989eb6ce920) to reflect my interpretation of your request. Hopefully I got it right. I ran the script against the live DataONE site as well as our test environment to make sure it worked, and it all looks good minus one caveat.

It would be nice to have clean names for the subdirectories for each metadata dialect, e.g. EML-2.1.0, but I didn't have the time just now to make the script do that. For now, it names the metadata dialect subfolder after the dialect's URI, minus special characters that have to be removed because the name is used in a file path.

So EML 2.1.0 files are in /emlecoinformatics.orgeml-2.1.0/xml instead of a more desirable /eml-2.1.0/xml. I'll work with Matt on fixing this up soon.
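
To illustrate how a name like that comes about, here is a minimal sketch. It assumes the dialect URI for EML 2.1.0 is eml://ecoinformatics.org/eml-2.1.0 and that only colons and forward slashes are stripped; the exact set of characters the script removes may differ, and the function name is hypothetical.

```python
import re


def format_uri_to_dirname(format_uri):
    """Strip characters that are awkward in file paths from a dialect URI.

    Assumption: only ":" and "/" are removed. The real script may strip more.
    """
    return re.sub(r"[:/]", "", format_uri)


if __name__ == "__main__":
    print(format_uri_to_dirname("eml://ecoinformatics.org/eml-2.1.0"))
```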

mbjones commented 9 years ago

I worked on restructuring the dialect name formats (which now look like Ecological_Metadata_Language_version_2.1.0), and I did some QA checking on the node names (to work around the Solr bug described in #24). A few remaining bugs and corner cases will be cleaned up on Monday, namely some wayward Dryad documents. Now, the directory structure looks like this:

result
├── CDL
│   ├── Content_Standard_for_Digital_Geospatial_Metadata_version_001-1998
│   │   └── xml
│   │       ├── 00071-metadata.xml
│   │       ├── 00072-metadata.xml
│   │       ├── 00073-metadata.xml
│   │       ├── 00074-metadata.xml
│   │       └── 00075-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00071-sysmeta.xml
│           ├── 00072-sysmeta.xml
│           ├── 00073-sysmeta.xml
│           ├── 00074-sysmeta.xml
│           └── 00075-sysmeta.xml
├── CLOEBIRD
│   ├── Ecological_Metadata_Language_version_2.1.0
│   │   └── xml
│   └── sysmeta
│       └── xml
│           └── 00005-sysmeta.xml
├── DRYAD
│   ├── Dryad_Metadata_Application_Profile_Version_3.1
│   │   └── xml
│   │       ├── 00021-metadata.xml
│   │       ├── 00022-metadata.xml
│   │       ├── 00023-metadata.xml
│   │       ├── 00024-metadata.xml
│   │       └── 00025-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00021-sysmeta.xml
│           ├── 00022-sysmeta.xml
│           ├── 00023-sysmeta.xml
│           ├── 00024-sysmeta.xml
│           └── 00025-sysmeta.xml
├── EDACGSTORE
│   ├── Content_Standard_for_Digital_Geospatial_Metadata_version_001-1998
│   │   └── xml
│   │       ├── 00061-metadata.xml
│   │       ├── 00062-metadata.xml
│   │       ├── 00063-metadata.xml
│   │       ├── 00064-metadata.xml
│   │       └── 00065-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00061-sysmeta.xml
│           ├── 00062-sysmeta.xml
│           ├── 00063-sysmeta.xml
│           ├── 00064-sysmeta.xml
│           └── 00065-sysmeta.xml
├── EDORA
│   ├── Oak_Ridge_National_Lab_Mercury_Metadata_version_1.0
│   │   └── xml
│   │       ├── 00051-metadata.xml
│   │       ├── 00052-metadata.xml
│   │       ├── 00053-metadata.xml
│   │       ├── 00054-metadata.xml
│   │       └── 00055-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00051-sysmeta.xml
│           ├── 00052-sysmeta.xml
│           ├── 00053-sysmeta.xml
│           ├── 00054-sysmeta.xml
│           └── 00055-sysmeta.xml
├── ESA
│   ├── Ecological_Metadata_Language_version_2.0.1
│   │   └── xml
│   │       ├── 00119-metadata.xml
│   │       └── 00122-metadata.xml
│   ├── Ecological_Metadata_Language_version_2.1.0
│   │   └── xml
│   │       ├── 00118-metadata.xml
│   │       └── 00120-metadata.xml
│   ├── Ecological_Metadata_Language_version_2.1.1
│   │   └── xml
│   │       └── 00121-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00118-sysmeta.xml
│           ├── 00119-sysmeta.xml
│           ├── 00120-sysmeta.xml
│           ├── 00121-sysmeta.xml
│           └── 00122-sysmeta.xml
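
As a rough sketch of how paths in the tree above could be built: the code below assumes the dialect directory name is derived from a human-readable format name by dropping commas and joining words with underscores, and that files are numbered with five zero-padded digits. The input format name and both function names are assumptions for illustration, not taken from the script.

```python
from pathlib import PurePosixPath


def dialect_dirname(format_name):
    """Turn a format name into a directory name.

    Assumed rule: drop commas, then join the words with underscores.
    """
    return "_".join(format_name.replace(",", "").split())


def sample_paths(node, format_name, index, root="result"):
    """Return (metadata_path, sysmeta_path) for one sampled document."""
    base = PurePosixPath(root) / node
    stem = f"{index:05d}"
    metadata = base / dialect_dirname(format_name) / "xml" / f"{stem}-metadata.xml"
    sysmeta = base / "sysmeta" / "xml" / f"{stem}-sysmeta.xml"
    return metadata, sysmeta


if __name__ == "__main__":
    for path in sample_paths("ESA", "Ecological Metadata Language, version 2.1.1", 121):
        print(path)
```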
