Closed: mbjones closed this issue 9 years ago
Made some good progress on this today in 52e1fdf4de81d2d2b3499738b7c9fb27e4d116ed. Next steps include:
Hi Bryce, Looking forward to working with you and thanks for getting a jump on this. Cheers, Lindsay
Lindsay Powers, Deputy Director of Earth Science, The HDF Group, +1.720.635.5740, lpowers@hdfgroup.org
On Jul 21, 2015, at 19:06, Bryce Mecum <notifications@github.com> wrote:
Made some good progress on this today in 52e1fdf (https://github.com/NCEAS/metadig/commit/52e1fdf4de81d2d2b3499738b7c9fb27e4d116ed). Next steps include:
Thanks Lindsay. I look forward to working with you as well!
More good progress today. The script now generates samples.
Major additions:
The design has changed to a make-like style, where program execution proceeds stepwise:
1) Get all documents of interest from the Solr index on the CN, producing documents.csv
2) Randomly sample the documents by MN, producing sampled_documents.csv
3) For each document in sampled_documents.csv, get the associated system and object metadata from the CN and save it to disk with the structure ./results/mn-identifier/object-identifier/{meta|object}.xml
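For illustration, step 2 (sampling by member node) could look roughly like the sketch below. It is not the code in sample-metadata.py: it uses only the standard library rather than pandas, and the function name `sample_by_node`, the column name `authoritativeMN`, the node identifiers, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_by_node(rows, sample_size, node_key="authoritativeMN", seed=None):
    """Return up to `sample_size` randomly chosen rows for each member node."""
    rng = random.Random(seed)

    # Group the document rows by the member node they belong to.
    by_node = defaultdict(list)
    for row in rows:
        by_node[row[node_key]].append(row)

    sampled = []
    for node in sorted(by_node):
        docs = by_node[node]
        # A node with fewer documents than sample_size contributes all of them.
        k = min(sample_size, len(docs))
        sampled.extend(rng.sample(docs, k))
    return sampled


# Hypothetical rows standing in for the contents of documents.csv.
rows = [
    {"identifier": "doc-1", "authoritativeMN": "urn:node:A"},
    {"identifier": "doc-2", "authoritativeMN": "urn:node:A"},
    {"identifier": "doc-3", "authoritativeMN": "urn:node:A"},
    {"identifier": "doc-4", "authoritativeMN": "urn:node:B"},
]

sampled = sample_by_node(rows, sample_size=2, seed=0)
# Two documents come from node A and the single document from node B.
```

Sampling per node rather than across the whole table keeps small member nodes from being swamped by large ones, which seems to be the point of sampling "by MN".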
Next steps:
The initial working release of this script is available at commit bbb0e1c007c6eda8a2040606a58e3de3192df5b3. Please take it for a test run.
git clone https://github.com/NCEAS/metadig
cd metadig
pip install pandas
python sample-metadata.py --test
I recommend running with the --test flag as there are substantially fewer documents and member nodes on the development CN. Even with that, the script will still take some time to run in test mode. The sampling results should end up in a subdirectory of this script's folder called results, and it should have the following file structure:
results/
    documents.csv            # All results from the Solr index on the CN
    sampled_documents.csv    # Table of all the sampled documents to be retrieved
    statistics.csv           # Table of documents per member node
    {node-identifier-1}/
        00001-meta.xml
        00001-object.xml
        ...
    {node-identifier-2}/
        ...
Where {node-identifier-X} refers to the member node identifiers available on the CN (production or development).
Please get in touch if you have questions or suggestions for improving the script. Thanks!
Found some more bugs with Matt this morning, so if you downloaded the script before 12:30 PST I'd recommend re-downloading or re-cloning it. The fixes are available at commit bbb0e1c007c6eda8a2040606a58e3de3192df5b3.
As of now, these two commands work:
python sample-metadata.py -s 5
python sample-metadata.py --test -s 5
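For anyone reading along, here is a rough sketch of how those two flags could be parsed with argparse. This is not taken from sample-metadata.py: the reading of -s as a per-node sample size, the destination name `sample_size`, and the default of `None` are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Sample metadata documents (illustrative sketch)")
    # -s: assumed here to be the number of documents to sample per member node.
    parser.add_argument("-s", dest="sample_size", type=int, default=None)
    # --test: run against the development CN instead of production.
    parser.add_argument("--test", action="store_true")
    return parser


# Equivalent to: python sample-metadata.py --test -s 5
args = build_parser().parse_args(["--test", "-s", "5"])
```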
Bryce et al.,
Great work - excited to start on our end. The directory structure we use looks like collection/dialect/xml where collection is a collection identifier or name, dialect is the metadata dialect (FGDC, EML, or ISO) and the xml directory is the place where the XML samples live. It would be great if we could match this structure as the samples are created. This is similar to the structure shown at https://geo-ide.noaa.gov/wiki/index.php?title=Category:ISO_Building_Blocks#Organizing_Tools_and_Files but collection = project... Would make it easier for us...
Thanks, Ted
Hi Ted, I'm happy to change the script for your needs.
I went ahead and updated the script (see 30d924eadb72dd0798add7228eb03989eb6ce920) to reflect my interpretation of your request. Hopefully I got it right. I ran the script against the live DataONE site as well as our test environment to make sure it worked, and it all looks good minus one caveat.
It would be nice to have clean names for the subdirectories for each metadata dialect, e.g. EML-2.1.0, but I didn't have the time just now to make the script do that. For now, it names each metadata dialect subfolder after the URI for that dialect, minus the special characters that have to be removed because the name is used in a file path.
So EML 2.1.0 files are in /emlecoinformatics.orgeml-2.1.0/xml instead of a more desirable /eml-2.1.0/xml. I'll work with Matt on fixing this up soon.
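To make the current naming concrete, here is a minimal sketch of that kind of stripping. The function name `dialect_dirname` and the exact set of characters kept are assumptions, and the input is assumed to be the EML 2.1.0 format URI; the script itself may do this differently.

```python
import re


def dialect_dirname(format_id):
    """Drop every character except letters, digits, '.', '-' and '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "", format_id)


# Assumed format URI for EML 2.1.0.
print(dialect_dirname("eml://ecoinformatics.org/eml-2.1.0"))
# emlecoinformatics.orgeml-2.1.0
```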
I worked on restructuring the dialect name formats (which now look like Ecological_Metadata_Language_version_2.1.0), and I did some QA checking on the node names (to work around the Solr bug described in #24). A few remaining bugs and corner cases are to be cleaned up on Monday, namely some wayward Dryad documents. Now, the directory structure looks like this:
result
├── CDL
│   ├── Content_Standard_for_Digital_Geospatial_Metadata_version_001-1998
│   │   └── xml
│   │       ├── 00071-metadata.xml
│   │       ├── 00072-metadata.xml
│   │       ├── 00073-metadata.xml
│   │       ├── 00074-metadata.xml
│   │       └── 00075-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00071-sysmeta.xml
│           ├── 00072-sysmeta.xml
│           ├── 00073-sysmeta.xml
│           ├── 00074-sysmeta.xml
│           └── 00075-sysmeta.xml
├── CLOEBIRD
│   ├── Ecological_Metadata_Language_version_2.1.0
│   │   └── xml
│   └── sysmeta
│       └── xml
│           └── 00005-sysmeta.xml
├── DRYAD
│   ├── Dryad_Metadata_Application_Profile_Version_3.1
│   │   └── xml
│   │       ├── 00021-metadata.xml
│   │       ├── 00022-metadata.xml
│   │       ├── 00023-metadata.xml
│   │       ├── 00024-metadata.xml
│   │       └── 00025-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00021-sysmeta.xml
│           ├── 00022-sysmeta.xml
│           ├── 00023-sysmeta.xml
│           ├── 00024-sysmeta.xml
│           └── 00025-sysmeta.xml
├── EDACGSTORE
│   ├── Content_Standard_for_Digital_Geospatial_Metadata_version_001-1998
│   │   └── xml
│   │       ├── 00061-metadata.xml
│   │       ├── 00062-metadata.xml
│   │       ├── 00063-metadata.xml
│   │       ├── 00064-metadata.xml
│   │       └── 00065-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00061-sysmeta.xml
│           ├── 00062-sysmeta.xml
│           ├── 00063-sysmeta.xml
│           ├── 00064-sysmeta.xml
│           └── 00065-sysmeta.xml
├── EDORA
│   ├── Oak_Ridge_National_Lab_Mercury_Metadata_version_1.0
│   │   └── xml
│   │       ├── 00051-metadata.xml
│   │       ├── 00052-metadata.xml
│   │       ├── 00053-metadata.xml
│   │       ├── 00054-metadata.xml
│   │       └── 00055-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00051-sysmeta.xml
│           ├── 00052-sysmeta.xml
│           ├── 00053-sysmeta.xml
│           ├── 00054-sysmeta.xml
│           └── 00055-sysmeta.xml
├── ESA
│   ├── Ecological_Metadata_Language_version_2.0.1
│   │   └── xml
│   │       ├── 00119-metadata.xml
│   │       └── 00122-metadata.xml
│   ├── Ecological_Metadata_Language_version_2.1.0
│   │   └── xml
│   │       ├── 00118-metadata.xml
│   │       └── 00120-metadata.xml
│   ├── Ecological_Metadata_Language_version_2.1.1
│   │   └── xml
│   │       └── 00121-metadata.xml
│   └── sysmeta
│       └── xml
│           ├── 00118-sysmeta.xml
│           ├── 00119-sysmeta.xml
│           ├── 00120-sysmeta.xml
│           ├── 00121-sysmeta.xml
│           └── 00122-sysmeta.xml
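As a rough illustration of how paths in this layout could be built, the sketch below turns a human-readable format name into a dialect folder name and assembles the file path. The helper names `clean_dialect_name` and `sample_path`, the input format-name string, and the use of forward-slash paths are assumptions for the example, not the script's actual code.

```python
import posixpath
import re


def clean_dialect_name(format_name):
    """Remove commas and turn runs of whitespace into underscores."""
    name = format_name.replace(",", "")
    return re.sub(r"\s+", "_", name.strip())


def sample_path(root, node, format_name, index, kind="metadata"):
    """Build root/node/<dialect or sysmeta>/xml/NNNNN-kind.xml."""
    # System metadata lives under a shared "sysmeta" folder per node;
    # science metadata lives under a folder named for its dialect.
    subdir = "sysmeta" if kind == "sysmeta" else clean_dialect_name(format_name)
    filename = "{:05d}-{}.xml".format(index, kind)
    # posixpath keeps the example output the same on every platform.
    return posixpath.join(root, node, subdir, "xml", filename)


# Assumed human-readable format name for EML 2.1.0.
fmt = "Ecological Metadata Language, version 2.1.0"
print(sample_path("result", "ESA", fmt, 118))
# result/ESA/Ecological_Metadata_Language_version_2.1.0/xml/00118-metadata.xml
print(sample_path("result", "ESA", fmt, 118, kind="sysmeta"))
# result/ESA/sysmeta/xml/00118-sysmeta.xml
```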
Produce a sample from DataONE metadata collections