azgs / azlibrary_database


Upload MineData #10

Closed aazaff closed 5 years ago

aazaff commented 5 years ago

I have a new bulk dataset of 18,222 files to add to azlibrary that I have scraped from https://minedata.azgs.arizona.edu/.

These do not come with ISO 19139 XML files; instead, I am building new azgs.json metadata files as I scrape.

A few things that I need to work out before I can hand these off to @NoisyFlowers:

  1. Does the bulk upload script work without ISO 19139 XML?
  2. How should I handle naming the directories and collection_id?

Everything else seems relatively straightforward to me.

NoisyFlowers commented 5 years ago

XML files are no longer required at top level; only the azgs.json file is required. It should be in the root directory.

The rest of the directory structure should be the same as documented. I haven't compared to the README for a while, but I don't think anything has changed from what is described there, other than the azgs.json file.

Since these are new collections, the collection_id will be assigned when they are uploaded.
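
For anyone preparing a batch like this, a quick pre-flight check along these lines may help. It is only a sketch of the one rule stated above (azgs.json must sit in the root of the collection directory and parse as JSON); the function name check_collection_dir is made up for illustration and is not part of the upload script.

```python
import json
from pathlib import Path


def check_collection_dir(path):
    """Return a list of problems found in a collection directory (empty if none).

    Hypothetical helper: it only checks that azgs.json exists in the root
    directory and is parseable JSON. It does not validate the metadata fields.
    """
    root = Path(path)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    metadata = root / "azgs.json"
    if not metadata.is_file():
        return ["azgs.json is missing from the root directory"]

    problems = []
    try:
        with metadata.open(encoding="utf-8") as fh:
            json.load(fh)
    except json.JSONDecodeError as exc:
        problems.append(f"azgs.json is not valid JSON: {exc}")
    return problems
```

Running this over each of the scraped directories before handing them off would catch a missing or malformed azgs.json early, without touching the database.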

aazaff commented 5 years ago

To clarify, that means I can name the master directory anything (for example, my_data_collection_directory), and the new add script will overwrite this directory name with the new collection_id?

Sweet!

NoisyFlowers commented 5 years ago

Yes, it should work this way.

Keep in mind that azgs_path is built from the archive option and the new collection_id. azgs_old_url is built from a hard-coded "http://repository.azgs.az.gov/uri_gin/azgs/dlio/" and the name of the master directory. So that might look weird.

aazaff commented 5 years ago

Here is an example of a minedata collection. I programmatically generated the azgs.json file by scraping the Drupal metadata. Please let me know if it works with your script!

1.zip

Note that the azgs_old_url does not follow the "http://repository.azgs.az.gov/uri_gin/azgs/dlio/" format that we used with the other bulk dataset.

NoisyFlowers commented 5 years ago

Close, but not quite.

Is there a reason every string value is in an array?

NoisyFlowers commented 5 years ago

BTW, I miswrote earlier. The AZGS Old link is fabricated by azlibCreate. Since you are not using that, this won't be a problem here.

aazaff commented 5 years ago

Ugh, I thought that might be a problem. It is something the tool I'm using to write out the JSON is doing automatically... I'm sure there's some setting or parameter I can change to fix it.

NoisyFlowers commented 5 years ago

What tool is it? I can take a look.

aazaff commented 5 years ago

It’s an R script using jsonlite: https://cran.r-project.org/web/packages/jsonlite/index.html

Don’t worry, I’m sure I can figure it out quickly.

NoisyFlowers commented 5 years ago

auto_unbox?

aazaff commented 5 years ago

Hah! You beat me to it, yup, already tested it and it fixed it.
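
For files that were already written with every value wrapped in an array, the same effect can be approximated after the fact. The sketch below is in Python rather than R and is not what jsonlite does internally; it simply replaces any one-element list of a scalar with the scalar. The field names in the sample are invented, and a field that is legitimately a one-item array would also be unboxed, so treat it as a rough repair tool only.

```python
import json


def unbox(value):
    """Recursively replace one-element lists of scalars with the scalar itself."""
    if isinstance(value, dict):
        return {key: unbox(item) for key, item in value.items()}
    if isinstance(value, list):
        # Only unbox when the single element is a scalar (not a dict or list).
        if len(value) == 1 and not isinstance(value[0], (dict, list)):
            return value[0]
        return [unbox(item) for item in value]
    return value


# Hypothetical metadata fragment showing the "every string is in an array" problem.
boxed = json.loads(
    '{"title": ["Example mine file"], "keywords": ["copper", "gold"], "private": [false]}'
)
print(json.dumps(unbox(boxed), indent=2))
```

Multi-element arrays such as the keywords list are left alone; only the single-value wrappers are removed.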

NoisyFlowers commented 5 years ago

Great! Show me the zip when you can and I'll try it out.

aazaff commented 5 years ago

Here are two zips as examples.

1.zip 31.zip

NoisyFlowers commented 5 years ago

It's blowing up on "AZGS Miscellaneous Minedata Collection" because that is new. I'll add this and try again.

But this brings up an interesting question: Do we want to create a new collection_group on the fly when a new string is encountered?

aazaff commented 5 years ago

Good question. My instinct is to say NO.

New collection_groups should be so rare that they can be added to the db manually, or at least separately from the main collections upload.

FYI, all of the collections in this new minedata set will have the same collection_group.
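
To make the "NO" concrete, here is a rough sketch of the behavior I have in mind: the upload refuses a collection whose collection_group is not already known, instead of creating it. This is Python for illustration only; the function name, the exception class, and the idea of passing in a set of known names are all assumptions, not how the upload script actually looks up groups.

```python
class UnknownCollectionGroupError(ValueError):
    """Raised when a collection names a collection_group that is not registered."""


def require_known_group(name, known_groups):
    """Return the group name if it is already registered, otherwise raise.

    known_groups is assumed to be a set of names loaded from the database
    before the bulk upload starts (hypothetical; the lookup is not shown).
    """
    if name not in known_groups:
        raise UnknownCollectionGroupError(
            f"collection_group {name!r} is not registered; add it to the db first"
        )
    return name


# Example: the new minedata group is added manually ahead of the bulk upload.
known = {"AZGS Miscellaneous Minedata Collection"}
require_known_group("AZGS Miscellaneous Minedata Collection", known)
```

Failing loudly here means a typo in one azgs.json cannot silently create a stray collection_group.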

aazaff commented 5 years ago

I forget... did those two sample ones work correctly once you added the new collection_group?

NoisyFlowers commented 5 years ago

Yes

On Fri, Feb 22, 2019 at 2:44 PM Andrew Zaffos notifications@github.com wrote:

I forget... did those two sample ones work correctly once you added the new collection_group?
