fathomnet / community-feedback

Implement scanner to load datasets hosted at NCEI #165

Closed hohonuuli closed 2 months ago

hohonuuli commented 2 months ago

The NOAA data hosting workflow (#88, #136) archives images on a staging server with directory listing enabled. We have deployed a service, bitfrost, to stage uploaded datasets at https://fathomnet.org/static/staging/. Once a dataset is staged, NOAA will in turn copy the new files to their staging server at https://oer.hpc.msstate.edu/FathomNet/staging/ and send FathomNet an email.

We have a few options for triggering registration of new datasets: one is to write a service that listens for the incoming emails; another is to scan the remote directory for new datasets. I'll implement the latter. The flow is:

  1. Periodically scrape the directory listing at https://oer.hpc.msstate.edu/FathomNet/staging/
  2. Scrape each subdirectory. If it contains a .csv file and a darwincore.json file, it is a candidate for registration in FathomNet
  3. Parse the darwincore.json file and check whether its datasetID already exists in FathomNet
  4. If it's a new datasetID, parse the .csv file, remap the image URLs, and register the dataset in FathomNet
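
Steps 1 and 2 could be sketched roughly as follows. This is an illustrative Python sketch only, not the actual scanner (which is written in Scala). It assumes the staging server returns an ordinary HTML directory index with one `<a href>` per entry; the names `LinkCollector`, `list_entries`, and `is_candidate` are hypothetical, and fetching the pages (for example with `urllib.request.urlopen`) is left out so the parsing logic stands alone.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in a directory-listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_entries(listing_html, base_url):
    """Return absolute URLs of the entries found beneath base_url."""
    parser = LinkCollector()
    parser.feed(listing_html)
    entries = []
    for href in parser.hrefs:
        # Skip column-sort links ("?C=N;O=D") and in-page anchors.
        if href.startswith(("?", "#")):
            continue
        url = urljoin(base_url, href)
        # Keep only links that stay below the listing itself; this drops
        # parent-directory links and links to other hosts.
        if url.startswith(base_url) and url != base_url:
            entries.append(url)
    return entries


def is_candidate(file_urls):
    """Step 2: a subdirectory qualifies if it holds a .csv and a darwincore.json."""
    names = [u.rstrip("/").rsplit("/", 1)[-1].lower() for u in file_urls]
    has_csv = any(n.endswith(".csv") for n in names)
    has_darwincore = "darwincore.json" in names
    return has_csv and has_darwincore
```

Under these assumptions, subdirectories are the entries whose URL ends in `/`; each one would be fetched and passed through `list_entries` again, and the resulting file URLs handed to `is_candidate`.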

hohonuuli commented 2 months ago

Added a new endpoint to support lookup of darwincore info by datasetID. Example URL: https://fathomnet.org/api/darwincore/query/datasetid/3DA36EB8-A650-F34C-B283-6A2DC89623BB
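
For step 3, a client-side check against this endpoint might look like the sketch below. It is illustrative Python, not the scanner's code. The URL pattern is taken from the example above; everything else is an assumption, in particular that an unknown datasetID produces an HTTP 404 or an empty JSON body (the thread does not say what the endpoint returns). The helper names `lookup_url` and `dataset_exists` are hypothetical.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import quote

# Prefix taken from the example URL in this thread.
DARWINCORE_QUERY = "https://fathomnet.org/api/darwincore/query/datasetid/"


def lookup_url(dataset_id):
    """Build the lookup URL for a datasetID, percent-encoding unsafe characters."""
    return DARWINCORE_QUERY + quote(dataset_id, safe="")


def dataset_exists(dataset_id, timeout=10.0):
    """Return True if the endpoint appears to hold a record for dataset_id.

    Assumption (not confirmed in the thread): an unknown datasetID yields
    HTTP 404 or an empty/null JSON body.
    """
    try:
        with urllib.request.urlopen(lookup_url(dataset_id), timeout=timeout) as resp:
            body = resp.read().decode("utf-8").strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # Any other HTTP error is unexpected; let the caller see it.
    return bool(json.loads(body)) if body else False
```

A scanner would register a dataset only when `dataset_exists(dataset_id)` is false for the datasetID read from darwincore.json.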

hohonuuli commented 2 months ago

See https://github.com/fathomnet/fathomnet-support/pull/6. The application code is in ArchiveScanner.scala. Usage is:

archive-scanner "https://oer.hpc.msstate.edu/FathomNet/staging/" -a <fathomnet apikey>

A dry run can be executed by including --dryrun:

archive-scanner "https://oer.hpc.msstate.edu/FathomNet/staging/" -a <fathomnet apikey> --dryrun