Seed category data into dynamo

manasaV3 commented 1 year ago

To seed the table with data, read the S3 file for categories and batch write its contents to the table.
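A rough sketch of that read-then-batch-write step, assuming boto3 is used: `s3_client` would be a `boto3.client("s3")` and `table` a `boto3.resource("dynamodb").Table(...)`. The function name `seed_table`, the `to_items` callback, and the assumption that the S3 file is a JSON object mapping each category name to a list of entries are illustrative, not taken from the codebase.

```python
import json


def seed_table(s3_client, table, bucket, key, to_items):
    """Read the category JSON from S3 and batch write items to DynamoDB.

    to_items(name, entries) is a hypothetical callback that turns one
    category key and its list of entries into DynamoDB items.
    """
    # Assumption: the file is a JSON object of {category name: [entries]}.
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    categories = json.loads(body)

    written = 0
    # batch_writer() buffers puts and sends them as batched writes.
    with table.batch_writer() as batch:
        for name, entries in categories.items():
            for item in to_items(name, entries):
                batch.put_item(Item=item)
                written += 1
    return written
```

Passing the client and table in as arguments keeps the function independent of the local environment and easy to exercise with stand-ins.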

Each key in the JSON should be converted to lowercase and mapped to the name field, which makes the category endpoint case insensitive. Example: "Ablation" -> "ablation"

The version_hash field should be set to EDAM-BIOIMAGING:alpha06:<hash>, where <hash> is an MD5 hash calculated from the strings in the category object. This will help us support cases where a single category would have multiple records.

For example, the category Scanning electron cryomicroscopy has the following entries, which share the same dimension (Image modality) and label (Electron microscopy); the only thing that differs is their hierarchy:

[
  {
    "dimension": "Image modality",
    "hierarchy": [
      "Electron microscopy",
      "Cryo electron microscopy",
      "Scanning electron cryomicroscopy"
    ],
    "label": "Electron microscopy"
  },
  {
    "dimension": "Image modality",
    "hierarchy": [
      "Electron microscopy",
      "Scanning electron microscopy",
      "Scanning electron cryomicroscopy"
    ],
    "label": "Electron microscopy"
  }
]

The version_hash values for these two items will be EDAM-BIOIMAGING:alpha06:8a970cd15928e947a86359f0559ef8fe and EDAM-BIOIMAGING:alpha06:61a42786c40c9d4bce561ff73ec00e08, respectively.
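One way the hash could be computed, as a sketch only: the issue does not say exactly which strings are hashed or in what order, so this assumes a canonical JSON serialization of the whole entry (sorted keys, compact separators). Because of that assumption it is not expected to reproduce the exact hashes quoted above; `compute_version_hash` is a hypothetical helper name.

```python
import hashlib
import json

VERSION = "EDAM-BIOIMAGING:alpha06"


def compute_version_hash(entry: dict, version: str = VERSION) -> str:
    # Assumption: hash a canonical JSON dump of the entry so the result
    # does not depend on key order. The real script may hash differently.
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"


entry_a = {
    "dimension": "Image modality",
    "hierarchy": [
        "Electron microscopy",
        "Cryo electron microscopy",
        "Scanning electron cryomicroscopy",
    ],
    "label": "Electron microscopy",
}
entry_b = {
    "dimension": "Image modality",
    "hierarchy": [
        "Electron microscopy",
        "Scanning electron microscopy",
        "Scanning electron cryomicroscopy",
    ],
    "label": "Electron microscopy",
}

# The two entries differ only in hierarchy, so they get distinct hashes.
hash_a = compute_version_hash(entry_a)
hash_b = compute_version_hash(entry_b)
```

Whatever serialization is chosen, it needs to be deterministic so that re-running the seed produces the same version_hash for an unchanged entry.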

The version field should be set to EDAM-BIOIMAGING:alpha06.

The formatted_name field can be used for storing the key as is.

The last_updated_timestamp field can be generated from round(time.time()*1000).

All the other fields can be mapped as is.
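Putting the field rules above together, a single record could be assembled roughly like this. `build_item` is a hypothetical helper, and the version_hash is passed in rather than computed here so that the sketch makes no assumption about how the hash is derived.

```python
import time

VERSION = "EDAM-BIOIMAGING:alpha06"


def build_item(key: str, entry: dict, version_hash: str) -> dict:
    # Start from the entry so dimension, hierarchy and label map as is.
    item = dict(entry)
    item.update(
        {
            "name": key.lower(),        # lowercase key for case-insensitive lookup
            "formatted_name": key,      # key stored as is
            "version": VERSION,
            "version_hash": version_hash,
            "last_updated_timestamp": round(time.time() * 1000),
        }
    )
    return item


item = build_item(
    "Ablation",
    {"dimension": "Workflow step", "hierarchy": ["Ablation"], "label": "Ablation"},
    f"{VERSION}:<hash>",  # placeholder, not a real hash
)
```

The entry passed in the usage example is made up for illustration; only the field names come from the issue.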

We should document the script created here so it can be reused if we need to populate the data again in the future. It could be saved inside the ETL workflow folder in the /scripts folder.

manasaV3 commented 1 year ago

In the interest of consistency, we should set this up as an ETL workflow that can be triggered by an SQS message. This will allow us to run this in different environments without any dependencies on the local environment.

Recommended structure for the SQS message body:

{
  "type": "category",
  "version": "EDAM-BIOIMAGING:alpha06",
  "s3_file_path": "/category/edam_bioimaging"
}
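A sketch of how a consumer might validate and dispatch a message with this body. It assumes the workflow runs as a Lambda with an SQS event source, where each record carries the message as a JSON string under "body"; `parse_message`, `handle`, and the `seed_category` callback are hypothetical names.

```python
import json

REQUIRED_FIELDS = ("type", "version", "s3_file_path")


def parse_message(body: str) -> dict:
    """Parse an SQS message body and check the expected fields exist."""
    message = json.loads(body)
    missing = [field for field in REQUIRED_FIELDS if field not in message]
    if missing:
        raise ValueError(f"SQS message is missing fields: {missing}")
    return message


def handle(event: dict, seed_category) -> int:
    """Dispatch category messages; returns how many were processed."""
    processed = 0
    for record in event.get("Records", []):
        message = parse_message(record["body"])
        # Only category messages are handled here; others are skipped.
        if message["type"] == "category":
            seed_category(message["version"], message["s3_file_path"])
            processed += 1
    return processed
```

Validating the body up front means a malformed message fails loudly instead of seeding the table with a missing version or path.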

codemonkey800 commented 1 year ago

Synced with @manasaV3 and we've decided to change the range key to EDAM-BIOIMAGING:alpha06:<hash>, where <hash> is a checksum of the contents of the category array item. This is required because some categories may have items that contain the same label. For example, Scanning electron cryomicroscopy has two items with the label Electron microscopy:

https://api.napari-hub.org/categories/Scanning%20electron%20cryomicroscopy

[
  {
    "dimension": "Image modality",
    "hierarchy": [
      "Electron microscopy",
      "Cryo electron microscopy",
      "Scanning electron cryomicroscopy"
    ],
    "label": "Electron microscopy"
  },
  {
    "dimension": "Image modality",
    "hierarchy": [
      "Electron microscopy",
      "Scanning electron microscopy",
      "Scanning electron cryomicroscopy"
    ],
    "label": "Electron microscopy"
  }
]

chanzuckerberg / napari-hub

Seed category data into dynamo #866