Closed manasaV3 closed 1 year ago
In the interest of consistency, we should set this up as an ETL workflow that can be triggered by an SQS message. This will allow us to run this in different environments without any dependencies on the local environment.
Recommended structure for the SQS message body:
{
"type": "category",
"version": "EDAM-BIOIMAGING:alpha06",
"s3_file_path": "/category/edam_bioimaging"
}
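A minimal sketch of how a consumer might parse and validate a message body of this shape. The names `SeedRequest` and `parse_message_body` are hypothetical and not part of the existing codebase; how the body is pulled off the queue is left out.

```python
import json
from dataclasses import dataclass

# Fields the recommended message body is expected to carry.
REQUIRED_FIELDS = ("type", "version", "s3_file_path")


@dataclass(frozen=True)
class SeedRequest:
    """Hypothetical container for one parsed SQS message body."""
    type: str
    version: str
    s3_file_path: str


def parse_message_body(body: str) -> SeedRequest:
    """Parse the JSON body and fail loudly if a required field is absent."""
    payload = json.loads(body)
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return SeedRequest(**{field: payload[field] for field in REQUIRED_FIELDS})
```

Validating up front keeps a malformed message from reaching the S3 read or the table write.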
Sync'd with @manasaV3 and we've decided to change the range key to be EDAM-BIOIMAGING:alpha06:<hash>, where <hash> is a checksum of the contents of the category array item. This is required because some categories may have items that contain the same label. For example, Scanning electron cryomicroscopy has two items with the label Electron microscopy:
https://api.napari-hub.org/categories/Scanning%20electron%20cryomicroscopy
[
{
"dimension": "Image modality",
"hierarchy": [
"Electron microscopy",
"Cryo electron microscopy",
"Scanning electron cryomicroscopy"
],
"label": "Electron microscopy"
},
{
"dimension": "Image modality",
"hierarchy": [
"Electron microscopy",
"Scanning electron microscopy",
"Scanning electron cryomicroscopy"
],
"label": "Electron microscopy"
}
]
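A rough sketch of how such a range key could be derived so that the two items above get distinct keys. The serialization (JSON with sorted keys and compact separators) and the function name `range_key` are assumptions; the issue only says the hash is a checksum of the item's contents, so the digests produced here are not guaranteed to match those of the real script.

```python
import hashlib
import json


def range_key(version: str, item: dict) -> str:
    """Build '<version>:<hash>' from a category array item.

    Assumption: the item is serialized as JSON with sorted keys so the
    digest does not depend on key order. The real script may hash the
    strings differently.
    """
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"


items = [
    {
        "dimension": "Image modality",
        "hierarchy": ["Electron microscopy", "Cryo electron microscopy",
                      "Scanning electron cryomicroscopy"],
        "label": "Electron microscopy",
    },
    {
        "dimension": "Image modality",
        "hierarchy": ["Electron microscopy", "Scanning electron microscopy",
                      "Scanning electron cryomicroscopy"],
        "label": "Electron microscopy",
    },
]
keys = [range_key("EDAM-BIOIMAGING:alpha06", item) for item in items]
```

Because the hierarchy is part of the hashed content, the two items share a label but end up with different range keys.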
To seed the table with data, read the S3 file for the category and batch write its contents to the table.
The key from the JSON should be converted to lowercase and mapped to the name field. This helps make the category endpoint case insensitive. Example: "Ablation" -> "ablation"
The version_hash field should be set to EDAM-BIOIMAGING:alpha06:<hash>, where <hash> is an MD5 hash calculated from the strings in the category object. This lets us support cases where a single category has multiple records. For example, the category Scanning electron cryomicroscopy has two entries with the same dimension (Image modality) and label (Electron microscopy); the only thing that differs is their hierarchy. The version_hash values for these items will be EDAM-BIOIMAGING:alpha06:8a970cd15928e947a86359f0559ef8fe and EDAM-BIOIMAGING:alpha06:61a42786c40c9d4bce561ff73ec00e08.
The version field should be set to EDAM-BIOIMAGING:alpha06.
The formatted_name field can be used for storing the key as is.
The last_updated_timestamp field can be generated from round(time.time()*1000).
All the other fields can be mapped as is.
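The field mapping above could be sketched roughly as follows. Several things here are assumptions rather than confirmed details: that the S3 file is a JSON object mapping each category key to a list of items, that the MD5 is taken over a sorted-key JSON serialization of each item (so the digests need not match the ones quoted above), and that the table is a DynamoDB table passed in as a boto3 Table resource. The function names `build_records` and `write_records` are hypothetical.

```python
import hashlib
import json
import time


def build_records(categories: dict, version: str) -> list:
    """Turn {category key: [items]} into one record per item."""
    now_ms = round(time.time() * 1000)
    records = []
    for formatted_name, items in categories.items():
        for item in items:
            # Assumed serialization; the real script may hash differently.
            canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
            digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
            records.append({
                **item,  # all other fields are mapped as is
                "name": formatted_name.lower(),
                "formatted_name": formatted_name,
                "version": version,
                "version_hash": f"{version}:{digest}",
                "last_updated_timestamp": now_ms,
            })
    return records


def write_records(table, records: list) -> None:
    """Batch write records; `table` is assumed to be a boto3 DynamoDB Table."""
    with table.batch_writer() as batch:
        for record in records:
            batch.put_item(Item=record)
```

Keeping the record construction separate from the write makes the mapping testable without access to S3 or the table.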
We should document the script created here so it can be reused if we need to populate the data in the future. It could be saved in the /scripts folder inside the ETL workflow folder.