NHMDenmark / DaSSCo-Integration

This Repo will include integration of dassco storage from northtec

Metadata structure differences NT/Ingestion/Integration #25

Open Baeist opened 9 months ago

Baeist commented 9 months ago

List of where we and NT differ in the metadata JSON, comparing our ndrive files / integration server model with their documentation.

Dassco Field NT Field Dassco Type NT Type
date_asset_taken date_asset_taken string date
asset_updated_by update_user string string
metadata_uploaded_by metadata_uploaded_by string list[events]
date_metadata_uploaded date_metadata_uploaded string list[events]
date_asset_finalised string date
date_asset_created date_asset_created string date
date_asset_deleted date_asset_deleted string date
date_asset_updated list[str]
date_metadata_created date_metadata_created string date
date_metadata_updated date_metadata_updated list[str] list[events]
file_format file_format str list[str enums]
metadata_created_by string
metadata_updated_by metadata_updated_by string list[events]
payload_type payload_type list[str] string
pushed_to_specify_date string
restricted_access restricted_access boolean list[str enums]
status status string string enum
specimens list of specimens

These are the fields where we use a string instead of a date object. I recommend that we fix this. It requires using "null" instead of an empty "" for these fields in the metadata from the ingestion client. NT creating events is for their internal bookkeeping and is totally fine. I believe all of these would create an event in their event list.

Dassco Field NT Field Dassco Type NT Type
date_asset_taken date_asset_taken string date
date_asset_created date_asset_created string date
date_asset_deleted date_asset_deleted string date
date_metadata_uploaded date_metadata_uploaded string list[events]
date_metadata_created date_metadata_created string date
date_metadata_updated date_metadata_updated list[str] list[events]
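
A minimal sketch of the string-to-date change discussed above, assuming the ingestion client writes ISO 8601 strings (the issue does not state the format). The helper names `parse_date` and `normalise_dates` are hypothetical; only the field names come from the table.

```python
from datetime import datetime
from typing import List, Optional

# Field names come from the table above. That the values are ISO 8601
# strings is an assumption, not something the issue confirms.
DATE_FIELDS = [
    "date_asset_taken",
    "date_asset_created",
    "date_asset_deleted",
    "date_metadata_uploaded",
    "date_metadata_created",
]
LIST_DATE_FIELDS = ["date_metadata_updated"]


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Map a missing or empty string to None, otherwise parse it as a datetime."""
    if value is None or value.strip() == "":
        return None
    return datetime.fromisoformat(value)


def parse_date_list(values) -> List[datetime]:
    """Parse a list of date strings, dropping empty entries."""
    if values is None or values == "":
        return []
    if isinstance(values, str):
        values = [values]
    parsed = (parse_date(v) for v in values)
    return [d for d in parsed if d is not None]


def normalise_dates(metadata: dict) -> dict:
    """Return a copy of the metadata with date fields converted from strings."""
    result = dict(metadata)
    for name in DATE_FIELDS:
        if name in result:
            result[name] = parse_date(result[name])
    for name in LIST_DATE_FIELDS:
        if name in result:
            result[name] = parse_date_list(result[name])
    return result
```

With this, an empty "" from the ingestion client becomes `None`, which serialises to JSON `null`, and a populated value becomes a `datetime` object.
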
This looks like an outdated NT field name for the first part. For the second part, we keep track of this, but then why does NT need a field for updated_by? And should our updated-by field not also be a list? We should also change to the date type in the list.

Dassco Field NT Field Dassco Type NT Type
asset_updated_by update_user string string
date_asset_updated list[str]

NT acknowledges this field in their documentation but has no field name for it. It is possible they don't need it. Are these the same? "Pushed to specify date" does not exist in NT documentation. These are dates, and we should use the correct type.

Dassco Field NT Field Dassco Type NT Type
date_asset_finalised string date
pushed_to_specify_date string

We are currently treating this as a single-entry field. We should update it to be a list of strings. There is a list of accepted names in the NT documentation. We need to change to using capitalized letters only. It would be more flexible for us if NT stopped making it an enum list.

Dassco Field NT Field Dassco Type NT Type
file_format file_format str list[str enums]
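
A rough sketch of how the single-entry `file_format` value could be turned into a list of capitalized strings. The function name `normalise_file_format` and the optional `allowed` parameter are hypothetical; the accepted names would have to come from the NT documentation and are not reproduced here.

```python
from typing import Iterable, List, Optional, Set, Union


def normalise_file_format(
    value: Union[str, Iterable[str], None],
    allowed: Optional[Set[str]] = None,
) -> List[str]:
    """Return file formats as a list of upper-case strings.

    `allowed` is a hypothetical stand-in for NT's list of accepted names;
    when it is None, no enum check is performed.
    """
    if value is None:
        return []
    # Accept both the current single string and the proposed list form.
    items = [value] if isinstance(value, str) else list(value)
    formats = [item.strip().upper() for item in items if item and item.strip()]
    if allowed is not None:
        unknown = [f for f in formats if f not in allowed]
        if unknown:
            raise ValueError(f"file_format values not accepted: {unknown}")
    return formats
```

Passing the current single-string value yields a one-element list, so existing metadata files could be converted without changing what they record.
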

Does this need to be a list? Not sure who has the truth here. Currently we are only noting one type, though.

Dassco Field NT Field Dassco Type NT Type
payload_type payload_type list[str] string

This needs to be looked at. NT documentation says it's a list of user types.

Dassco Field NT Field Dassco Type NT Type
restricted_access restricted_access boolean list[str enums]

This is not an issue. It contains the specimen data (multiple if it is a multispecimen) for each asset (barcode, specimen_pid, preparation_type, collection and institution).

Dassco Field NT Field Dassco Type NT Type
specimens list of specimens

We are currently not populating this field. NT does not have it. Is it necessary?

Dassco Field NT Field Dassco Type NT Type
metadata_created_by string
ThomasAlscher1991 commented 9 months ago

After thorough examination of the templates, @bhsi-snm, @Baeist and @ThomasAlscher1991 have come to the following conclusions:

  1. The groundtruth metadata template will be stored somewhere central and accessible. For now it is under https://github.com/NHMDenmark/DaSSCo-Image-Utils/tree/main/src/DaSSCoUtils
  2. The metadata template will consist of a .json file, a csv file (excel) and an accompanying metadata class in Python. The purpose of the class is to define allowed data types for certain fields that
  3. The following fields have conflicts that need to be resolved (Excel file here). For this we need some feedback from @PipBrewer.

| DaSSCo Field | Northtec Field | DaSSCo Type | Northtec Type | DaSSCo description | Northtec description | Conflict | Proposed change |
|---|---|---|---|---|---|---|---|
| asset_updated_by | update_user | string | string | The name of the Pipeline that updated the asset. This will be picked from the "pipeline_name", sent under update. | Username of the person that updated the asset | We assume both fields mean the same but have different names. | Since both fields track which entity updated the asset, let's change the definition to: "Entity that updated the asset". Also let's harmonize the names, so Northtec should change their name to ours. |
| pushed_to_specify_date | date_asset_finalised | string | date | Who syncs with specify? Asset Service or Pipeline? | Who syncs with specify? Asset Service or Pipeline? | We assume both fields mean the same but have different names. | Rename our field. |
| payload_type | payload_type | list of strings | string | What the asset represents (important for how it is processed and when linking to Specify) | What the asset represents (important for how it is processed and when linking to Specify) | Same names but different types | Since an asset is one (1) digitization of an object, any asset can only have one payload type. So we change our type to Northtec type. |
| restricted_access | restricted_access | boolean/array | list of strings (enumerated) | | | The problem here is that the use of this field is not clearly defined. | Change the type of this field to enumerated list of strings so we get more information why or to whom it is restricted. |

bhsi-snm commented 9 months ago

@PipBrewer We have tried to map metadata fields between NorthTech and our template. We have found the above fields pose the questions @ThomasAlscher1991 has mentioned. Can you please have a look at the above fields and help us by answering the following questions:

  1. asset_updated_by -- we couldn't find anything similar in the NorthTech Confluence documentation. The closest field that makes sense here is something called "update_user". Should we change our name or ask NorthTech to change theirs? In my mind, asset_updated_by conforms well to our nomenclature.
  2. Similarly, pushed_to_specify corresponds to date_asset_finalised in the NorthTech Confluence documentation. There is a relevant field called asset_locked, present in both our template and NorthTech's, which is a boolean (true/false) and tells whether an asset has been pushed to Specify.
  3. the field payload_type: should it be a string or a list?
  4. restricted access: should it be boolean or list - I think this is a broader discussion and will also be included in the discussions when consulting on Specify bridge with NorthTech.
PipBrewer commented 9 months ago

@ThomasAlscher1991 @bhsi-snm I think a face-to-face discussion initially would be easier to resolve these conflicts

PipBrewer commented 9 months ago
  1. asset_updated_by: absolutely, please feel free to change NT name to this.
  2. asset_locked is fulfilling the same role as date_asset_finalised, but the latter has more granularity, so keep the latter. pushed_to_specify is something different. An asset is locked or finalised when the pipeline is complete and no more changes should happen to the asset (ever). It is from this point that the asset is available to be pushed to Specify, but the actual date might be later. In addition, the metadata might change, and so the latest date that the metadata is synced with Specify may be much later than the date the asset was finalised.
  3. payload_type: this should be a list (e.g., a CT scan folder might include tiffs and a .txt file).
  4. restricted access: it is not clear exactly how this should work just yet (we need to chat). Let's go for a list of strings.
ThomasAlscher1991 commented 9 months ago
Baeist commented 9 months ago

We need to change the following fields in our metadata file:

| Field | Change from | Change to | Comment |
|---|---|---|---|
| barcode | string | list[str] | Each asset can have multiple specimens, each with its own barcode. |
| specimen_pid | string | list[Map<barcode, string>] | Specimen PIDs need to be mapped to each specimen's barcode for each asset. |
| preparation_type | string | list[str] | An asset can consist of multiple preparation types. They are not mapped, since it is for the whole asset, not the individual specimens. |

For NT it means that preparation_type in their specimen protocol needs to be updated to a list. The rest is fine for them.
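
As an illustration only, the field changes above could look like this in a Python metadata class. This is a sketch under assumptions: the class name `AssetMetadata`, the helper `upgrade_legacy` and the barcode consistency check are hypothetical, it covers only the fields discussed in this thread, and it is not the class in DaSSCo-Image-Utils.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssetMetadata:
    """Hypothetical subset of the metadata template, using the agreed types."""

    payload_type: List[str] = field(default_factory=list)
    restricted_access: List[str] = field(default_factory=list)
    barcode: List[str] = field(default_factory=list)
    # One {barcode: specimen_pid} mapping per specimen.
    specimen_pid: List[Dict[str, str]] = field(default_factory=list)
    # Applies to the whole asset, so it is not mapped to barcodes.
    preparation_type: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed consistency rule: every mapped barcode must be listed.
        known = set(self.barcode)
        for mapping in self.specimen_pid:
            for code in mapping:
                if code not in known:
                    raise ValueError(f"specimen_pid refers to unknown barcode: {code}")


def upgrade_legacy(old: dict) -> AssetMetadata:
    """Convert the old single-string fields into the new list-based form."""
    barcode = old.get("barcode") or ""
    pid = old.get("specimen_pid") or ""
    prep = old.get("preparation_type") or ""
    return AssetMetadata(
        barcode=[barcode] if barcode else [],
        specimen_pid=[{barcode: pid}] if barcode and pid else [],
        preparation_type=[prep] if prep else [],
    )
```

For example, `upgrade_legacy({"barcode": "B1", "specimen_pid": "P1", "preparation_type": "pinned"})` gives one barcode, one barcode-to-PID mapping and one preparation type; the values here are placeholders.
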

Baeist commented 8 months ago

@bhsi-snm updates