NHMDenmark / DanSpecify

Important files regarding the Danish instance of the Specify database system for collections digitisation and management, plus a placeholder for issue tracking. Guidelines, manuals and other kinds of documentation will be gathered on the wiki.

Attachments metadata and synchronization between Specify and DaSSCo Storage Service #221

Closed kimstp closed 1 year ago

kimstp commented 1 year ago

What metadata, and how much of it, can be stored in Specify with each attachment? Can it be synchronized using the Specify API?

PipBrewer commented 1 year ago

This ticket also relates to a conversation and email (Pip Brewer to Fedor Steeman) on 18/01/2023 and to a later email sent on 26/04/2023, which supersedes the previous one.

FedorSteeman commented 1 year ago

From correspondence with @PipBrewer:

Fields needed for attachment records:
• Copyright owner
• Copyright licence
• Whether the attachment will be visible externally
• Type of exception (see data access policy)
• Person responsible for exception
• Details of exception – e.g., requested by for x reason, authorized by
• Exception review date
• Persistent identifier for media which is resolvable (link)
• Date media created
• Date media deleted
• Deletion reason
• File format (jpeg, tiff etc)
• Type/description of attachment (CT scan, specimen image etc)

More from @kimstp :

Requirements to meet Data Access Policy:
• Allow some fields in attachment records (attachment metadata) to be publicly viewable, even when the attachment itself is not.
• If an attachment is deleted, the attachment metadata should still be available and published.
• It should be possible to have many attachments to one specimen, with the metadata associated with each visible and searchable.

For all new attachment records that are created:
• If it is an image it should automatically be given a CC-BY licence unless it is associated with an exception in the metadata, or this is later changed manually.
• If it is a document, the attachment itself will not be visible externally.

Is it possible in Specify UI to have a dropdown list with possible licenses? This would be a way to restrict the set of licenses we use and is mainly relevant when someone wants to change a license.

FedorSteeman commented 1 year ago

After talking this through with @Sosannah we came to the following:

About half of the required metadata fields are already supported by Specify out of the box (see mapping table below). The other half could in principle be added as individual key/value records in the attachmentmetadata table, multiple entries of which can be associated with a single attachment record. In the following, these will be referred to as "extra metadata" records.

| Metadata field | Specify table | Specify field name | Notes | DarwinCore (Simple Multimedia extension) |
|---|---|---|---|---|
| Copyright owner | attachment | CopyrightHolder | | http://purl.org/dc/terms/rightsHolder |
| Copyright license | attachment | License | | http://purl.org/dc/terms/license |
| Whether the attachment will be visible externally | attachment | IsPublic | | |
| Persistent identifier for media which is resolvable (link) | attachment | AttachmentLocation | Only the terminal end of a "persistent" identifier, prefixed by the URL of the attachment server + collection name, e.g. https://specify-attachments.science.ku.dk/static/NHMD_Entomology/originals/sp66250692200096451259.att.png | http://purl.org/dc/terms/identifier |
| Date media created | attachment | FileCreatedDate | | http://purl.org/dc/terms/created |
| File format (jpeg, tiff etc) | attachment | MimeType | | http://purl.org/dc/terms/format |
| Type/description of attachment (CT scan, specimen image etc) | attachment | Remarks / Type | | http://purl.org/dc/terms/type |
| Type of exception (see data access policy) | | | | |
| Person responsible for exception | | | | |
| Details of exception – e.g., requested by for x reason, authorized by | | | how big must the field be? | |
| Exception review date | | | | |
| Date media deleted | | | | |
| Deletion reason | | | how big must the field be? | |
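
As a rough illustration of the mapping above, the sketch below models an attachment with some of its out-of-the-box fields plus a list of key/value "extra metadata" entries, and rebuilds the resolvable link from AttachmentLocation. The class and function names are invented for illustration and are not Specify's API; the URL layout is copied from the example in the table, and the review date is a made-up value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtraMetadata:
    """Hypothetical stand-in for one key/value row in attachmentmetadata."""
    name: str
    value: str


@dataclass
class Attachment:
    """A subset of the fields that the table maps onto the attachment table."""
    attachment_location: str
    copyright_holder: Optional[str] = None
    license: Optional[str] = None
    is_public: bool = False
    mime_type: Optional[str] = None
    extra: List[ExtraMetadata] = field(default_factory=list)

    def extra_value(self, name: str) -> Optional[str]:
        """Return the first extra metadata value stored under `name`, if any."""
        for entry in self.extra:
            if entry.name == name:
                return entry.value
        return None


def resolvable_link(server_url: str, collection: str, att: Attachment) -> str:
    """Prefix AttachmentLocation with the asset server URL and collection name."""
    base = server_url.rstrip("/")
    return f"{base}/static/{collection}/originals/{att.attachment_location}"


att = Attachment(
    attachment_location="sp66250692200096451259.att.png",
    license="CC-BY",
    extra=[ExtraMetadata("Exception review date", "2024-01-01")],  # made-up value
)
link = resolvable_link(
    "https://specify-attachments.science.ku.dk", "NHMD_Entomology", att
)
```

Keeping the extra fields as a list of name/value entries mirrors the idea that several entries can hang off one attachment record without changing the attachment table itself.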

Generation of extra metadata records

The question is how these extra metadata records can be created. We understood that the mass digitization pipeline can be adapted to create them, and in principle we could add the extra metadata fields to the Specify interface to enable manual editing. Beyond those two routes, we can't think of an automatic process for doing that.
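
To make the pipeline route more concrete, here is a minimal sketch of how a digitization step might turn the unsupported fields into key/value entries. It assumes the pipeline hands over a plain dict keyed by the field labels from the list above; that input shape, the function name and the "name"/"value" keys are assumptions, not something the pipeline or Specify is known to provide.

```python
from typing import Dict, List, Optional

# The six fields that have no column in the attachment table (from the mapping table).
EXTRA_FIELDS = [
    "Type of exception",
    "Person responsible for exception",
    "Details of exception",
    "Exception review date",
    "Date media deleted",
    "Deletion reason",
]


def extra_metadata_records(pipeline_record: Dict[str, Optional[str]]) -> List[Dict[str, str]]:
    """Build one key/value entry per extra field that actually has a value."""
    records = []
    for name in EXTRA_FIELDS:
        value = pipeline_record.get(name)
        if value:  # skip fields that are missing, None or empty
            records.append({"name": name, "value": str(value)})
    return records


example = extra_metadata_records(
    {
        "Copyright owner": "NHMD",  # ignored here: already covered by the attachment table
        "Type of exception": "Embargo",  # made-up value
        "Deletion reason": "",
    }
)
```

Fields that already have a home in the attachment table are ignored, so the same input record can feed both the main record and the extra entries.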

Publication

Although we are unsure how many of these metadata fields will be parsed and shown on sites like GBIF, in principle the information is published online in the DarwinCore Archive generated by Specify7 (and retrieved by GBIF). The DarwinCore (DwC) Archive is a zip file of csv files that can be read by both machines and humans. The images are published in the archive as links that are accessible through the web asset server. However, only fields that can be mapped to DarwinCore terms can be made public this way. The currently used DwC extension, Simple Multimedia, does not have terms for all of the required fields (see table above), but Audubon Media Description may have better options.
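
The effect of "only mapped fields get published" can be sketched as follows. This is not how Specify7 builds its archive; it is a small stand-alone illustration that writes a multimedia csv using the term mapping from the table, with the column header taken from the last segment of each term URI (an assumption about layout). Anything without a mapped term, such as IsPublic or the exception fields, simply never reaches the file.

```python
import csv
import io
from typing import Dict, List

# Specify field name -> Dublin Core term, as listed in the mapping table.
DWC_TERMS = {
    "CopyrightHolder": "http://purl.org/dc/terms/rightsHolder",
    "License": "http://purl.org/dc/terms/license",
    "AttachmentLocation": "http://purl.org/dc/terms/identifier",
    "FileCreatedDate": "http://purl.org/dc/terms/created",
    "MimeType": "http://purl.org/dc/terms/format",
    "Remarks": "http://purl.org/dc/terms/type",
}


def multimedia_csv(records: List[Dict[str, object]]) -> str:
    """Write only the mapped fields; unmapped keys in a record are dropped."""
    columns = list(DWC_TERMS)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header row: the local name of each term, e.g. "rightsHolder".
    writer.writerow([DWC_TERMS[c].rsplit("/", 1)[-1] for c in columns])
    for record in records:
        writer.writerow([record.get(c, "") for c in columns])
    return buffer.getvalue()


output = multimedia_csv(
    [
        {
            "CopyrightHolder": "NHMD",
            "License": "CC-BY",
            "IsPublic": True,  # no mapped term, so not published
            "Exception review date": "2024-01-01",  # made-up value, also dropped
        }
    ]
)
```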

"Deletion"

After discussing the matter of "deletion" with @PipBrewer, it turned out that what is meant is not actual deletion of attachments but rather a kind of modal "hiding", where the actual image would be replaced by a thumbnail. However, we cannot yet see how an image can be represented externally by a thumbnail and internally by the actual image. Perhaps the replacement for the web asset server can be adapted to provide that functionality. For instance, the replacement web asset server could be programmed to present the "deleted" image as a thumbnail to the outside, but as the actual image when requests originate internally.

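The internal/external idea could look roughly like the sketch below. It assumes that "internal" can be decided from the client's IP address and that the server knows whether an attachment is flagged as "deleted"; the network range, the function name and the return values are all placeholders, and nothing here reflects how the existing or replacement web asset server actually works.

```python
import ipaddress

# Placeholder range standing in for the institution's internal network.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def variant_to_serve(client_ip: str, is_deleted: bool) -> str:
    """Decide which rendition of an attachment a request should receive."""
    address = ipaddress.ip_address(client_ip)
    is_internal = any(address in network for network in INTERNAL_NETWORKS)
    if is_deleted and not is_internal:
        # Outside callers only ever see the stand-in thumbnail.
        return "thumbnail"
    return "original"


external_view = variant_to_serve("8.8.8.8", is_deleted=True)
internal_view = variant_to_serve("10.1.2.3", is_deleted=True)
```

In practice the internal check would more likely depend on authentication than on IP ranges, but the branching would stay the same.
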
PipBrewer commented 1 year ago

Following discussion with @FedorSteeman today:

  1. We need to test whether we can easily import image records with a mixture of information from the main attachment record and the extra attachment metadata records (key/value pairs).

  2. The publication information provided above is acknowledged. If the image metadata can't be accommodated by a publisher, I'm not concerned. If it can be accommodated, we should try to map it.

  3. There are 2 associated issues regarding attachments which should not be viewable on the web. For attachments which have been deleted in our storage system (web asset server), it sounds like we can still publish the attachment records (the metadata), but with a stand-in thumbnail. This is "tombstoning" the record and meets our needs. For attachments where we would like to embargo or not show the attachment itself, but would like to show the metadata/attachment record, Specify does not currently support this. This is something that should be requested by the Specify community on behalf of DiSSCo. In the meantime, the DaSSCo website could provide details on the number of records embargoed, for transparency purposes.
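
For the import test in point 1, one way to think about the "mixture" is as a single flat row that combines the main attachment record with its key/value entries. The sketch below shows that flattening on plain dicts; the "name"/"value" keys and the function name are assumptions for illustration, and it says nothing about what the Workbench can or cannot map.

```python
from typing import Dict, List


def flatten_attachment(main: Dict[str, object], extra: List[Dict[str, str]]) -> Dict[str, object]:
    """Merge key/value entries into a copy of the main attachment record."""
    row = dict(main)
    for entry in extra:
        key = entry["name"]
        if key in row:
            # A clash would silently overwrite a main field, so fail loudly instead.
            raise ValueError(f"duplicate field: {key}")
        row[key] = entry["value"]
    return row


row = flatten_attachment(
    {"License": "CC-BY", "MimeType": "image/png"},
    [{"name": "Exception review date", "value": "2024-01-01"}],  # made-up value
)
```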

I also discussed the two additional requirements needed. For all new attachment records that are created:
• If it is an image it should automatically be given a CC-BY licence unless it is associated with an exception in the metadata, or this is later changed manually.
• If it is a document, the attachment itself will not be visible externally.

  1. Fedor advised that when anyone creates a new attachment record we could always default the licence to CC-BY. We should be able to change this manually after the fact if necessary.

  2. For documents, there is no way to automatically mark new attachment records that are not to be pushed to the web. For published datasets, this selection is made manually during the mapping phase (i.e., only images, and only those selected for the web, are pushed).
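
The two requested defaults can be written down as a small rule, shown here only to pin down the intended behaviour. It assumes the MIME type is enough to tell images from documents and treats anything that is not an image as a document; the function name and the use of None for "leave for manual entry" are illustrative choices, and this is not existing Specify functionality.

```python
from typing import Dict, Optional


def defaults_for_new_attachment(mime_type: str, has_exception: bool = False) -> Dict[str, Optional[object]]:
    """Return default License and IsPublic values for a new attachment record."""
    defaults: Dict[str, Optional[object]] = {"License": None, "IsPublic": None}
    if mime_type.startswith("image/"):
        if not has_exception:
            defaults["License"] = "CC-BY"
    else:
        # Assumption: every non-image attachment counts as a document.
        defaults["IsPublic"] = False
    return defaults


image_defaults = defaults_for_new_attachment("image/tiff")
document_defaults = defaults_for_new_attachment("application/pdf")
```

Values left as None are the ones the discussion above expects to be set by hand.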

Would it be possible to get an idea of the timeline for A. when the additional metadata fields can be added to the attachment records and B. when we can test that we can do a workbench import using multiple attachment tables (e.g., including the extra metadata fields)? Adding the licence automatically can come after them.

PipBrewer commented 1 year ago

Follow on.

Fedor advised that you can't map attachment data in Workbench, nor attachment metadata. We also need to add the extra attachment fields to the data entry form.

So additional work required:

FedorSteeman commented 1 year ago

@Sosannah and I agreed that this issue is covered and therefore subsumed by #222 and will therefore be closed.