Define sidecar file format

cgendreau commented 3 years ago

The harvestor will use a sidecar file, placed next to the file to upload, as the source of metadata for the object-store-api.

This ticket is to define the file format (yml?) and the content of the sidecar file that the harvestor will need to read.

dshorthouse commented 3 years ago

Some items/tasks:

[ ] Define a file format for the sidecar file (preference: key:value with .yml extension)
[ ] Define nesting of keys to accommodate 'core' metadata vs. managed attributes
[ ] Check for presence of valid 'core' metadata as well as key(s) within block of managed attributes
[ ] Create key in managed attributes upon ingestion if not already present, use the key name as the description
[ ] Do not ingest multimedia asset if corresponding .yml with same filename is not present
[ ] Do not ingest multimedia asset if sidecar file is malformed
[ ] When there is both a .jpg and a .cr2 (RAW) with the same filename, push the .cr2 in first, receive a pointer for it in the API response then push in the .jpg along with the sidecar metadata file & additionally associate the relationship between the .cr2 and the .jpg with the latter derived from the former. The metadata for the sidecar resides with the .jpg.
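A minimal sketch of the "no sidecar, no ingest" rule from the list above, assuming the sidecar shares the asset's base name and uses a .yml extension. The function names `find_sidecar` and `should_ingest` are hypothetical and not taken from the harvestor code.

```python
from pathlib import Path
from typing import Optional


def find_sidecar(asset: Path) -> Optional[Path]:
    """Return the .yml sidecar sitting next to `asset`, or None if absent."""
    candidate = asset.with_suffix(".yml")
    return candidate if candidate.is_file() else None


def should_ingest(asset: Path) -> bool:
    """An asset is only eligible for ingestion when its sidecar exists."""
    return find_sidecar(asset) is not None
```

This covers only the presence check; rejecting a malformed sidecar would need a separate parsing step.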

dshorthouse commented 3 years ago

@cgendreau Might help us along if you post a comment here with an example yaml file as a block of code showing core metadata elements w/ whatever key/nesting is necessary to express managed attributes

cgendreau commented 3 years ago

To upload a file you only need the name of the bucket, which is the group. I would say it doesn't really need to be in the sidecar. When you upload, the original filename is also sent in the multipart upload.

For the metadata the simplest form is something like:

acMetadataCreator: uuid of the agent
dcCreator: uuid of the agent
acDerivedFrom: uuid of the metadata (only if it's a derivative)

managedAttributes:
  uuid: value
  uuid: value

tags:
  - tag1
  - tag2
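As a rough illustration of how the shape above could be checked once a YAML parser has turned the sidecar into a dictionary: the sketch below assumes agent fields and managedAttributes keys must be UUID strings and that tags is meant to be a list. `sidecar_problems` and `AGENT_KEYS` are hypothetical names, and the rules are a guess at what "valid" means here, not the API's actual validation.

```python
import uuid
from typing import Any, Dict, List

# Fields from the example above that are expected to hold a UUID.
AGENT_KEYS = ("acMetadataCreator", "dcCreator", "acDerivedFrom")


def is_uuid(value: Any) -> bool:
    """True when `value` is a string that parses as a UUID."""
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def sidecar_problems(sidecar: Dict[str, Any]) -> List[str]:
    """Collect readable problems; an empty list means the shape looks fine."""
    problems = []
    for key in AGENT_KEYS:
        if key in sidecar and not is_uuid(sidecar[key]):
            problems.append(f"{key} is not a UUID")
    managed = sidecar.get("managedAttributes", {})
    if not isinstance(managed, dict):
        problems.append("managedAttributes is not a mapping")
    else:
        for key in managed:
            if not is_uuid(key):
                problems.append(f"managedAttributes key {key!r} is not a UUID")
    if not isinstance(sidecar.get("tags", []), list):
        problems.append("tags is not a list")
    return problems
```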

cgendreau commented 3 years ago

As you can see, it requires uuids everywhere. I'm fine with saying that they should remain stable, which means we should export them (managed-attributes and required agents) to prod once prod is ready.

dshorthouse commented 3 years ago

Can you please add the full breadth of all metadata keys, including licensing and the like for completeness-sake?

acDerivedFrom is a UUID of the asset or the metadata about the asset? [Though how this would be accomplished in-flight when related assets are ingested in tandem is a separate matter]

The UUIDs for managedAttributes are not human-readable keys but are internally-known UUIDs?

cgendreau commented 3 years ago

acDerivedFrom is the uuid of the original file's metadata record. It can't be in the sidecar since it's not yet available. But you need to track something that points to the "original" file, and the harvestor should then coordinate that.
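One way the coordination described above could look, as a sketch only: upload the original first, take the metadata uuid that comes back, and set it as acDerivedFrom on the derivative. The `upload` callable and its signature are hypothetical stand-ins for whatever client the harvestor uses, and sending empty metadata with the original is an assumption.

```python
from typing import Callable, Dict

# Hypothetical uploader: (filename, metadata) -> metadata uuid of the new record.
Uploader = Callable[[str, Dict[str, object]], str]


def upload_pair(original: str, derivative: str,
                sidecar: Dict[str, object], upload: Uploader) -> Dict[str, str]:
    """Upload the original first, then the derivative pointing back at it."""
    original_uuid = upload(original, {})
    derivative_metadata = dict(sidecar)  # copy so the caller's dict is not mutated
    derivative_metadata["acDerivedFrom"] = original_uuid
    derivative_uuid = upload(derivative, derivative_metadata)
    return {"original": original_uuid, "derivative": derivative_uuid}
```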

This list is quite up to date: https://dina-web.github.io/object-store-specs/#operation/addMetadata

dshorthouse commented 3 years ago

Are there any size limits for the values of a key? For example, I'd like to include an ocr key whose value will be a very large blob of text.

TODO: @cgendreau will increase size limit for a value to be TEXT rather than 256 characters. [Or, we get yet a new field type for some items that we know will be a very large value]

For a Pick list of values, do these values also have UUIDs or are they merely text? [Answer: no, but the request will be rejected if an expected value is not present in the Pick list]

cgendreau commented 3 years ago

Ticket to remove the limit created. We will still need to put a limit at some point to avoid wrong usage.

dshorthouse commented 3 years ago

Ticket to remove the limit created. We will still need to put a limit at some point to avoid wrong usage.

Thanks, @cgendreau. I'd be happy with a new value type to differentiate it from what already exists (& limited to 256 for useful reasons).

cgendreau commented 3 years ago

you mean another type than INTEGER, STRING ?

dshorthouse commented 3 years ago

Yeah, like TEXT.

Such that if I specified STRING as a type and then attempted to dump TEXT into it, that would/could cause an exception.

cgendreau commented 3 years ago

I'm fine with STRING. Just that there is always a limit. It might be 1 million chars but we will need a limit.
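To make the trade-off concrete, here is a small sketch of a per-type length check. The 256 figure comes from this thread; the TEXT type is only a proposal here, and its one-million ceiling is a placeholder based on the "might be 1 million chars" remark, not a known API limit. `MAX_LENGTH` and `check_value` are hypothetical names.

```python
# Assumed limits: 256 is mentioned in the thread; the TEXT ceiling is a placeholder.
MAX_LENGTH = {"STRING": 256, "TEXT": 1_000_000}


def check_value(value: str, value_type: str) -> None:
    """Raise ValueError when `value` exceeds the limit for its declared type."""
    limit = MAX_LENGTH[value_type]
    if len(value) > limit:
        raise ValueError(
            f"{value_type} value has {len(value)} characters, limit is {limit}"
        )
```

With two types, dumping an OCR blob into a STRING attribute fails early, while a single STRING type with a higher limit would accept it.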

kardecom commented 3 years ago

Question for @dshorthouse and @cgendreau

it's clear that managedAttributes is required in yml

What about the following:

acMetadataCreator: uuid of the agent
dcCreator: uuid of the agent
acDerivedFrom: uuid of the metadata (only if it's a derivative)
tags:

are any of the above attributes optional or they are all required?

cgendreau commented 3 years ago

required by? At the API level nothing is required since all the mandatory fields can be found elsewhere or have good default values.

kardecom commented 3 years ago

When the harvester parses the yml file, I can make them required, and it will fail if they are not there. If they are not required, it will not fail when none of them are present in the yml.
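The two behaviours described above could be expressed as one check with a configurable list of required keys. This is a sketch that works on an already-parsed sidecar dictionary; `require_keys` is a hypothetical helper, not the harvestor's parser.

```python
from typing import Dict, Iterable


def require_keys(sidecar: Dict[str, object], required: Iterable[str] = ()) -> None:
    """Raise ValueError naming every required key missing from the sidecar."""
    missing = [key for key in required if key not in sidecar]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
```

Passing an empty `required` list gives the lenient behaviour; passing the agent fields gives the strict one.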

dshorthouse commented 3 years ago

I guess the underlying question is what should be the values of those terms, especially acMetadataCreator and dcCreator. In the case of the harvester, is it the API agent, the human that made the sidecar file, the script agent that may have made the sidecar file, or other? The use of UUID here for an Agent suggests that these are all humans, known and already present in DINA. If the harvester were on client desktop machines, who/what is authenticating?

cgendreau commented 3 years ago

Agents are human, the script is authenticated as itself and doesn't have an agent.

cgendreau commented 3 years ago

acMetadataCreator is still the person who originally created the resource metadata record, and dcCreator is the person responsible for creating the media resource. In other words: who filled in the metadata and who took the picture, if the agent uuid is known. If not known, nothing.

dshorthouse commented 3 years ago

There will be many nested directories for the BioMob images, each leaf directory containing precisely the same complement of images with exactly the same name, the most relevant here being Image001.cr2 and Preview001.jpg. My intention was to have a single sidecar file in each of these leaf directories entitled, metadata.yml with the following structure. Note the original and derivative keys in this yaml to identify the cr2 and the jpg derivative. Is what you see below acceptable?

---
# Use UUID for Heather Cole
acMetadataCreator: d3681c90-80a3-43b7-8471-23ef718c3967
# Use UUID for Heather Cole
dcCreator: d3681c90-80a3-43b7-8471-23ef718c3967
acDigitizationDate: "2019-11-06T08:44:55"
original: Image001.cr2
derivative: Preview001.jpg
managedAttributes:
  # folderBarcode
  f7f1008e-3b33-44a0-b4ca-6255e09acc6c: "F-015357"
  # barcode
  c2693c3f-d09f-4b0a-89eb-06c53238b203: "01-01460692"
  # catalogNumber
  625b63a5-a3c5-4502-94d5-7be1a1c441ea:
  # country
  34f3d6a4-955d-4fc7-aa5f-3992f568d059: Canada
  # stateProvince
  cfeefd93-1f6c-404f-b87d-3f13c045a7da: "British Columbia, Quebec"
  # folderColour
  a1ea3e24-0151-49b1-94ec-67c2b910fefe: beige
  # filedAsName
  38fe5e72-8c48-43c3-a398-495a0f3a695c: Viburnum edule
  # ocr
  008dabaa-5d77-4601-8487-c05de07a5788: >
    C1 B1 A1 C2 B2 A2 BS AS 20 18 17 16 11
    Patch Reference numbers on UTT
    NUE
    DAO 01-01460692
    QUEEN CHARLOTTE ISLANDS
    Viburnum edule (Michx.) Raf.
    by
    te
    few shrubs up to 15 ft. in height
    NN
    twining among alders along margir of
    coniferous bordering upper part of beach.
    Fast side of Yakoun Lake near mouth
    of Baddeck Creek, Graham Island.
    No.
    36784 jJ.A.Calder Aug. 10,
    R.L. Taylor
    \
    1904
    Dept. of Agriculture, Ottawa, Canada
    10 09
    03 02 01 C7 B7 A7 C8 B8 A8 C9 BY
    the scale towards document
    ne
    SABE Lah oe en Oe
    te

heathercole commented 3 years ago

hi, this output is mysterious to assess without additional context, there seem to be several requirements addressed, but I need to understand a bit more about the example.

I think additional context related to the workflow/processing is also relevant for assessing this output. At the moment, since all the images have the same file-name, the identifier is at the folder level. Something connecting those may be needed; otherwise, how would the original "Image001.cr2" be located? I can find it manually, using the 'date taken' info, but perhaps this isn't needed, depending on how the workflow works.

From the data above, it is not clear to me what information is coming from the folder image, and what information is coming from the specimen image. It is vital that this is clear, as the data is relevant to different fields for the specimen record.

There is also an OCR issue, as the folder info is "BC-QCI", which represents "British Columbia-Queen Charlotte Islands" (not "British Columbia, Quebec"). In the pilot-project, Ariel presented some confidence values which we discussed may be relevant for identifying output which needs review. Early discussions with David also touched on a green/yellow/red approach for the OCR data. Perhaps a metric could be included in the output so it can be searched on to address issues (or a tool).

what does the "Catalog number" in the metadata above represent?

can there be a line that changes the folder colour value to the informative text (eg. beige = North America)

can the ocr text coming from the ruler/colour bar (in every image) be removed? (10 09 03 02 01 C7 B7 A7 C8 B8 A8 C9 BY the scale towards document ne SABE Lah oe en Oe te)

great: connection between specimen image to its folder

great: connection between specimen image and the folder data (eg. taxonomy/geography)

bonus: the additional OCR from the specimen image may certainly be relevant if it will be possible to query on the text (eg. collector name may not be captured in the database for a long time, so it would be awesome if this text can be effectively searched). If there won't be any effective way to search the text contained here, it should perhaps be discussed how its value can be effectively moved/migrated/accessed at some point. If that won't be possible, perhaps OCR for the specimen should only pull the barcode. I would hate to not harness this output, it has a lot of potential.

I am assuming that this ticket is not about the image file-formats, just this 'sidecar' file output.

Thanks for this example, it is great to see these pieces coming together.

heathercole commented 3 years ago

related to OCR quality, I don't think this first pass needs to be perfect, but there needs to be some way the output can be reviewed/corrected when errors are identified. Eg. find all "British Columbia, Quebec" and replace with "British Columbia-Queen Charlotte Islands"
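A sketch of the kind of bulk correction described above, applied to managedAttributes mappings that have already been loaded into memory. How the records are fetched and written back is left out, and `replace_attribute_value` is a hypothetical helper. The attribute key would be the managed attribute's UUID, as in the example metadata.yml.

```python
from typing import Dict, List


def replace_attribute_value(records: List[Dict[str, str]], attribute: str,
                            wrong: str, right: str) -> int:
    """Replace exact matches of `wrong` under `attribute`; return how many changed."""
    changed = 0
    for managed in records:
        if managed.get(attribute) == wrong:
            managed[attribute] = right
            changed += 1
    return changed
```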

I know for sure that the folder taxonomy is a vital piece of the 'filed as' requirement for DAO management, however, I will review at what level the folder geography information needs to be managed vs. the geography from the specimen. I will consult with Shannon and report back.

dshorthouse commented 3 years ago

@heathercole There will be one of these metadata files in each of the directories so file locations for original and derivative are relative to the file location of this example file, metadata.yml.
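For illustration, resolving the two file names relative to the metadata file might look like the sketch below. It assumes the sidecar has already been parsed into a dictionary with `original` and `derivative` keys; `resolve_assets` is a hypothetical name.

```python
from pathlib import Path
from typing import Dict, Tuple


def resolve_assets(metadata_file: Path, sidecar: Dict[str, str]) -> Tuple[Path, Path]:
    """Resolve `original` and `derivative` relative to the metadata.yml location."""
    folder = metadata_file.parent
    return folder / sidecar["original"], folder / sidecar["derivative"]
```

Because every leaf directory uses the same file names, the directory of metadata.yml is what distinguishes one specimen from another.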

Yep, "BC-QCI" was a crap, manual interpretation here on my part, not something done through the script. My bad.

catalogNumber could be lifted from the OCR, not implemented atm because it's rarely accurate.

Take-home here is that the above is what we're presently playing with re: structure of the metadata file. One hiccup is the use of acMetadataCreator and dcCreator, requiring an existing Agent in DINA. In the absence of this, am using you.

heathercole commented 3 years ago

ah, I misread the catalogNumber as having a string value, not as empty. It is not clear whether we are talking about the same data: in the case of DAO, the "catalog number" would be the barcode. There would be no reason for the OCR to capture the accession stamp number from the page in a structured way.

The file and directory structure you are describing is not clear to me, the gray boxes with the filenames do not have links/connections/paths that I can see.

From the Herbaria requirements, there is no need to import/migrate the .jpg for permanent storage, unless it is practical to not generate the jpg derivative again from the .cr2. As discussed, the system will need to give users the option for highest- or lower-resolution output in non-proprietary file formats.

it may be worth indicating somewhere that the images are from the conveyor belt and the data is from script (or that both are from "BioMob" vs. me). But I don't think there are any particular requirements here, except that if I am trying to query based on uploads, it would be much more useful to not have "me" as the source of all the images. However, that is maybe conversation for the import/migration of the images, and whether those creator fields are the same in the sidecar as the 'attributes' on import.

cgendreau commented 3 years ago

The Uploaded By is optional so we could also set no agent at all (since it's kind of that). The system will know it's coming from the service account used by the script. We will need to think about "bot" agent but it's not in place at the moment.

cgendreau commented 3 years ago

@dshorthouse could you elaborate on :

My intention was to have a single sidecar file in each of these leaf directories entitled, metadata.yml with the following structure.

So a leaf folder will have maximum 3 files? Image001.cr2, Preview001.jpg and metadata.yml ?

dshorthouse commented 3 years ago

A leaf folder may have other files within it (often a thumbnail jpg), but only one each of Image001.cr2, Preview001.jpg and metadata.yml, and these are named the same across all leaf folders.

kardecom commented 3 years ago

Hi @dshorthouse. I would like to run few use cases with you if it's ok

1). We have a leaf folder, lets say blah_001. Inside Image001.cr2, Preview001.jpg and metadata.yml. Questions about case 1). None

2). We have a leaf folder, lets say blah_002. Inside only 1 file Image002.cr2. Questions about case 2). Do we upload Image002.cr2 even with missing metadata.yml since it has never been any derivatives from it?

3). We have a leaf folder, lets say blah_003. Inside Image003.jpg (not cr2), Preview003.jpg and metadata.yml. Questions about case 3). Is it possible?

4). We have a leaf folder, lets say blah_004 with only 2 files in it. Image004.cr2, Preview004.jpg. Questions about case 4). Do we upload anyway and not building any relations or we ignore the leaf folder?

5). We have a leaf folder, let's say blah_005. Inside Image005.cr2 and metadata.yml. Questions about case 5). The Preview005.jpg physical file is missing, but it's in the metadata.yml as a derivative. What do we do? The same question vice versa for the original.

6). We have a leaf folder, lets say blah_006. Inside Image006.cr2, Preview006.jpg, thumbnail0061.jpg, thumbnail0062.jpg and metadata.yml. Questions about case 6). Do we ignore all thumbnail? What ever in the metadata.yml(original & derivative) is only we care about from api object store perspective. Is it correct?

7). in metadata.yml(original & derivative) can original be the same as derivative due to 'save as' the same file?

It should be it for now. Could you please provide the answers at your convenience?

Thank you in advance, Dmitri

dshorthouse commented 3 years ago

Hi @dshorthouse. I would like to run few use cases with you if it's ok

1). We have a leaf folder, lets say blah_001. Inside Image001.cr2, Preview001.jpg and metadata.yml. Questions about case 1). None

2). We have a leaf folder, lets say blah_002. Inside only 1 file Image002.cr2. Questions about case 2). Do we upload Image002.cr2 even with missing metadata.yml since it has never been any derivatives from it?

No upload of a cr2 in the absence of a metadata.yml file. This is also true of the Preview001.jpg.

3). We have a leaf folder, lets say blah_003. Inside Image003.jpg (not cr2), Preview003.jpg and metadata.yml. Questions about case 3). Is it possible?

Not in the particular use-case here. So, as in answer for (2).

4). We have a leaf folder, lets say blah_004 with only 2 files in it. Image004.cr2, Preview004.jpg. Questions about case 4). Do we upload anyway and not building any relations or we ignore the leaf folder?

If you mean no metadata.yml file, then no upload.

5). We have a leaf folder, let's say blah_005. Inside Image005.cr2 and metadata.yml. Questions about case 5). The Preview005.jpg physical file is missing, but it's in the metadata.yml as a derivative. What do we do? The same question vice versa for the original.

Not sure this is ever the case either. If no Preview001.jpg then no upload.

6). We have a leaf folder, lets say blah_006. Inside Image006.cr2, Preview006.jpg, thumbnail0061.jpg, thumbnail0062.jpg and metadata.yml. Questions about case 6). Do we ignore all thumbnail? What ever in the metadata.yml(original & derivative) is only we care about from api object store perspective. Is it correct?

Yes, ignore everything except the cr2, the preview jpg and the metadata.yml file.

7). in metadata.yml(original & derivative) can original be the same as derivative due to 'save as' the same file?

Should never be the case.

Of note here for clarity-sake, ALL the files will be named exactly Image001.cr2 and Preview001.jpg in every one of the leaf folders. Although they have "001" in their file names, these are meaningless.
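The answers above can be condensed into a single decision per leaf folder. The sketch below assumes the sidecar has already been parsed (or is None when metadata.yml is missing). Skipping the folder when the original is missing, or when original and derivative are the same name, is an interpretation of the answers rather than something stated outright. `files_to_upload` is a hypothetical name.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def files_to_upload(folder: Path,
                    sidecar: Optional[Dict[str, str]]) -> Optional[Tuple[Path, Path]]:
    """Return (original, derivative) to upload, or None to skip the leaf folder."""
    if sidecar is None:
        return None  # cases 2 and 4: no metadata.yml, no upload
    original_name = sidecar.get("original")
    derivative_name = sidecar.get("derivative")
    if not original_name or not derivative_name:
        return None  # sidecar does not name both files
    if original_name == derivative_name:
        return None  # case 7: should never happen, treated here as malformed
    original = folder / original_name
    derivative = folder / derivative_name
    if not (original.is_file() and derivative.is_file()):
        return None  # case 5: a listed file is missing on disk
    # Case 6: thumbnails and any other files in the folder are simply ignored.
    return original, derivative
```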

kardecom commented 3 years ago

Thank you. It's very clear. I know it's very early, but I would like to ask one more question. The same rules would apply for images on PC of scientist as well or it's only for Bio Cluster?

dshorthouse commented 3 years ago

@kardecom Thanks for asking this important question. My responses are for BioCluster for now. We've not yet had any discussion with all the users about these metadata.yml files; they are not the easiest things to construct because they require knowledge of the UUIDs for the keys in the key:values in that YAML structure.

kardecom commented 3 years ago

I see.

kardecom commented 3 years ago

Hi @dshorthouse @cgendreau I'm ready to test your test set on the cluster. Whenever you have time could you please confirm the location of the test set on the cluster? Thank you.

dshorthouse commented 3 years ago

@kardecom Super! Here's where I've dumped a nested directory, comparable to how these will be structured for the "real" thing: /home/AAFC-AAC/shorthoused/objectstore_migrator_test

kardecom commented 3 years ago

@dshorthouse Could you please place it under /tmp or another shared folder? I can't access your home directory. Linux permissions on each account do not allow other accounts to see what is in your folder.

dshorthouse commented 3 years ago

@kardecom I was worried about that. We'll also have to sort out permissions for access to the directories where the original assets reside, as that's also where I expect to have the sidecar files. I just now copied it into /tmp/objectstore_migrator_test in the interim.

kardecom commented 3 years ago

@dshorthouse if you would like we can do a quick meeting to define the strategy how to pass that test set for the demo. Please let me know if it works for you

kardecom commented 3 years ago

The existing workflow from the harvestor perspective works 100% with this sidecar file format. As this is a very solid sidecar candidate, I would recommend closing this issue.

AAFC-BICoE / object-store-harvestor

Define sidecar file format #14