**doulikecookiedough** opened this issue 1 year ago (status: Open)
Thanks for getting this started, @doulikecookiedough. We should discuss, as I don't understand the parent/child nomenclature you mention above. To kick things off, I think it is useful to list examples of things we might want to store in an annotation store. Here are some, expressed as triples:
<EntityPID1> <ore:aggregatedBy> <OREPID3>
# Expresses that a data file is a member of a package (primary use case, typically in an ORE doc)
<EntityPID1> <cito:documentedBy> <EMLPID4>
# Expresses that a data file is documented by a metadata file (typically in an ORE doc)
<EntityPID2> <prov:wasDerivedFrom> <EntityPID1>
# Expresses a provenance relationship
<EMLPID4> <dwt:subject> <adcad:Hydrology>
# Expresses that a metadata file has a subject of a particular discipline; many other classification-style annotations like this could be used
It might be best to have these grouped as a single file graph for each object (serialized in JSON-LD or RDF formats), which would be indexed like other metadata. So in this case, we would have 3 annotation files, one each for EntityPID1, EntityPID2, and EMLPID4. In this example, annotations are stored in the annotation file for the subject PID. The DataONE indexer would then be responsible for indexing these annotation files any time they change, and adding the annotations to the appropriate index systems (currently our SOLR object index, but we will likely need another for the entity/package mapping).
@mbjones Thank you for the prompt feedback! When I was reviewing how we would implement Annotations, I was specifically looking at how we could store the science-metadata.xml document found in the /metadata folder when downloading datasets. My understanding (or misunderstanding, now...) was that these types of files would be the extremely large files we would have to break down into pieces that could be retrieved/treated as a whole.
To clarify, using your example above: the annotation file for the subjectPID would be stored via store_metadata(subjectPID, formatId), and this annotation file for the subject PID would contain 3 lines (annotations), one for each of EntityPID1, EntityPID2, and EMLPID4.
If we're on the same page now, I'm wondering how HashStore plays a role in the Annotation implementation. It would appear that the calling app would still be working with metadata in HashStore through the same Public API calls. If not, should we save further discussion for the backend dev meeting, so Robyn/Rushi can also get additional context?
> The content of this annotation file for the subject PID would contain 3 lines

Maybe, or maybe not. It depends on the format. If stored as JSON-LD, it would likely have more lines, and certainly a lot more in RDF/XML format.
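To put a rough number on that, here is an illustrative sketch comparing line counts for the same two annotations; the @context URIs are my assumptions, not the project's chosen vocabulary:

```python
# Illustrative only: a hand-rolled JSON-LD-style document for EntityPID1's
# two annotations, compared against the same facts as N-Triples.
# The @context URIs below are assumed for the example.
import json

ntriples = (
    "<EntityPID1> <ore:aggregatedBy> <OREPID3> .\n"
    "<EntityPID1> <cito:documentedBy> <EMLPID4> .\n"
)

jsonld = {
    "@context": {
        "ore": "http://www.openarchives.org/ore/terms/",
        "cito": "http://purl.org/spar/cito/",
    },
    "@id": "EntityPID1",
    "ore:aggregatedBy": {"@id": "OREPID3"},
    "cito:documentedBy": {"@id": "EMLPID4"},
}

nt_lines = len(ntriples.strip().splitlines())
jsonld_lines = len(json.dumps(jsonld, indent=2).splitlines())
print(nt_lines, jsonld_lines)  # the JSON-LD form takes several times more lines
```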
Notes:
Background:
Design Goal:
Annotation General Design/Infrastructure Discussion:
Potential Issue to Ponder:
Next Step:
Mermaid Diagram for review, summary and a few thoughts:
PID/GUID's metadata document's location.

Perhaps, to address a potential issue of millions of duplicate files, we can set the expectation that in our annotation system, the RDF triple's "Object" value is a list (comma separated?) to be parsed, which would allow the indexer to create multiple relations for one subject without having to duplicate the annotation files.
<EntityPID1> <ore:aggregatedBy> <OREPID3, OREPID4, OREPID5>
vs.
graph TD
subgraph RDFTriple2
direction BT
S["EntityPID1"]
P["ore:aggregatedBy"]
O["OREPID3"]
end
subgraph RDFTriple1
direction BT
S1["EntityPID1"]
P1["cito:documentedBy"]
O1["EMLPID4"]
end
subgraph RDFTriple3
direction BT
S2["EntityPID2"]
P2["prov:wasDerivedFrom"]
O2["EntityPID1"]
end
%% subgraph RDFTriple4
%% direction BT
%% S3["EMLPID4"]
%% P3["dwt:subject"]
%% O3["adcad:Hydrology"]
%% end
ANNO-1 -. "ANNO-1 content" .-> RDFTriple2
ANNO-2 -. "ANNO-2 content" .-> RDFTriple3
subgraph Dataset
%% C1["CSV-1"]
%% C2["CSV-2"]
%% C3["CSV-3"]
%% C4["CSV-4"]
%% C5["CSV-5"]
%% C6["..."]
%% C7["CSV-1000"]
end
O -. "Expresses that a data file is a member of a package" .-> Dataset
subgraph hs["`**HashStore**`"]
subgraph /objects
direction RL
OBJ-1
OBJ-2
OBJ-3
OBJ-4
OBJ-5
OBJ-6
end
subgraph /metadata
direction TB
META-0
META-1
META-2
%% META-3
%% META-4
%% META-5
ANNO-N
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
ANNO-5
end
end
O1 -. "Expresses that a data file is documented by a metadata file" .-> META-2
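The comma-separated "Object" idea above could be expanded by the indexer roughly as follows (the parsing rules here are assumptions, not a final design):

```python
# Sketch: expand an RDF triple whose "Object" value is a comma-separated
# list into multiple relations for one subject, so the annotation file
# does not need to be duplicated per object.
def expand_triple(subject, predicate, obj):
    """Split a comma-separated object value into individual triples."""
    return [(subject, predicate, o.strip()) for o in obj.split(",")]

expanded = expand_triple("EntityPID1", "ore:aggregatedBy", "OREPID3, OREPID4, OREPID5")
for s, p, o in expanded:
    print(f"<{s}> <{p}> <{o}>")
# <EntityPID1> <ore:aggregatedBy> <OREPID3>
# <EntityPID1> <ore:aggregatedBy> <OREPID4>
# <EntityPID1> <ore:aggregatedBy> <OREPID5>
```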
Next Step:
To Do:
Updated Diagram (not final):
<dou.mok.rev.1> <obj:locatedAt> <hashstore:dou.mok.rev.1>
<dou.mok.rev.1> <prov:wasDerivedFrom> <dou.mok.1>
<dou.mok.1> <ore:aggregatedBy> <dm.dataset>
<dou.mok.1> <cito:documentedBy> <sysmeta:dou.mok.1>
<dou.mok.rev.1> <ore:aggregatedBy> <dm.dataset>
<dm.dataset> <cito:documentedBy> <sysmeta:dm.dataset>
<dou.mok.1> <cito:documentedBy> <anno:dou.mok.1>
<anno:dou.mok.1> <dwt:subject> <adcad:Hydrology>
flowchart TD
ds((dm.dataset))
dm1(dou.mok.1) -- ore:aggregatedBy --> ds
dm2(dou.mok.rev.1) -- prov:wasDerivedFrom --> dm1
dm2(dou.mok.rev.1) -- obj:locatedAt --> sha256(hashstore:dou.mok.rev.1)
dm2 -- ore:aggregatedBy --> ds
dm1 -- cito:documentedBy --> a1(anno:dou.mok.1)
dm1 -- cito:documentedBy --> sm2(sysmeta:dou.mok.1)
ds -- dwt:subject --> hy1(adcad:Hydrology)
hy1 -.-> ANNO-4
a1 -.-> ANNO-3
dm1 -.-> ANNO-2
ds -.-> ANNO-1
ds -.-> ANNO-0
ds -- cito:documentedBy --> sm1(sysmeta:dm.dataset)
sm1 -.-> SYSMETA-1
sm2 -.-> SYSMETA-2
sha256 -.-> OBJ-1
subgraph hs["HashStore"]
subgraph /objects
direction RL
OBJ-1
OBJ-2
OBJ-3
objdot("...")
end
subgraph /metadata
direction RL
SYSMETA-1
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
SYSMETA-2
metadot("...")
end
end
classDef orange fill:#f96,stroke-width:3px;
class ds orange
Updated Diagram (continued...):
flowchart TD
subgraph dmcr["dou.mok.rev.1"]
direction RL
dmpr1["dou.mok.rev.1 - prov:wasDerivedFrom - dou.mok.1"]
end
dmcr -.-> ANNO-4
subgraph dmc["dou.mok.1"]
direction RL
dmp1["dou.mok.1 - ore:aggregatedBy - datasetpkg"]
dmp2["dou.mok.1 - rdf:type - hsfs:obj\n(metacat/objects/OBJ-1)"]
dmp3["dou.mok.1 - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-2)"]
dmp4["dou.mok.1 - hsfs:algo - 'SHA-256'"]
dmp5["dou.mok.1 - hsfs:checksum - 'a1...f9'"]
end
dmc -.-> ANNO-3
subgraph ds["datasetpkg"]
direction RL
dsp2["datasetpkg - ore:aggregates - dou.mok.1"]
dsp3["datasetpkg - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-1)"]
dsp1["datasetpkg - dwt:subject - adcad:Hydrology"]
end
ds -.-> ANNO-2
subgraph hs["HashStore"]
subgraph /objects
direction RL
o1["OBJ-1\n(SHA-256 Hash:\n 'datasetpkg')"]
OBJ-2
OBJ-3
objdot("...")
end
subgraph /metadata
direction RL
sys1["SYSMETA-1\n(SHA-256 Hash:\n 'datasetpkg + format_id')"]
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
sys2["SYSMETA-2\n(SHA-256 Hash:\n 'dou.mok.1 + format_id')"]
metadot("...")
end
end
Design Challenges & Questions (cont):
Proposed Semantic Data Info Package (SDIP) Diagram from Matt's sketch
flowchart TD
H13["H13
H13 type PACKAGE
H13 contains H11
H13 contains H12
H13 contains H10"]
subgraph ORE
H12["H12
H12 type ANNO"]
H1["H1: ORE"]
P1["P1: Sysmeta"]
P1 --> H1
H12 --> H1
end
subgraph EML
H11["H11
H11 type ANNO"]
H2["H2: EML"]
P2["P2: Sysmeta"]
H11 --> H2
P2 --> H2
end
H10["H10
H10 type FOLDER
H10 contains H6
H10 contains H9"]
H13 --> H12
H13 --> H11
H13 --> H10
subgraph Blob1
H6["H6
H6 type ANNO
H6 contains H3"]
H3["H3: Data"]
P3["P3: Sysmeta"]
H6 --> H3
P3 --> H3
end
H10 --> H6
H9["H9
H9 type FOLDER
H9 contains H7
H9 contains H8"]
H10 --> H9
subgraph Blob2
H7["H7
H7 type ANNO
H7 contains H4
H4 type BLOB"]
H4["H4: Data"]
P4["P4: Sysmeta"]
H7 --> H4
P4 --> H4
end
H9 --> H7
subgraph Blob3
H8["H8
H8 type ANNO
H8 contains H5
H5 type BLOB"]
H5["H5: Data"]
P5["P5: Sysmeta"]
H8 --> H5
P5 --> H5
end
H9 --> H8
classDef cyan fill:#7ff;
class H13,H12,H11,H10,H6,H9,H7,H8 cyan
classDef mage fill:#ff7ffe;
class P1,P2,P3,P4,P5 mage
classDef lime fill:#dfffda;
class H1,H2 lime
Some more thoughts on annotations & the impacts of a change on a large dataset
- unique_annotation_tree_id
- unique_annotation_tree_id for the front-end
- A package name/id change for a dataset with a million files: the unique_annotation_tree_id hasn't technically changed at this point; if members referenced a specific package name rather than the unique_annotation_tree_id, we would instead have to re-index a million files to establish this new package...
- Dataset member updates in a package with a million files
Closing the previous discussion/issue for Annotation Design (N-Triple vs JSON-LD Discussion); discussion will continue here as progress is made with the greater team regarding how to handle large packages.
Questions & Todo:
- Where should annotations be stored: /hashstore/metadata? JSON-LD or EML?

Initial Proposal to kickstart the conversation (the content below is not final, and will likely change):
- A HashStore annotation is a mapping document that should consist of a single parent member and a list that represents the child members.
- An address in hashstore/metadata is formed by calculating the SHA-256 hex digest of a given pid and formatId.
- The parent's address in hashstore/metadata is derived from the pid, formatId and the string "parent". Ex. sha-256(pid + formatId + "parent")
- Each child's address in hashstore/metadata is derived from the pid, formatId and an (int) key. Ex. sha-256(pid + formatId + 0), where 0 is the first table in the dataset
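The proposed derivation could be sketched as below. This is a minimal sketch under stated assumptions: the pid/formatId values are made up, and stringifying the integer key before concatenation is my assumption, not a confirmed detail.

```python
# Sketch of the proposed addressing scheme (not final, per the proposal
# above): an address in hashstore/metadata is the SHA-256 hex digest of
# pid + formatId, optionally suffixed with "parent" or a child's (int) key.
import hashlib

def metadata_address(pid: str, format_id: str, suffix: str = "") -> str:
    """Return the hex digest used as the document's address."""
    return hashlib.sha256((pid + format_id + suffix).encode("utf-8")).hexdigest()

# Hypothetical pid and formatId, for illustration only
pid, format_id = "dou.mok.1", "example/annotation-format-id"
parent_addr = metadata_address(pid, format_id, "parent")
child_0_addr = metadata_address(pid, format_id, "0")  # 0 = first table in the dataset
print(parent_addr)
print(child_0_addr)
```

Because the digest is deterministic, any client holding the pid and formatId can locate the parent document and walk the child keys without a separate lookup table.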