DataONEorg / hashstore

HashStore, a hash-based object store for DataONE data packages
Apache License 2.0

Annotation Design Discussion #56

Open doulikecookiedough opened 1 year ago

doulikecookiedough commented 1 year ago

Questions & Todo:

Initial Proposal to kickstart the conversation (the content below is not final, and will likely change):

---
title: HashStoreAnnotation Class
---
classDiagram
    direction RL
    class HashStoreAnnotation{
        +String Parent
        +List~Dict/KVP~ Children
        +setParent(string)
        +setChildren(List)
        +getContent()
        +setContent()
        +getChildrenTotal()
    }

Example/flow to store an annotation document:

hs_annotation = HashStoreAnnotation()

# Get and store parent content
# Get and store children content

# Get parent location
dataset_parent = sha256(pid + formatId + "parent")
# Create child list
dataset_children = [
    {0: sha256(pid + formatId + "0")},
    {1: sha256(pid + formatId + "1")},
    ...
]
hs_annotation.setParent(dataset_parent)
hs_annotation.setChildren(dataset_children)

# getContent() will format the document to be written based on the chosen format
hs_annotation_content = hs_annotation.getContent()

hashstore.store_metadata(pid, hs_annotation_content, formatId)
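
For illustration only, the class diagram and store flow above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the HashStore implementation: the JSON on-disk format, the `sha256_hex` helper, the string child keys, and the `format_id` value are all hypothetical, since the proposal leaves the document format open.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Return the hex SHA-256 digest of a UTF-8 string (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class HashStoreAnnotation:
    """Hypothetical container mirroring the proposed class diagram."""

    def __init__(self):
        self.parent = None
        self.children = []

    def setParent(self, parent: str) -> None:
        self.parent = parent

    def setChildren(self, children: list) -> None:
        self.children = list(children)

    def getChildrenTotal(self) -> int:
        return len(self.children)

    def getContent(self) -> str:
        # Assumption: JSON is the serialization; the thread does not fix a format.
        return json.dumps({"parent": self.parent, "children": self.children})

    def setContent(self, content: str) -> None:
        doc = json.loads(content)
        self.parent = doc["parent"]
        self.children = doc["children"]


# Placeholder identifiers, used only to exercise the sketch.
pid, format_id = "dou.mok.1", "hypothetical-format-id"

annotation = HashStoreAnnotation()
annotation.setParent(sha256_hex(pid + format_id + "parent"))
# String keys are used so the child list survives a JSON round trip unchanged.
annotation.setChildren(
    [{str(i): sha256_hex(pid + format_id + str(i))} for i in range(3)]
)

# Round trip: serialize, then rebuild a second instance from the content.
restored = HashStoreAnnotation()
restored.setContent(annotation.getContent())
```

In a real flow, the string returned by `getContent()` would be what gets handed to `store_metadata`.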

Example/flow to work with/retrieve an annotation document:

# Retrieve the mapping document
hs_annotation_stream = hashstore.retrieve_metadata(pid, formatId)
hs_annotation = HashStoreAnnotation.setContent(hs_annotation_stream)
hsa_parent = hs_annotation.parent
hsa_children = hs_annotation.children

# Iterate over the first 1000 table items
for i in range(0, 1000):
    rel_path = shard(hsa_children[i])
    location = "/hashstore/metadata/" + rel_path
    # ... Do what we will with each child element
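
The `shard(...)` call above is not defined in this thread. One way it might work is to split a hex digest into a few short directory tokens followed by the remainder. The sketch below assumes a depth of 3 and a width of 2, chosen purely for illustration, and the input string is a placeholder.

```python
import hashlib


def shard(digest: str, depth: int = 3, width: int = 2) -> str:
    """Split a hex digest into `depth` tokens of `width` characters,
    followed by the remaining characters, joined with "/".

    The depth and width defaults are assumptions for this sketch.
    """
    tokens = [digest[i * width:(i + 1) * width] for i in range(depth)]
    tokens.append(digest[depth * width:])
    return "/".join(tokens)


# Placeholder input; any hex digest works the same way.
digest = hashlib.sha256(b"example").hexdigest()
rel_path = shard(digest)
location = "/hashstore/metadata/" + rel_path
```

Sharding keeps any single directory from accumulating too many entries, which matters once a store holds a large number of annotation documents.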

mbjones commented 1 year ago

Thanks for getting this started, @doulikecookiedough. We should discuss, as I don't understand the parent/child nomenclature you mention above. To kick things off, I think it is useful to list examples of things we might want to store in an annotation store. Here are some expressed as triples:

It might be best to have these grouped as a single file graph for each object (serialized in JSON-LD or RDF formats), which would be indexed like other metadata. So in this case, we would have 3 annotation files, one each for EntityPID1, EntityPID2, and EMLPID4. In this example, annotations are stored in the annotation file for the subject PID. The DataONE indexer would then be responsible for indexing these annotation files anytime they change, and adding the annotations to the appropriate index systems (currently our SOLR object index, but we will likely need another for the entity/package mapping).
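
As a rough illustration of the "one annotation file per subject PID" grouping, here is a minimal Python sketch. The four triples are the ones shown in the Mermaid diagram later in this thread. The `group_by_subject` helper and the N-Triples-like line format are assumptions for the example, not a settled serialization.

```python
from collections import defaultdict

# Triples taken from the diagram later in this thread: (subject, predicate, object).
triples = [
    ("EntityPID1", "cito:documentedBy", "EMLPID4"),
    ("EntityPID1", "ore:aggregatedBy", "OREPID3"),
    ("EntityPID2", "prov:wasDerivedFrom", "EntityPID1"),
    ("EMLPID4", "dwt:subject", "adcad:Hydrology"),
]


def group_by_subject(triples):
    """Collect triples into one list of lines per subject PID (hypothetical helper)."""
    files = defaultdict(list)
    for subject, predicate, obj in triples:
        files[subject].append(f"<{subject}> <{predicate}> <{obj}> .")
    return dict(files)


annotation_files = group_by_subject(triples)
for subject, lines in annotation_files.items():
    print(subject, len(lines))
```

Grouping by subject yields three annotation files here (EntityPID1, EntityPID2, and EMLPID4), matching the count described above.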

doulikecookiedough commented 1 year ago

@mbjones Thank you for the prompt feedback! When I was reviewing how we would implement Annotations, I was specifically looking at how we could store the science-metadata.xml document found in the /metadata folder when downloading datasets. My understanding (or misunderstanding, now...) was that these would be the extremely large files we would have to break down into pieces that could still be retrieved/treated as a whole.

To clarify, using your example above:

If we're on the same page now, I'm wondering what role HashStore plays in the Annotation implementation...? It would appear that the calling app would still work with metadata in HashStore through the same Public API calls. If not, should we save further discussion for the backend dev meeting so Robyn/Rushi can also get additional context?

mbjones commented 1 year ago

The content of this annotation file for the subject PID, would contain 3 lines

Maybe, or maybe not. Depends on the format. If stored as JSON-LD, it would likely have more lines, and certainly a lot more if in RDF/XML format.
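
To make the format point concrete, the sketch below writes the same two EntityPID1 statements once as N-Triples-like lines and once as a hand-built JSON-LD-shaped document, then compares line counts. It uses only the standard `json` module rather than an RDF library, and the document layout is an assumption for illustration.

```python
import json

# The same two statements about EntityPID1, one per line.
ntriples = "\n".join([
    "<EntityPID1> <cito:documentedBy> <EMLPID4> .",
    "<EntityPID1> <ore:aggregatedBy> <OREPID3> .",
])

# A JSON-LD-shaped document built by hand (assumption: pretty-printed with indent=2).
jsonld = json.dumps(
    {
        "@context": {
            "cito": "http://purl.org/spar/cito/",
            "ore": "http://www.openarchives.org/ore/terms/",
        },
        "@id": "EntityPID1",
        "cito:documentedBy": {"@id": "EMLPID4"},
        "ore:aggregatedBy": {"@id": "OREPID3"},
    },
    indent=2,
)

ntriples_lines = len(ntriples.splitlines())
jsonld_lines = len(jsonld.splitlines())
print(ntriples_lines, jsonld_lines)
```

The pretty-printed JSON-LD form spans more lines than the triple list for the same statements, so a line count is not a stable measure of an annotation file's content.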

doulikecookiedough commented 1 year ago

Notes:

Background:

Design Goal:

Annotation General Design/Infrastructure Discussion:

Potential Issue to Ponder:

Next Step:

doulikecookiedough commented 1 year ago

Mermaid Diagram for review, summary and a few thoughts:

graph TD
    subgraph RDFTriple2
        direction BT
        S["EntityPID1"]
        P["ore:aggregatedBy"]
        O["OREPID3"]
    end
    subgraph RDFTriple1
        direction BT
        S1["EntityPID1"]
        P1["cito:documentedBy"]
        O1["EMLPID4"]
    end
    subgraph RDFTriple3
        direction BT
        S2["EntityPID2"]
        P2["prov:wasDerivedFrom"]
        O2["EntityPID1"]
    end
    %% subgraph RDFTriple4
    %%     direction BT
    %%     S3["EMLPID4"]
    %%     P3["dwt:subject"]
    %%     O3["adcad:Hydrology"]
    %% end
    ANNO-1 -. "ANNO-1 content" .-> RDFTriple2
    ANNO-2 -. "ANNO-2 content" .-> RDFTriple3
    subgraph Dataset
        %% C1["CSV-1"]
        %% C2["CSV-2"]
        %% C3["CSV-3"]
        %% C4["CSV-4"]
        %% C5["CSV-5"]
        %% C6["..."]
        %% C7["CSV-1000"]
    end
    O -. "Expresses that a data file is a member of a package" .-> Dataset
    subgraph hs["`**HashStore**`"]
        subgraph /objects
            direction RL
            OBJ-1
            OBJ-2
            OBJ-3
            OBJ-4
            OBJ-5
            OBJ-6
        end
        subgraph /metadata
            direction TB
            META-0
            META-1
            META-2
            %% META-3
            %% META-4
            %% META-5
            ANNO-N
            ANNO-0
            ANNO-1
            ANNO-2
            ANNO-3
            ANNO-4
            ANNO-5
        end
    end
    O1 -. "Expresses that a data file is documented by a metadata file" .-> META-2

Next Step:

doulikecookiedough commented 1 year ago

To Do:

doulikecookiedough commented 1 year ago

Updated Diagram (not final):

<dou.mok.rev.1> <obj:locatedAt> <hashstore:dou.mok.rev.1>
<dou.mok.rev.1> <prov:wasDerivedFrom> <dou.mok.1>
<dou.mok.1> <ore:aggregatedBy> <dm.dataset>
<dou.mok.1> <cito:documentedBy> <sysmeta:dou.mok.1>
<dou.mok.rev.1> <ore:aggregatedBy> <dm.dataset>
<dm.dataset> <cito:documentedBy> <sysmeta:dm.dataset>
<dou.mok.1> <cito:documentedBy> <anno:dou.mok.1>
<anno:dou.mok.1> <dwt:subject> <adcad:Hydrology>

flowchart TD
    ds((dm.dataset))
    dm1(dou.mok.1) -- ore:aggregatedBy --> ds
    dm2(dou.mok.rev.1) -- prov:wasDerivedFrom --> dm1
    dm2(dou.mok.rev.1) -- obj:locatedAt --> sha256(hashstore:dou.mok.rev.1)
    dm2 -- ore:aggregatedBy --> ds
    dm1 -- prov:wasDerivedFrom --> a1(anno:dou.mok.1)
    dm1 -- cito:documentedBy --> sm2(sysmeta:dou.mok.1)
    ds -- dwt:subject --> hy1(adcad:Hydrology)
    hy1 .-> ANNO-4
    a1 .-> ANNO-3
    dm1 .-> ANNO-2
    ds .-> ANNO-1
    ds .-> ANNO-0
    ds -- cito:documentedBy --> sm1(sysmeta:dm.dataset)
    sm1 .-> SYSMETA-1
    sm2 .-> SYSMETA-2
    sha256 .-> OBJ-1
    subgraph hs["HashStore"]
        subgraph /objects
            direction RL
            OBJ-1
            OBJ-2
            OBJ-3
            objdot("...")
        end
        subgraph /metadata
            direction RL
            SYSMETA-1
            ANNO-0
            ANNO-1
            ANNO-2
            ANNO-3
            ANNO-4
            SYSMETA-2
            metadot("...")
        end
    end
    classDef orange fill:#f96,stroke-width:3px;
    class ds orange
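
As an illustration of how a caller might use triples like the ones listed in this comment, the sketch below parses a few of them with a regular expression and looks up the members of `dm.dataset`. The `parse` and `members_of` helpers are hypothetical, and the regex handles only the simple `<s> <p> <o>` form used here, not full N-Triples.

```python
import re

# Matches the simple "<subject> <predicate> <object>" form used in this thread.
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>")

text = """\
<dou.mok.rev.1> <obj:locatedAt> <hashstore:dou.mok.rev.1>
<dou.mok.rev.1> <prov:wasDerivedFrom> <dou.mok.1>
<dou.mok.1> <ore:aggregatedBy> <dm.dataset>
<dou.mok.1> <cito:documentedBy> <sysmeta:dou.mok.1>
<dou.mok.rev.1> <ore:aggregatedBy> <dm.dataset>
"""


def parse(text):
    """Return a list of (subject, predicate, object) tuples."""
    return [match.groups() for match in TRIPLE.finditer(text)]


def members_of(package, triples):
    """List the subjects that declare ore:aggregatedBy for the given package."""
    return sorted(
        s for s, p, o in triples if p == "ore:aggregatedBy" and o == package
    )


triples = parse(text)
members = members_of("dm.dataset", triples)
print(members)
```

Because membership is stated on each member (`ore:aggregatedBy`), finding a package's contents this way means scanning every member's annotations unless an index is maintained.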

doulikecookiedough commented 1 year ago

Updated Diagram (continued...):

flowchart TD
    subgraph dmcr["dou.mok.rev.1"]
        direction RL
        dmpr1["dou.mok.rev.1 - prov:wasDerivedFrom - dou.mok.1"]
    end
    dmcr .-> ANNO-4
    subgraph dmc["dou.mok.1"]
        direction RL
        dmp1["dou.mok.1 - ore:aggregatedBy - datasetpkg"]
        dmp2["dou.mok.1 - rdf:type - hsfs:obj\n(metacat/objects/OBJ-1)"]
        dmp3["dou.mok.1 - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-2)"]
        dmp4["dou.mok.1 - hsfs:algo - 'SHA-256'"]
        dmp5["dou.mok.1 - hsfs:checksum - 'a1...f9'"]
    end
    dmc .-> ANNO-3
    subgraph ds["datasetpkg"]
        direction RL
        dsp2["datasetpkg - ore:aggregates - dou.mok.1"]
        dsp3["datasetpkg - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-1)"]
        dsp1["datasetpkg - dwt:subject - adcad:Hydrology"]
    end
    ds .-> ANNO-2
    subgraph hs["HashStore"]
        subgraph /objects
            direction RL
            o1["OBJ-1\n(SHA-256 Hash:\n datasetpkg')"]
            OBJ-2
            OBJ-3
            objdot("...")
        end
        subgraph /metadata
            direction RL
            sys1["SYSMETA-1\n(SHA-256 Hash:\n datasetpkg + format_id')"]
            ANNO-0
            ANNO-1
            ANNO-2
            ANNO-3
            ANNO-4
            sys2["SYSMETA-2\n(SHA-256 Hash:\n dou.mok.1 + format_id')"]
            metadot("...")
        end
    end
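
For illustration, the `hsfs:algo` / `hsfs:checksum` statements and the "SHA-256 Hash: pid + format_id" metadata addresses in the diagram above could be derived along these lines. This is a sketch under assumptions: the helper names, the sample bytes, and the `format_id` values are hypothetical, and the diagram does not settle how the identifier and format id are concatenated or encoded.

```python
import hashlib


def object_triples(pid: str, data: bytes, algorithm: str = "sha256"):
    """Build the algorithm and checksum statements for one object (hypothetical helper)."""
    checksum = hashlib.new(algorithm, data).hexdigest()
    # Map hashlib's algorithm name to the label used in the diagram.
    label = {"sha256": "SHA-256"}.get(algorithm, algorithm)
    return [
        (pid, "hsfs:algo", label),
        (pid, "hsfs:checksum", checksum),
    ]


def metadata_address(pid: str, format_id: str) -> str:
    """Assumption: the address is the SHA-256 of pid and format_id concatenated as UTF-8."""
    return hashlib.sha256((pid + format_id).encode("utf-8")).hexdigest()


# Placeholder content and format ids, used only to exercise the sketch.
triples = object_triples("dou.mok.1", b"placeholder object bytes")
sysmeta_address = metadata_address("dou.mok.1", "hypothetical-sysmeta-format")
anno_address = metadata_address("dou.mok.1", "hypothetical-annotation-format")
```

Deriving the address from both the identifier and the format id lets several metadata documents for the same PID coexist under `/metadata` without colliding.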

Design Challenges & Questions (cont):

doulikecookiedough commented 1 year ago

Proposed Semantic Data Info Package (SDIP) Diagram from Matt's sketch

flowchart TD
    H13["H13
        H13 type PACKAGE
        H13 contains H11
        H13 contains H12
        H13 contains H10"]
    subgraph ORE
        H12["H12
            H12 type ANNO"]
        H1["H1: ORE"]
        P1["P1: Sysmeta"]
        P1 --> H1
        H12 --> H1
    end
    subgraph EML
        H11["H11
            H11 type ANNO"]
        H2["H2: EML"]
        P2["P2: Sysmeta"]
        H11 --> H2
        P2 --> H2
    end
    H10["H10
        H10 type FOLDER
        H10 contains H6
        H10 contains H9"]
    H13 --> H12
    H13 --> H11
    H13 --> H10
    subgraph Blob1
        H6["H6
            H6 type ANNO
            H6 contains H3"]
        H3["H3: Data"]
        P3["P3: Sysmeta"]
        H6 --> H3
        P3 --> H3
    end    
    H10 --> H6
    H9["H9
        H9 type FOLDER
        H9 contains H7
        H9 contains H8"]
    H10 --> H9
    subgraph Blob2
        H7["H7
            H7 type ANNO
            H7 contains H4
            H4 type BLOB"]
        H4["H4: Data"]
        P4["P4: Sysmeta"]
        H7 --> H4
        P4 --> H4
    end
    H9 --> H7
    subgraph Blob3
        H8["H8
            H8 type ANNO
            H8 contains H5
            H5 type BLOB"]
        H5["H5: Data"]
        P5["P5: Sysmeta"]
        H8 --> H5
        P5 --> H5
    end
    H9 --> H8 
    classDef cyan fill:#7ff;
    class H13,H12,H11,H10,H6,H9,H7,H8 cyan
    classDef mage fill:#ff7ffe;
    class P1,P2,P3,P4,P5 mage
    classDef lime fill:#dfffda;
    class H1,H2 lime
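
One way to read the SDIP sketch is as a tree that can be walked from the package node down to its content objects. The Python sketch below encodes the nodes following the edges drawn in the diagram and collects the leaves under any node. The dictionary layout and the `leaves` helper are assumptions for illustration, not a proposed storage format.

```python
# Nodes and their "contains" lists, following the edges in the diagram above.
# Ids not present as keys (H1..H5) are treated as content leaves.
nodes = {
    "H13": {"type": "PACKAGE", "contains": ["H12", "H11", "H10"]},
    "H12": {"type": "ANNO", "contains": ["H1"]},
    "H11": {"type": "ANNO", "contains": ["H2"]},
    "H10": {"type": "FOLDER", "contains": ["H6", "H9"]},
    "H9": {"type": "FOLDER", "contains": ["H7", "H8"]},
    "H6": {"type": "ANNO", "contains": ["H3"]},
    "H7": {"type": "ANNO", "contains": ["H4"]},
    "H8": {"type": "ANNO", "contains": ["H5"]},
}


def leaves(node_id, nodes):
    """Return the content objects reachable from node_id, depth first."""
    node = nodes.get(node_id)
    if node is None:
        # Not an annotation, folder, or package node, so it is a content object.
        return [node_id]
    found = []
    for child in node["contains"]:
        found.extend(leaves(child, nodes))
    return found


package_contents = leaves("H13", nodes)
folder_contents = leaves("H9", nodes)
print(package_contents, folder_contents)
```

With this layout, changing the membership of folder H9 touches only H9's node, while the package node H13 is unaffected.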

doulikecookiedough commented 1 year ago

Some more thoughts on annotations & the impacts of a change on a large dataset

A package name/id change for a dataset with a million files

Dataset member updates in a package with a million files

doulikecookiedough commented 9 months ago

Closing the previous discussion/issue, "Annotation Design: N-Triple vs JSON-LD Discussion"; we will continue the discussion here as progress is made with the greater team on how to handle large packages.