**doulikecookiedough** opened this issue 1 year ago (status: Open)
Thanks for getting this started, @doulikecookiedough. We should discuss, as I don't understand the parent/child nomenclature you mention above. To kick things off, I think it is useful to list examples of things we might want to store in an annotation store. Here are some, expressed as triples:
<EntityPID1> <ore:aggregatedBy> <OREPID3>
# Expresses that a data file is a member of a package (primary use case, typically in an ORE doc)
<EntityPID1> <cito:documentedBy> <EMLPID4>
# Expresses that a data file is documented by a metadata file (typically in an ORE doc)
<EntityPID2> <prov:wasDerivedFrom> <EntityPID1>
# Expresses a provenance relationship
<EMLPID4> <dwt:subject> <adcad:Hydrology>
# Expresses that a metadata file has a subject of a particular discipline; many other classification-style annotations like this could be used
It might be best to have these grouped as a single file graph for each object (serialized in JSON-LD or RDF formats), which would be indexed like other metadata. So in this case, we would have 3 annotation files, one each for EntityPID1, EntityPID2, and EMLPID4. In this example, annotations are stored in the annotation file for the subject PID. The DataONE indexer would then be responsible for indexing these annotation files any time they change, and adding the annotations to the appropriate index systems (currently our SOLR object index, but we will likely need another for the entity/package mapping).
@mbjones Thank you for the prompt feedback! When I was reviewing how we would implement Annotations, I was specifically looking at how we could store the science-metadata.xml document found in the /metadata folder when downloading datasets. My understanding (or misunderstanding, now...) was that these types of files would be the extremely large files we would have to break down into pieces that could be retrieved/treated as a whole.
To clarify, using your example above: the annotation file for the subjectPID would be stored via store_metadata(subjectPID, formatId), and this annotation file for the subject PID would contain 3 lines (annotations), one for each of EntityPID1, EntityPID2, and EMLPID4.
If we're on the same page now, I'm wondering how HashStore plays a role in the Annotation implementation. It would appear that the calling app would still be working with metadata in HashStore through the same Public API calls. If not, should we save further discussion for the backend dev meeting, so Robyn/Rushi can also get additional context?
> The content of this annotation file for the subject PID would contain 3 lines

Maybe, or maybe not. It depends on the format. If stored as JSON-LD, it would likely have more lines, and certainly a lot more in RDF/XML format.
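To put a rough number on that, here is an illustrative sketch comparing line counts for the same two annotations; the @context URIs are my assumptions, not the project's chosen vocabulary:

```python
# Illustrative only: a hand-rolled JSON-LD-style document for EntityPID1's
# two annotations, compared against the same facts as N-Triples.
# The @context URIs below are assumed for the example.
import json

ntriples = (
    "<EntityPID1> <ore:aggregatedBy> <OREPID3> .\n"
    "<EntityPID1> <cito:documentedBy> <EMLPID4> .\n"
)

jsonld = {
    "@context": {
        "ore": "http://www.openarchives.org/ore/terms/",
        "cito": "http://purl.org/spar/cito/",
    },
    "@id": "EntityPID1",
    "ore:aggregatedBy": {"@id": "OREPID3"},
    "cito:documentedBy": {"@id": "EMLPID4"},
}

nt_lines = len(ntriples.strip().splitlines())
jsonld_lines = len(json.dumps(jsonld, indent=2).splitlines())
print(nt_lines, jsonld_lines)  # the JSON-LD form takes several times more lines
```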
Notes:
Background:
Design Goal:
Annotation General Design/Infrastructure Discussion:
Potential Issue to Ponder:
Next Step:
Mermaid Diagram for review, summary and a few thoughts:
PID/GUID's metadata document's location.

Perhaps, to address a potential issue of millions of duplicate files, we can set the expectation that in our annotation system, the RDF triple's "Object" value is a list (comma separated?) to be parsed, which would allow the indexer to create multiple relations for one subject without having to duplicate the annotation files.
<EntityPID1> <ore:aggregatedBy> <OREPID3, OREPID4, OREPID5>
vs.
graph TD
subgraph RDFTriple2
direction BT
S["EntityPID1"]
P["ore:aggregatedBy"]
O["OREPID3"]
end
subgraph RDFTriple1
direction BT
S1["EntityPID1"]
P1["cito:documentedBy"]
O1["EMLPID4"]
end
subgraph RDFTriple3
direction BT
S2["EntityPID2"]
P2["prov:wasDerivedFrom"]
O2["EntityPID1"]
end
%% subgraph RDFTriple4
%% direction BT
%% S3["EMLPID4"]
%% P3["dwt:subject"]
%% O3["adcad:Hydrology"]
%% end
ANNO-1 -. "ANNO-1 content" .-> RDFTriple2
ANNO-2 -. "ANNO-2 content" .-> RDFTriple3
subgraph Dataset
%% C1["CSV-1"]
%% C2["CSV-2"]
%% C3["CSV-3"]
%% C4["CSV-4"]
%% C5["CSV-5"]
%% C6["..."]
%% C7["CSV-1000"]
end
O -. "Expresses that a data file is a member of a package" .-> Dataset
subgraph hs["`**HashStore**`"]
subgraph /objects
direction RL
OBJ-1
OBJ-2
OBJ-3
OBJ-4
OBJ-5
OBJ-6
end
subgraph /metadata
direction TB
META-0
META-1
META-2
%% META-3
%% META-4
%% META-5
ANNO-N
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
ANNO-5
end
end
O1 -. "Expresses that a data file is documented by a metadata file" .-> META-2
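The comma-separated "Object" idea above could be expanded by the indexer roughly as follows (the parsing rules here are assumptions, not a final design):

```python
# Sketch: expand an RDF triple whose "Object" value is a comma-separated
# list into multiple relations for one subject, so the annotation file
# does not need to be duplicated per object.
def expand_triple(subject, predicate, obj):
    """Split a comma-separated object value into individual triples."""
    return [(subject, predicate, o.strip()) for o in obj.split(",")]

expanded = expand_triple("EntityPID1", "ore:aggregatedBy", "OREPID3, OREPID4, OREPID5")
for s, p, o in expanded:
    print(f"<{s}> <{p}> <{o}>")
# <EntityPID1> <ore:aggregatedBy> <OREPID3>
# <EntityPID1> <ore:aggregatedBy> <OREPID4>
# <EntityPID1> <ore:aggregatedBy> <OREPID5>
```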
Next Step:
To Do:
Updated Diagram (not final):
<dou.mok.rev.1> <obj:locatedAt> <hashstore:dou.mok.rev.1>
<dou.mok.rev.1> <prov:wasDerivedFrom> <dou.mok.1>
<dou.mok.1> <ore:aggregatedBy> <dm.dataset>
<dou.mok.1> <cito:documentedBy> <sysmeta:dou.mok.1>
<dou.mok.rev.1> <ore:aggregatedBy> <dm.dataset>
<dm.dataset> <cito:documentedBy> <sysmeta:dm.dataset>
<dou.mok.1> <cito:documentedBy> <anno:dou.mok.1>
<anno:dou.mok.1> <dwt:subject> <adcad:Hydrology>
flowchart TD
ds((dm.dataset))
dm1(dou.mok.1) -- ore:aggregatedBy --> ds
dm2(dou.mok.rev.1) -- prov:wasDerivedFrom --> dm1
dm2(dou.mok.rev.1) -- obj:locatedAt --> sha256(hashstore:dou.mok.rev.1)
dm2 -- ore:aggregatedBy --> ds
dm1 -- cito:documentedBy --> a1(anno:dou.mok.1)
dm1 -- cito:documentedBy --> sm2(sysmeta:dou.mok.1)
ds -- dwt:subject --> hy1(adcad:Hydrology)
hy1 -.-> ANNO-4
a1 -.-> ANNO-3
dm1 -.-> ANNO-2
ds -.-> ANNO-1
ds -.-> ANNO-0
ds -- cito:documentedBy --> sm1(sysmeta:dm.dataset)
sm1 -.-> SYSMETA-1
sm2 -.-> SYSMETA-2
sha256 -.-> OBJ-1
subgraph hs["HashStore"]
subgraph /objects
direction RL
OBJ-1
OBJ-2
OBJ-3
objdot("...")
end
subgraph /metadata
direction RL
SYSMETA-1
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
SYSMETA-2
metadot("...")
end
end
classDef orange fill:#f96,stroke-width:3px;
class ds orange
Updated Diagram (continued...):
flowchart TD
subgraph dmcr["dou.mok.rev.1"]
direction RL
dmpr1["dou.mok.rev.1 - prov:wasDerivedFrom - dou.mok.1"]
end
dmcr -.-> ANNO-4
subgraph dmc["dou.mok.1"]
direction RL
dmp1["dou.mok.1 - ore:aggregatedBy - datasetpkg"]
dmp2["dou.mok.1 - rdf:type - hsfs:obj\n(metacat/objects/OBJ-1)"]
dmp3["dou.mok.1 - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-2)"]
dmp4["dou.mok.1 - hsfs:algo - 'SHA-256'"]
dmp5["dou.mok.1 - hsfs:checksum - 'a1...f9'"]
end
dmc -.-> ANNO-3
subgraph ds["datasetpkg"]
direction RL
dsp2["datasetpkg - ore:aggregates - dou.mok.1"]
dsp3["datasetpkg - cito:documentedBy - hsfs:sysmeta\n(metacat/metadata/SYSMETA-1)"]
dsp1["datasetpkg - dwt:subject - adcad:Hydrology"]
end
ds -.-> ANNO-2
subgraph hs["HashStore"]
subgraph /objects
direction RL
o1["OBJ-1\n(SHA-256 Hash:\n 'datasetpkg')"]
OBJ-2
OBJ-3
objdot("...")
end
subgraph /metadata
direction RL
sys1["SYSMETA-1\n(SHA-256 Hash:\n 'datasetpkg + format_id')"]
ANNO-0
ANNO-1
ANNO-2
ANNO-3
ANNO-4
sys2["SYSMETA-2\n(SHA-256 Hash:\n 'dou.mok.1 + format_id')"]
metadot("...")
end
end
Design Challenges & Questions (cont):
Proposed Semantic Data Info Package (SDIP) Diagram from Matt's sketch
flowchart TD
H13["H13
H13 type PACKAGE
H13 contains H11
H13 contains H12
H13 contains H10"]
subgraph ORE
H12["H12
H12 type ANNO"]
H1["H1: ORE"]
P1["P1: Sysmeta"]
P1 --> H1
H12 --> H1
end
subgraph EML
H11["H11
H11 type ANNO"]
H2["H2: EML"]
P2["P2: Sysmeta"]
H11 --> H2
P2 --> H2
end
H10["H10
H10 type FOLDER
H10 contains H6
H10 contains H9"]
H13 --> H12
H13 --> H11
H13 --> H10
subgraph Blob1
H6["H6
H6 type ANNO
H6 contains H3"]
H3["H3: Data"]
P3["P3: Sysmeta"]
H6 --> H3
P3 --> H3
end
H10 --> H6
H9["H9
H9 type FOLDER
H9 contains H7
H9 contains H8"]
H10 --> H9
subgraph Blob2
H7["H7
H7 type ANNO
H7 contains H4
H4 type BLOB"]
H4["H4: Data"]
P4["P4: Sysmeta"]
H7 --> H4
P4 --> H4
end
H9 --> H7
subgraph Blob3
H8["H8
H8 type ANNO
H8 contains H5
H5 type BLOB"]
H5["H5: Data"]
P5["P5: Sysmeta"]
H8 --> H5
P5 --> H5
end
H9 --> H8
classDef cyan fill:#7ff;
class H13,H12,H11,H10,H6,H9,H7,H8 cyan
classDef mage fill:#ff7ffe;
class P1,P2,P3,P4,P5 mage
classDef lime fill:#dfffda;
class H1,H2 lime
Some more thoughts on annotations & the impacts of a change on a large dataset
- unique_annotation_tree_id
- unique_annotation_tree_id for the front-end
- A package name/id change for a dataset with a million files: the unique_annotation_tree_id hasn't technically changed at this point; if members referenced a specific package name rather than the unique_annotation_tree_id, we would instead have to re-index a million files to establish this new package...
- Dataset member updates in a package with a million files
Closing the previous discussion/issue for Annotation Design (N-Triple vs JSON-LD Discussion); discussion will continue here as progress is made with the greater team regarding how to handle large packages.
Questions & Todo:
- Where should annotations be stored: /hashstore/metadata? JSON-LD or EML?

Initial Proposal to kickstart the conversation (the content below is not final, and will likely change):
- A HashStore annotation is a mapping document that should consist of a single parent member and a list that represents the child members.
- An address in hashstore/metadata is formed by calculating the SHA-256 hex digest of a given pid and formatId.
- The parent's address in hashstore/metadata is derived from the pid, formatId and the string "parent". Ex. sha-256(pid + formatId + "parent")
- Each child's address in hashstore/metadata is derived from the pid, formatId and an (int) key. Ex. sha-256(pid + formatId + 0), where 0 is the first table in the dataset
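The proposed derivation could be sketched as below. This is a minimal sketch under stated assumptions: the pid/formatId values are made up, and stringifying the integer key before concatenation is my assumption, not a confirmed detail.

```python
# Sketch of the proposed addressing scheme (not final, per the proposal
# above): an address in hashstore/metadata is the SHA-256 hex digest of
# pid + formatId, optionally suffixed with "parent" or a child's (int) key.
import hashlib

def metadata_address(pid: str, format_id: str, suffix: str = "") -> str:
    """Return the hex digest used as the document's address."""
    return hashlib.sha256((pid + format_id + suffix).encode("utf-8")).hexdigest()

# Hypothetical pid and formatId, for illustration only
pid, format_id = "dou.mok.1", "example/annotation-format-id"
parent_addr = metadata_address(pid, format_id, "parent")
child_0_addr = metadata_address(pid, format_id, "0")  # 0 = first table in the dataset
print(parent_addr)
print(child_0_addr)
```

Because the digest is deterministic, any client holding the pid and formatId can locate the parent document and walk the child keys without a separate lookup table.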