DataONEorg / dataone

DataONE information and general-purpose issue tracking

support indexing linked metadata from schema.org entries #10

Open mbjones opened 3 years ago

mbjones commented 3 years ago

When we harvest a data package from schema.org, we create a canonical copy of the schema.org JSON-LD and index that. If the SO entry contains a link to a more detailed metadata record, as proposed in the SOSO guidelines, then we should also index that content. Doing so means we need to resolve conflicts and questions of precedence (e.g., if the two metadata sources provide different titles), and determine how to merge them into a single package so they do not show up in the index as distinct data packages. This could involve creating an ORE and having both metadata docs be members of the package, or other solutions.
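To make the precedence question concrete, here is a minimal sketch of an additive merge in which the first source wins any conflict and later sources only fill gaps. The field names, the sample values, and the choice of SO as the primary source are all assumptions for illustration; this is not how the indexer currently behaves.

```python
from typing import Any


def merge_by_precedence(sources: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge index fields from several metadata documents.

    Sources are given in precedence order: the first source that supplies
    a non-empty value for a field wins, and later sources only fill gaps.
    """
    merged: dict[str, Any] = {}
    for fields in sources:
        for name, value in fields.items():
            if value in (None, "", []):
                continue  # an empty value should not block a later source
            merged.setdefault(name, value)
    return merged


# Hypothetical index fields extracted from each document of one package.
so_fields = {"title": "Stream temperature (SO)", "keywords": ["temperature"]}
eml_fields = {
    "title": "Stream temperature, 2010-2020 (EML)",
    "abstract": "Hourly readings.",
}

# SO is listed first, so its title is kept; the abstract comes from EML.
index_doc = merge_by_precedence([so_fields, eml_fields])
print(index_doc)
```

Swapping the order of the list flips the precedence, so the policy decision stays in one place.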

Dave and I had a slack conversation on this, some of which is included below for context.

Matt Jones  11:25 AM
Hey @davev with the new schema.org harvester, are we also picking up and indexing related metadata records (like linked ISO or EML records)?
11:25
I’ve been talking about that as a strength of DataONE’s indexer, but I’ve recently realized that maybe under the new approach things would work differently
11:26
I think it would be good if we could index all metadata content if it’s linked in the SO record. Thoughts?

Dave Vieglais  11:27 AM
we could, it wouldn’t be much more work, except that it’s a bit confusing having multiple metadata

Matt Jones  11:27 AM
true

Dave Vieglais  11:27 AM
what would that even mean? treat the SO like a resourcemap?

Matt Jones  11:27 AM
but for groups that have both, seems like it would be a win
11:28
yeah, not sure. the big question is when the two metadata documents say different things — like SO and EML have different titles

Dave Vieglais  11:28 AM
yeah

Matt Jones  11:28 AM
it would be nice to treat them as additive
11:30
maybe SO is designated as primary… to resolve conflicts.. if that is how the records were harvested
11:30
when you pull in a SO dataset, do you create an ORE in GMN?

Dave Vieglais  11:30 AM
not right now, it’s just metadata. It’s easy enough to create the ORE, but version management gets painful

Matt Jones  11:31 AM
right now the ORE and other metadata documents are additive in terms of what is indexed, but I don’t think they overlap in content much. but we’ve talked about allowing that, so that PROV and semantic annotations can go in either the ORE or the metadata doc. Seems like the same issues exist with SO

Dave Vieglais  11:32 AM
yep. SO is just more metadata-ish than ORE

...

Matt Jones  11:37 AM
for IEDA nodes, are you indexing schema.org and ISO?

Dave Vieglais  11:37 AM
they are on the old pattern, which uses SO as a way to find the ISO, which is then retrieved, sys meta created, and served up to the CNs for indexing

Matt Jones  11:38 AM
ah, so the SO is discarded?

Dave Vieglais  11:39 AM
Yes I think so. Perhaps identifier and a couple other properties retained for sysmeta

Matt Jones  11:39 AM
it seems to me that the right thing for us to do over the long run is to index both, and have a well-established precedence for conflicts. Maybe we’re not ready to offer this to NEON yet….

Dave Vieglais  11:40 AM
we need to at least have a clear implementation pattern as to what goes where.

Matt Jones  11:41 AM
the SOSO guidelines say how to link in the extra metadata, so that seems like something we should follow and I think it would be pretty clear.
11:41
maybe we could add some language there about precedence for harvesters

Dave Vieglais  11:42 AM
ah, good point

Matt Jones  11:43 AM
which should be preferred for values — ISO/EML/etc, or the SO fields — when info is duplicated?

Dave Vieglais  11:43 AM
how would we handle that as an object in DataONE though? There’s two metadata docs, with separate PIDs that generate a single index record

Matt Jones  11:44 AM
yeah, that’s why I asked about the ORE
11:44
if we harvest it as a package, we could put both metadata docs in and link them via an ORE
11:44
and index them both with a precedence order

Dave Vieglais  11:45 AM
But they get indexed to separate index docs

Matt Jones  11:45 AM
we’ve always theoretically had the ability to have multiple metadata docs in a package

Dave Vieglais  11:45 AM
so there’s no precedence to consider - each populates a different index record

Matt Jones  11:46 AM
so the package shows up twice in searches?

Dave Vieglais  11:46 AM
potentially I guess - what happens now if there’s two metadata docs in one package?

Matt Jones  11:47 AM
I’m not sure we have encountered it
11:47
we’ve talked about doing it, but so far I think client tools avoid doing so
11:48
another use case for it is to have dataset metadata for data files (EML/ISO) and software metadata for software files (e.g., CodeMeta)

Dave Vieglais  11:49 AM
I guess the ORE really represents the single thing that is actually discovered

Matt Jones  11:49 AM
yeah

Dave Vieglais  11:49 AM
kind of flips the UI around a bit

Matt Jones  11:49 AM
but the indexer treats the METADATA records as primary, and then pulls in the ORE later to link to other parts of the package
11:49
so I think we straddle both models a bit
11:50
in theory I think the package is the right metaphor for an “entry” in our index
11:50
i.e., we should be indexing complex data packages and their content

Dave Vieglais  11:52 AM
yeah, resulting in one index row per package, with lots of properties on that row.

Matt Jones  11:52 AM
this is also the root of the DOI assignment issue between LTER and our other systems. In Metacat, we assign the DOI to the metadata doc, and it is used in the citation. In LTER they assign the DOI to the package, and it doesn’t show up properly in our citation. There’s an old issue around on this.

Dave Vieglais  11:54 AM
Yeah, I wondered about that. DOI should really point to the resource map, since from there you can discover the pieces of the package. imho

Matt Jones  11:54 AM
https://redmine.dataone.org/issues/8077
11:55
yeah, it just came from our historical use of EML as the “package” listing, with entities referenced in the EML, and ORE only added in later

Dave Vieglais  11:55 AM
yep

Matt Jones  11:56 AM
ok, well, thoughts on how I should respond to James given this context?
11:56
maybe I could tell them SO is an option, but then their EML wouldn’t be indexed, but that we hope to support both in the future?
11:57
or I could tell them SO is an option, and we could discuss the ramifications on a call?
11:58
sounds like it’s going to be low priority for them to keep their Metacat running

Dave Vieglais  11:59 AM
probably the second choice. SO option and a call to discuss consequences

Matt Jones  12:00 PM
sounds good
12:00
should we open an issue on resolving the multiple metadata problem in SO links in the future?

Dave Vieglais  12:01 PM
yeah, good point

Matt Jones  12:01 PM
a lot of this slack convo would be good background
12:01
where would that go? d1_cn_index_processor?

Dave Vieglais  12:03 PM
I’d be inclined to drop it in dataone

Matt Jones  12:04 PM
ah the top level repo?
12:04
ok, I’ll enter it there

Dave Vieglais  12:04 PM
there’s some other stuff in there at about the same level - and this SO+ thing touches on a bunch of stuff through the whole stack
12:04
thanks

amoeba commented 3 years ago

As for other solutions, I wonder how well it'd work if, when we encounter a reference to a more detailed record (via a schema:subjectOf triple with a suitable schema:encodingFormat for our systems), we just harvest and use that as the primary metadata record for dataset/DataPackage.
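A rough sketch of that detection step, assuming compacted JSON-LD with unprefixed schema.org keys (`subjectOf`, `encodingFormat`, `contentUrl`). The `SUPPORTED_FORMATS` set, the function names, and the example URL are hypothetical, and a real harvester would likely want proper JSON-LD expansion rather than plain key lookups.

```python
import json

# Hypothetical list of format identifiers our indexer knows how to parse.
SUPPORTED_FORMATS = {
    "https://eml.ecoinformatics.org/eml-2.2.0",
    "http://www.isotc211.org/2005/gmd",
}


def as_list(value):
    """JSON-LD allows a single value or an array; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def linked_metadata_urls(dataset: dict) -> list[str]:
    """Return URLs of subjectOf entries whose encodingFormat we support."""
    urls = []
    for entry in as_list(dataset.get("subjectOf")):
        if not isinstance(entry, dict):
            continue  # a bare URL string carries no encodingFormat to check
        formats = {f for f in as_list(entry.get("encodingFormat")) if isinstance(f, str)}
        url = entry.get("contentUrl") or entry.get("url")
        if url and formats & SUPPORTED_FORMATS:
            urls.append(url)
    return urls


doc = json.loads("""
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example dataset",
  "subjectOf": {
    "@type": "DataDownload",
    "encodingFormat": ["application/xml", "https://eml.ecoinformatics.org/eml-2.2.0"],
    "contentUrl": "https://example.org/metadata/eml.xml"
  }
}
""")

print(linked_metadata_urls(doc))
```

If the list comes back empty, the harvester would fall back to indexing the JSON-LD itself, as it does today.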

If we did want to hang on to the original JSON-LD and any other alternate formats we didn't harvest, an appropriate place might be in the ORE using rdfs:seeAlso or ore:similarTo (See Section 4.4 in the ORE Spec).
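For illustration only, the statements involved could look something like the following, written out as N-Triples from plain Python. The example.org URIs and the helper names are made up, rdfs:seeAlso could be swapped for ore:similarTo, and a complete resource map needs more statements than this fragment shows.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def package_triples(aggregation: str, primary_metadata: str, alternates: list[str]) -> list[str]:
    """Aggregate the primary metadata doc and point at alternates via rdfs:seeAlso."""
    lines = [triple(aggregation, ORE + "aggregates", primary_metadata)]
    for uri in alternates:
        # Alternates are referenced but not aggregated, so they would not
        # become package members that get indexed separately.
        lines.append(triple(aggregation, RDFS + "seeAlso", uri))
    return lines


# Hypothetical identifiers for one harvested package.
statements = package_triples(
    "https://example.org/package/1#aggregation",
    "https://example.org/metadata/eml.xml",
    ["https://example.org/dataset/1.jsonld"],
)
print("\n".join(statements))
```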