NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

Update Metacat's sitemap implementation #1263

Closed: amoeba closed this issue 5 years ago

amoeba commented 6 years ago

To support features we're adding to MetacatUI, we need to take another look at the sitemap implementation in Metacat. In particular, we want to support Member Nodes running Metacat w/ MetacatUI, and also the DataONE Coordinating Nodes running the same combination.

Together with the features we're adding to MetacatUI, updating the sitemap implementation will allow Metacat deployments running MetacatUI to be indexed by Google, which would be awesome.

My plan is to:

We have an old ticket, https://github.com/NCEAS/metacat/issues/563, which I think I'll close since this ticket supersedes it.

amoeba commented 6 years ago

I see two big challenges:

https://github.com/NCEAS/metacat/blob/7c255728da43aae0e9382dc82b39ba9d77fbeb62/src/edu/ucsb/nceas/metacat/Sitemap.java#L120

is too narrow a restriction, since not all Member Nodes use only EML. I think a better rule here is that the formatId must be one of the registered formats in the XML catalog.
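Concretely, the check I have in mind could be expressed against the xml_catalog table, where formats are registered. Just a sketch, not actual Metacat code:

```sql
-- Hypothetical restriction: only include documents whose formatId
-- is registered in the XML catalog, rather than hardcoding EML.
SELECT sm.guid
FROM systemmetadata sm
WHERE sm.object_format IN
      (SELECT public_id FROM xml_catalog WHERE public_id IS NOT NULL);
```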

mbjones commented 6 years ago

@amoeba Did you note that metacat.properties already has the guid.ezid.uritemplate.metadata property, which is used to construct the canonical URL to be registered for DOI redirection?

Currently, the default is:

guid.ezid.uritemplate.metadata=/metacatui/#view/<IDENTIFIER>

But it typically gets changed to something sensible, and it gets used to build a landing page URI from the Metacat deployment host and port and the Tomcat context, something like `https://knb.ecoinformatics.org/#view/doi:10.5063/F1RR1WFT`, which is sent off to DataCite as the redirect URI for that DOI.
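The template expansion amounts to a simple string substitution. A minimal sketch (the helper name is made up, and the template below assumes a site-customized property value rather than the `/metacatui/` default):

```java
public class LandingPageUrl {
    // Hypothetical helper: expand the guid.ezid.uritemplate.metadata
    // template into a full landing-page URL for a given identifier.
    static String expand(String host, String template, String identifier) {
        return host + template.replace("<IDENTIFIER>", identifier);
    }

    public static void main(String[] args) {
        // Example values taken from this thread
        String url = expand("https://knb.ecoinformatics.org",
                            "/#view/<IDENTIFIER>",
                            "doi:10.5063/F1RR1WFT");
        System.out.println(url); // https://knb.ecoinformatics.org/#view/doi:10.5063/F1RR1WFT
    }
}
```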

In your case, I think this would become the canonical URI for the dataset, rather than the #view URI. This strikes me as a deeply similar need.

I know you want to keep Metacat and MetacatUI separate, but in reality it seems that Metacat does need to know something about its URI space, especially given that Metacat is likely the component to handle content negotiation to even determine whether MetacatUI gets invoked, rather than returning another format like EML, ISO, RDF, etc., in addition to its role in keeping DOI redirects updated.

mbjones commented 6 years ago

Also, in looking at the code you linked, I see that we are using Writer.write to stream the contents directly into a file, rather than building an XML model, adding elements to it, and then saving that to a file. So the current implementation is probably not doing XML entity and character escaping properly.
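For illustration, streaming through an XML writer instead of raw Writer.write calls gets entity escaping for free. A self-contained sketch with a made-up identifier containing a character that must be escaped:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class EscapedLoc {
    // Sketch: write a sitemap <loc> element via XMLStreamWriter so that
    // characters like '&' are escaped automatically.
    static String locElement(String url) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartElement("loc");
        xml.writeCharacters(url); // escapes &, <, and > for us
        xml.writeEndElement();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(locElement("https://example.com/view/a&b"));
        // prints: <loc>https://example.com/view/a&amp;b</loc>
    }
}
```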

amoeba commented 6 years ago

Ah, thanks @mbjones, I hadn't seen that. It'd make sense not to duplicate it. Metacat can already grab the MetacatUI deployment context programmatically, so my approach was to make the sitemap-specific configuration optional and default to <%= MetacatUIContext %>/view/<%= PID %>, with what's in the middle being the configurable part.

> In your case, I think this would become the canonical URI for the dataset, rather than the #view URI. This strikes me as a deeply similar need.

Yeah, totally. I think the Metacat property (or properties) controlling this could support linking both to URLs under the same web root and to URLs under a different one.

> I know you want to keep Metacat and MetacatUI separate,

This was an aspiration that I no longer have. I remembered/found that Metacat asks the user to provide the MetacatUI context so we're already linking the two pieces of software together.

> Also, in looking at the code you linked, I see that we are using Writer.write to stream the contents directly into a file, rather than building an XML model and adding elements to it, and then saving that to a file. So, the current implementation is probably not doing XML entity and character escaping properly.

Good catch! I always forget about those types of XML issues.

amoeba commented 6 years ago

I've made lots of progress on this over the last few weeks. I now have a better Metacat dev setup than before, and I can somewhat confidently re-deploy and test my code changes.

I ran into one thing while validating a test sitemap: the default behavior of sitemap validators is to require that the sitemap be served from the same path as the URLs it contains. This presents a conflict if we want to serve a sitemap at an MN deployment with canonical (DataONE) URIs in it.

For example, with a sitemap at http://example.com/catalog/sitemap.xml, all URLs in the sitemap need to start with exactly http://example.com/catalog/. So if we want to use the canonical (dataone.org) URIs for sitemap entries, we'd have to submit a cross-origin sitemap, which looks doable but which sitemaps.org doesn't really recommend.

For MN deployments, I think it'd be best to serve the sitemap/sitemapindex at the root, with URLs under MetacatUI's URL space (e.g., https://arcticdata.io/sitemap.xml). For CN deployments, we already serve a sitemap.xml file at the root, so I think we'd probably want to serve the sitemap/sitemapindex at dataone.org/datasets (e.g., https://dataone.org/datasets/sitemap.xml).
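For the MN case, the file at the root would be a standard sitemap index pointing at the actual sitemap files. A rough sketch (child sitemap URL is hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://arcticdata.io/sitemap1.xml</loc>
  </sitemap>
</sitemapindex>
```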

amoeba commented 6 years ago

I still wanna update the docs to match these changes, but this is ready for review. Metacat's sitemap functionality has been significantly refactored to support our modern use of it. Please take a look!

Next steps:

amoeba commented 6 years ago

Two notes to add from talking with @mbjones on Slack just now:

  1. The SQL query that decides which documents to put in the sitemap only includes the most recent document in a version chain, which means old versions won't show up. Maybe it's more correct to include all versions? It's hard to say what Google wants: should the sitemap match the search catalog (non-obsoleted, non-archived metadata objects), or should it be exhaustive?
  2. The SQL includes archived documents. It probably should not, right?

Feedback on those two bullet points would be greatly appreciated. When I'm back from traveling the next two weeks I can certainly take a deeper look at (1).

laurenwalker commented 6 years ago

I think we would not want archived objects to be indexed by Google, since archiving an object is like saying "make this object less discoverable."

I'm torn on including obsoleted objects. What would happen if you did a Google search for a data set and the first result was an obsoleted version? We decided that within our own catalog search we would not show obsoleted versions.

amoeba commented 6 years ago

> I'm torn on including obsoleted objects. What would happen if you did a Google search for a data set and the first result was an obsoleted version? We decided that within our own catalog search we would not show obsoleted versions.

I think this encapsulates my logic but I feel pretty torn too. If you look at a site like GitHub, which Google crawls, I imagine they refresh the crawled and searchable content every so often so that searches are essentially only searching the latest version of the README (and other info).

I Googled around to get a sense of this issue last week and didn't find a whole lot, but your point about Google potentially returning old versions sounds like a situation we don't want. Sitemaps support a per-URL priority, I believe, so I wonder if we could place a high priority on the latest version and a lower one on all non-latest versions.
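If we went that route, it would presumably use the optional <priority> element from the sitemap protocol (values range from 0.0 to 1.0, default 0.5); the PIDs and values below are made up:

```xml
<url>
  <loc>https://example.com/view/latest-version-pid</loc>
  <priority>0.8</priority>
</url>
<url>
  <loc>https://example.com/view/older-version-pid</loc>
  <priority>0.2</priority>
</url>
```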

amoeba commented 6 years ago

I did some soul searching (and talking to Mark Servilla), and I think it makes the most sense to restrict the records that go into sitemaps to:

This aligns the sitemap content to the content a user would see in the Metacat search catalog.
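In SQL terms, the restriction works out to roughly the following. This is a sketch reconstructed from the query plan further down, not the exact Metacat query:

```sql
-- Sketch: sitemap candidates are public-readable, current (not obsoleted),
-- non-archived objects whose format is registered in the XML catalog.
SELECT id.guid
FROM identifier id
JOIN systemmetadata sm ON sm.guid = id.guid
JOIN xml_access xa     ON xa.guid = id.guid
JOIN (SELECT DISTINCT public_id
      FROM xml_catalog
      WHERE public_id IS NOT NULL) xc
  ON sm.object_format = xc.public_id
WHERE xa.principal_name = 'public'
  AND xa.perm_type = 'allow'
  AND sm.obsoleted_by IS NULL
  AND NOT sm.archived
ORDER BY sm.date_uploaded;
```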

I also did some performance testing and don't see any issues: sitemaps generate in a second or two when there are ~50,000 documents. The SQL query is not great (see the query plan below) but still isn't that bad, and the default sitemap generation interval is a day, so performance isn't that critical anyway. I'm discussing with DataONE about doing a proper test of their use of the feature, but I haven't gotten set up to test it on a DataONE CN yet.

I'm going to open a PR so we can finalize this.

Query Plan:

```
Sort  (cost=60500.98..60860.32 rows=143735 width=54)
  Sort Key: systemmetadata.date_uploaded
  ->  Hash Join  (cost=19341.84..43273.88 rows=143735 width=54)
        Hash Cond: (identifier.guid = systemmetadata.guid)
        ->  Hash Join  (cost=6636.25..22973.40 rows=144141 width=91)
              Hash Cond: (xml_access.guid = identifier.guid)
              ->  Seq Scan on xml_access  (cost=0.00..10562.03 rows=144190 width=45)
                    Filter: (((principal_name)::text = 'public'::text) AND ((perm_type)::text = 'allow'::text))
              ->  Hash  (cost=3591.89..3591.89 rows=142989 width=46)
                    ->  Seq Scan on identifier  (cost=0.00..3591.89 rows=142989 width=46)
        ->  Hash  (cost=9530.25..9530.25 rows=142587 width=53)
              ->  Hash Join  (cost=4.05..9530.25 rows=142587 width=53)
                    Hash Cond: ((systemmetadata.object_format)::text = (xml_catalog.public_id)::text)
                    ->  Seq Scan on systemmetadata  (cost=0.00..8040.92 rows=142587 width=85)
                          Filter: ((obsoleted_by IS NULL) AND (NOT archived))
                    ->  Hash  (cost=3.42..3.42 rows=50 width=39)
                          ->  HashAggregate  (cost=2.92..3.42 rows=50 width=39)
                                Group Key: (xml_catalog.public_id)::text
                                ->  Seq Scan on xml_catalog  (cost=0.00..2.75 rows=70 width=39)
                                      Filter: (public_id IS NOT NULL)
(20 rows)
```
amoeba commented 5 years ago

Reviewed and merged.