NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

Update Metacat's sitemap implementation #1263

Closed: amoeba closed this issue 5 years ago

amoeba commented 6 years ago

To support features we're adding to MetacatUI, we need to take another look at the sitemap implementation in Metacat. In particular, we want to support Member Nodes running Metacat w/ MetacatUI, and also the DataONE Coordinating Nodes running the same combination.

Together with the features we're adding to MetacatUI, updating the sitemap implementation will allow Metacat deployments running MetacatUI to be indexed by Google, which would be awesome.

My plan is to:

We have an old ticket, https://github.com/NCEAS/metacat/issues/563, which I think I'll close since this ticket supersedes it.

amoeba commented 6 years ago

I see two big challenges:

https://github.com/NCEAS/metacat/blob/7c255728da43aae0e9382dc82b39ba9d77fbeb62/src/edu/ucsb/nceas/metacat/Sitemap.java#L120

is too narrow a restriction, since not all Member Nodes use only EML. I think a better rule here is that the formatId must be one of the registered formats in the XML catalog.
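Concretely, the check I have in mind could be expressed against the xml_catalog table, where formats are registered. Just a sketch, not actual Metacat code:

```sql
-- Hypothetical restriction: only include documents whose formatId
-- is registered in the XML catalog, rather than hardcoding EML.
SELECT sm.guid
FROM systemmetadata sm
WHERE sm.object_format IN
      (SELECT public_id FROM xml_catalog WHERE public_id IS NOT NULL);
```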

mbjones commented 6 years ago

@amoeba Did you note that metacat.properties already has the guid.ezid.uritemplate.metadata property, which is used to construct the canonical URL to be registered for DOI redirection?

Currently, the default is:

guid.ezid.uritemplate.metadata=/metacatui/#view/<IDENTIFIER>

But it typically gets changed to something sensible, and it gets used to build a landing page URI from the Metacat deployment host and port and the Tomcat context, something like `https://knb.ecoinformatics.org/#view/doi:10.5063/F1RR1WFT`, which is sent off to DataCite as the redirect URI for that DOI.
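The template expansion amounts to a simple string substitution. A minimal sketch (the helper name is made up, and the template below assumes a site-customized property value rather than the `/metacatui/` default):

```java
public class LandingPageUrl {
    // Hypothetical helper: expand the guid.ezid.uritemplate.metadata
    // template into a full landing-page URL for a given identifier.
    static String expand(String host, String template, String identifier) {
        return host + template.replace("<IDENTIFIER>", identifier);
    }

    public static void main(String[] args) {
        // Example values taken from this thread
        String url = expand("https://knb.ecoinformatics.org",
                            "/#view/<IDENTIFIER>",
                            "doi:10.5063/F1RR1WFT");
        System.out.println(url); // https://knb.ecoinformatics.org/#view/doi:10.5063/F1RR1WFT
    }
}
```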

In your case, I think this would become the canonical URI for the dataset, rather than the #view URI. This strikes me as a deeply similar need.

I know you want to keep Metacat and MetacatUI separate, but in reality it seems that Metacat does need to know something about its URI space, especially given that Metacat is likely the component to handle content negotiation to even determine whether MetacatUI gets invoked, rather than returning another format like EML, ISO, RDF, etc., in addition to its role in keeping DOI redirects updated.

mbjones commented 6 years ago

Also, in looking at the code you linked, I see that we are using Writer.write to stream the contents directly into a file, rather than building an XML model, adding elements to it, and then saving that to a file. So the current implementation is probably not doing XML entity and character escaping properly.
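For illustration, streaming through an XML writer instead of raw Writer.write calls gets entity escaping for free. A self-contained sketch with a made-up identifier containing a character that must be escaped:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class EscapedLoc {
    // Sketch: write a sitemap <loc> element via XMLStreamWriter so that
    // characters like '&' are escaped automatically.
    static String locElement(String url) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartElement("loc");
        xml.writeCharacters(url); // escapes &, <, and > for us
        xml.writeEndElement();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(locElement("https://example.com/view/a&b"));
        // prints: <loc>https://example.com/view/a&amp;b</loc>
    }
}
```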

amoeba commented 6 years ago

Ah, thanks @mbjones, I hadn't seen that. It'd make sense not to duplicate it. Metacat can already grab the MetacatUI deployment context programmatically, so my approach was to make the sitemap-specific configuration optional and default to <%= MetacatUIContext %>/view/<%= PID %>, with what's in the middle being the configurable part.

> In your case, I think this would become the canonical URI for the dataset, rather than the #view URI. This strikes me as a deeply similar need.

Yeah, totally. I think the Metacat property (or properties) controlling this could support linking both to URLs under the same web root and to URLs under a different one.

> I know you want to keep Metacat and MetacatUI separate,

This was an aspiration that I no longer have. I remembered/found that Metacat asks the user to provide the MetacatUI context so we're already linking the two pieces of software together.

> Also, in looking at the code you linked, I see that we are using Writer.write to stream the contents directly into a file, rather than building an XML model and adding elements to it, and then saving that to a file. So, the current implementation is probably not doing XML entity and character escaping properly.

Good catch! I always forget about those types of XML issues.

amoeba commented 6 years ago

I've made lots of progress on this over the last few weeks. I now have a better Metacat dev setup than before, and I can somewhat confidently re-deploy and test my code changes.

I ran into one thing while validating a test sitemap: the default behavior of sitemap validators is to require that the sitemap be served from the same path as the URLs it contains. This presents a conflict if we want to serve a sitemap at an MN deployment with canonical (DataONE) URIs in it.

For example, with a sitemap at http://example.com/catalog/sitemap.xml, all URLs in the sitemap need to start with exactly http://example.com/catalog/. So if we want to use the canonical (dataone.org) URIs for sitemap entries, we'd have to submit a cross-origin sitemap, which looks doable but which sitemaps.org doesn't really recommend.

For MN deployments, I think it'd be best to serve the sitemap/sitemapindex at the root, with URLs under MetacatUI's URL space (e.g., https://arcticdata.io/sitemap.xml). For CN deployments, we already serve a sitemap.xml file at the root, so I think we'd probably want to serve the sitemap/sitemapindex at dataone.org/datasets (e.g., https://dataone.org/datasets/sitemap.xml).
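For the MN case, the file at the root would be a standard sitemap index pointing at the actual sitemap files. A rough sketch (child sitemap URL is hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://arcticdata.io/sitemap1.xml</loc>
  </sitemap>
</sitemapindex>
```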

amoeba commented 6 years ago

I still wanna update the docs to match these changes, but this is ready for review. Metacat's sitemap functionality has been significantly refactored to support our modern use of it. Please take a look!

Next steps:

amoeba commented 6 years ago

Two notes to add from talking with @mbjones on Slack just now:

  1. The SQL query that decides which documents to put in the sitemap only includes the most recent document in a version chain, which means old versions won't show up. Maybe it's more correct to include all versions? It's hard to say what Google wants: should the sitemap match the search catalog (non-obsoleted, non-archived metadata objects), or should it be exhaustive?
  2. The SQL includes archived documents. It probably should not, right?

Feedback on those two bullet points would be greatly appreciated. When I'm back from traveling the next two weeks I can certainly take a deeper look at (1).

laurenwalker commented 6 years ago

I think we would not want archived objects to be indexed by Google, since archiving an object is like saying "make this object less discoverable."

I'm torn on including obsoleted objects. What would happen if you did a Google search for a data set and the first result was an obsoleted version? We decided that within our own catalog search we would not show obsoleted versions.

amoeba commented 6 years ago

> I'm torn on including obsoleted objects. What would happen if you did a Google search for a data set and the first result was an obsoleted version? We decided that within our own catalog search we would not show obsoleted versions.

I think this encapsulates my logic but I feel pretty torn too. If you look at a site like GitHub, which Google crawls, I imagine they refresh the crawled and searchable content every so often so that searches are essentially only searching the latest version of the README (and other info).

I Googled around to get a sense of this issue last week and didn't find a whole lot, but your point about Google potentially returning old versions sounds like a situation we don't want. Sitemaps support a per-URL priority, I believe, so I wonder if we could place a high priority on the latest version and a lower one on all non-latest versions.
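If we went that route, it would presumably use the optional <priority> element from the sitemap protocol (values range from 0.0 to 1.0, default 0.5); the PIDs and values below are made up:

```xml
<url>
  <loc>https://example.com/view/latest-version-pid</loc>
  <priority>0.8</priority>
</url>
<url>
  <loc>https://example.com/view/older-version-pid</loc>
  <priority>0.2</priority>
</url>
```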

amoeba commented 6 years ago

I did some soul searching (and talking to Mark Servilla), and I think it makes the most sense to restrict the records that go into sitemaps to:

This aligns the sitemap content to the content a user would see in the Metacat search catalog.
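In SQL terms, the restriction works out to roughly the following. This is a sketch reconstructed from the query plan further down, not the exact Metacat query:

```sql
-- Sketch: sitemap candidates are public-readable, current (not obsoleted),
-- non-archived objects whose format is registered in the XML catalog.
SELECT id.guid
FROM identifier id
JOIN systemmetadata sm ON sm.guid = id.guid
JOIN xml_access xa     ON xa.guid = id.guid
JOIN (SELECT DISTINCT public_id
      FROM xml_catalog
      WHERE public_id IS NOT NULL) xc
  ON sm.object_format = xc.public_id
WHERE xa.principal_name = 'public'
  AND xa.perm_type = 'allow'
  AND sm.obsoleted_by IS NULL
  AND NOT sm.archived
ORDER BY sm.date_uploaded;
```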

I also did some performance testing and don't see any issues: sitemaps generate in a second or two when there are ~50,000 documents. The SQL query is not great (see the query plan below) but still isn't that bad, and the default sitemap generation interval is a day, so performance isn't that critical anyway. I'm discussing with DataONE about doing a proper test of their use of the feature, but I haven't gotten set up to test it on a DataONE CN yet.

I'm going to open a PR so we can finalize this.

Query Plan:

```
Sort  (cost=60500.98..60860.32 rows=143735 width=54)
  Sort Key: systemmetadata.date_uploaded
  ->  Hash Join  (cost=19341.84..43273.88 rows=143735 width=54)
        Hash Cond: (identifier.guid = systemmetadata.guid)
        ->  Hash Join  (cost=6636.25..22973.40 rows=144141 width=91)
              Hash Cond: (xml_access.guid = identifier.guid)
              ->  Seq Scan on xml_access  (cost=0.00..10562.03 rows=144190 width=45)
                    Filter: (((principal_name)::text = 'public'::text) AND ((perm_type)::text = 'allow'::text))
              ->  Hash  (cost=3591.89..3591.89 rows=142989 width=46)
                    ->  Seq Scan on identifier  (cost=0.00..3591.89 rows=142989 width=46)
        ->  Hash  (cost=9530.25..9530.25 rows=142587 width=53)
              ->  Hash Join  (cost=4.05..9530.25 rows=142587 width=53)
                    Hash Cond: ((systemmetadata.object_format)::text = (xml_catalog.public_id)::text)
                    ->  Seq Scan on systemmetadata  (cost=0.00..8040.92 rows=142587 width=85)
                          Filter: ((obsoleted_by IS NULL) AND (NOT archived))
                    ->  Hash  (cost=3.42..3.42 rows=50 width=39)
                          ->  HashAggregate  (cost=2.92..3.42 rows=50 width=39)
                                Group Key: (xml_catalog.public_id)::text
                                ->  Seq Scan on xml_catalog  (cost=0.00..2.75 rows=70 width=39)
                                      Filter: (public_id IS NOT NULL)
(20 rows)
```
amoeba commented 5 years ago

Reviewed and merged.