Closed: amoeba closed this 5 years ago
I see two big challenges:
is too narrow a limit since not all Member Nodes only use EML. I think a better limit here is that the formatID must be one of the registered formats in the XML catalog.
@amoeba Did you note that metacat.properties already has the guid.ezid.uritemplate.metadata property, which is used to construct the canonical URL to be registered for DOI redirection?
Currently, the default is:
guid.ezid.uritemplate.metadata=/metacatui/#view/<IDENTIFIER>
But it gets changed to something sensible, and this gets used to build a landing page URI from the metacat deployment host and port and the tomcat context, something like 'https://knb.ecoinformatics.org/#view/doi:10.5063/F1RR1WFT', which is sent off to DataCite as the redirect URI for that DOI.
In your case, I think this would become the canonical URI for the dataset, rather than the #view URI. This strikes me as a deeply similar need.
I know you want to keep Metacat and MetacatUI separate, but in reality it seems that Metacat does need to know something about its URI space. Metacat is likely the component that handles content negotiation to determine whether MetacatUI gets invoked at all, rather than returning another format like EML, ISO, RDF, etc., and it also has a role in keeping DOI redirects updated.
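As a rough illustration of the template expansion described above, here is a minimal Python sketch. The function name `build_landing_url` and its parameters are hypothetical and are not Metacat's actual code; the sketch only assumes that the `<IDENTIFIER>` placeholder is substituted and the result is appended to the deployment's base URL.

```python
def build_landing_url(server_base: str, template: str, identifier: str) -> str:
    """Fill the <IDENTIFIER> placeholder and prepend the server base URL.

    Hypothetical helper: Metacat's real implementation may differ.
    """
    path = template.replace("<IDENTIFIER>", identifier)
    # Join with exactly one slash between the base and the path.
    return server_base.rstrip("/") + "/" + path.lstrip("/")


landing_url = build_landing_url(
    "https://knb.ecoinformatics.org",
    "/#view/<IDENTIFIER>",
    "doi:10.5063/F1RR1WFT",
)
print(landing_url)
```

With the template set to `/#view/<IDENTIFIER>`, this reproduces the example landing page URI quoted in the comment.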
Also, in looking at the code you linked, I see that we are using Writer.write to stream the contents directly into a file, rather than building an XML model, adding elements to it, and then saving that to a file. So, the current implementation is probably not doing XML entity and character escaping properly.
Ah, thanks @mbjones, I didn't see that. It'd make sense not to duplicate that. Metacat can already grab the metacatui deployment context programmatically, so my approach was to make the configuration for sitemap purposes optional and default to <%= MetacatUIContext %>/view/<%= PID %> with what's in the middle being the configurable part.
> In your case, I think this would become the canonical URI for the dataset, rather than the #view URI. This strikes me as a deeply similar need.
Yeah, totally. I think the Metacat propert(y|ies) controlling this could support linking to URLs under the same web root and URLs under another.
> I know you want to keep Metacat and MetacatUI separate,
This was an aspiration that I no longer have. I remembered/found that Metacat asks the user to provide the MetacatUI context so we're already linking the two pieces of software together.
> Also, in looking at the code you linked, I see that we are using Writer.write to stream the contents directly into a file, rather than building an XML model and adding elements to it, and then saving that to a file. So, the current implementation is probably not doing XML entity and character escaping properly.
Good catch! I always forget about those types of XML issues.
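To illustrate the escaping concern raised above, here is a small Python sketch that builds the sitemap as an XML tree instead of writing strings directly. It is not the Metacat (Java) implementation; `build_sitemap` is a hypothetical name and the sample URL is made up to contain characters that need escaping.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a <urlset> document; the serializer escapes &, < and > in text."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URL with characters that would break naive string concatenation.
xml_text = build_sitemap(["https://example.com/view/a&b<c>"])
print(xml_text)
```

Because the tree serializer handles escaping, an identifier containing `&` or `<` cannot produce a malformed file, which is the risk with streaming raw strings.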
Made lots of progress on this these last few weeks. I now have a better Metacat dev setup than before and I can somewhat confidently re-deploy and test my code changes.
I ran into one thing while validating a test sitemap which is that the default behavior of sitemap validators is to require the sitemap be served from the same path as the URLs it's serving. This presents a conflict if we want to serve a sitemap at an MN deployment with canonical (DataONE) URIs in it.
For example, with a sitemap at http://example.com/catalog/sitemap.xml, all URLs in the sitemap need to start with exactly http://example.com/catalog/. So if we want to use the canonical (dataone.org) URI for sitemap entries, we'd have to submit a cross-origin sitemap, which looks doable but which sitemaps.org doesn't really recommend.
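The location rule above can be sketched as a simple prefix check. This is an illustrative Python snippet, not a real validator; `sitemap_scope` and `in_scope` are hypothetical names, and the check assumes only the "same directory or below" rule described in this comment.

```python
import posixpath
from urllib.parse import urlsplit


def sitemap_scope(sitemap_url: str) -> str:
    """Return the URL prefix that entries in this sitemap must start with."""
    parts = urlsplit(sitemap_url)
    directory = posixpath.dirname(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_scope(sitemap_url: str, entry_url: str) -> bool:
    """True if entry_url falls under the sitemap's own directory."""
    return entry_url.startswith(sitemap_scope(sitemap_url))


print(sitemap_scope("http://example.com/catalog/sitemap.xml"))
print(in_scope("http://example.com/catalog/sitemap.xml", "https://dataone.org/datasets/x"))
```

A sitemap served at the site root has the whole host as its scope, which is why serving it at the root avoids the conflict.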
For MN deployments, I think it'd be best to serve the sitemap/sitemapindex at the root with URLs under MetacatUI's URL space (e.g., https://arcticdata.io/sitemap.xml). For CN deployments, we already serve a sitemap.xml file at the root, so I think we would probably want to serve the sitemap/sitemapindex at dataone.org/datasets (e.g., https://dataone.org/datasets/sitemap.xml).
I still wanna update docs to match these changes but this is ready for review. Metacat's Sitemap functionality has been significantly refactored to support our modern use of it. Please take a look!
Sitemaps no longer know about MetacatUI skins and instead point at MetacatUI routes (/view/{PID}) and use DataONE PIDs instead of docids
Two new configuration parameters have been added to control both the URLs sitemaps are served at as well as the URL format of URLs inside the sitemaps.
# Base part of the URLs for the location of the sitemap files themselves.
# Either full URL or absolute path. Trailing slash optional.
sitemap.location.base=/metacatui
# Base part of the URLs for the location entries in the sitemaps which should
# be the base URL of the dataset landing page.
# Either full URL or absolute path. Trailing slash optional.
sitemap.entry.base=/metacatui/view
/metacat/)
I opted not to use the ezid property guid.ezid.uritemplate.metadata=/metacatui/#view/<IDENTIFIER> for simplicity, but I think these two sets of properties could be merged into higher-level properties that both the ezid and sitemaps could use. Thoughts? Something like...
metacatui.baseurl=/metacatui
metacatui.viewroute=view
# remove guid.ezid.uritemplate.metadata, sitemap.location.base and sitemap.entry.base
The SQL query for which documents to include in the sitemaps no longer restricts to only EML docs and uses DataONE PIDs instead of docids. Could I please get a review specifically on the query?
By default, sitemaps will still be generated at {metacat_context}/sitemaps/... so, to fully support sitemaps for Google's purposes, all Metacat installations will need to customize their Apache configs like so:
ProxyPassMatch "^/(sitemap.+)" "http://localhost:8080/metacat/sitemaps/$1"
ProxyPassReverse "^/(sitemap.+)" "http://localhost:8080/metacat/sitemaps/$1"
in order to be Google compliant. This is because sitemap indexes and sitemap files must be served at or above the URL space that the sitemap entries are in, e.g., dataone.org/something/sitemap.xml can't include URLs like dataone.org/view/x. So the DataONE sitemap needs to be served at either dataone.org/sitemap.xml or dataone.org/view/sitemap.xml.
Increased the sitemap entry limit from 25,000 URLs to 50,000 (Google's max). This will result in fewer sitemap files.
Sitemap URLs are now properly XML escaped
The Sitemap files have been renamed from metacatSitemapIndex -> sitemap_index and metacat1.xml -> sitemap1.xml
Updated sitemap test for new functionality
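The entry limit and file naming in the list above can be sketched as follows. This is a hypothetical Python illustration, not the Metacat code: `plan_sitemap_files` is an invented name, and the sketch assumes only that entries are split into numbered files of at most 50,000 URLs each.

```python
MAX_URLS_PER_SITEMAP = 50_000  # the limit mentioned in the list above


def plan_sitemap_files(pids, limit=MAX_URLS_PER_SITEMAP):
    """Map sitemap file names (sitemap1.xml, sitemap2.xml, ...) to their PIDs."""
    files = {}
    for start in range(0, len(pids), limit):
        name = f"sitemap{start // limit + 1}.xml"
        files[name] = pids[start:start + limit]
    return files


# Made-up identifiers, just to show how many files a given count produces.
example_pids = [f"pid-{i}" for i in range(120_001)]
plan = plan_sitemap_files(example_pids)
print({name: len(chunk) for name, chunk in plan.items()})
```

Under the old 25,000 limit the same 120,001 entries would need five files instead of three, which is the "fewer sitemap files" effect noted above.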
This implementation supports the standard use of serving sitemaps for a Metacat installation using MetacatUI URLs and also supports serving sitemaps at dataone.org with dataset PURIs (dataone.org/datasets/{PID}).
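To make the two supported configurations concrete, here is a hedged Python sketch of how a sitemap entry URL could be derived from sitemap.entry.base, which per the property comments may be either a full URL or an absolute path with an optional trailing slash. The helper `entry_url`, the server URL, and the choice to percent-encode the PID are assumptions for illustration, not confirmed Metacat behavior.

```python
from urllib.parse import quote


def entry_url(server_url: str, entry_base: str, pid: str) -> str:
    """Build a sitemap <loc> value from the configured entry base and a PID."""
    base = entry_base.rstrip("/")  # trailing slash is optional
    if not base.startswith(("http://", "https://")):
        # Absolute path: resolve it against the server's own URL.
        base = server_url.rstrip("/") + "/" + base.lstrip("/")
    # Assumption: the PID is percent-encoded so it forms a single path segment.
    return base + "/" + quote(pid, safe="")


# Standard Metacat + MetacatUI deployment (absolute path).
print(entry_url("https://arcticdata.io", "/metacatui/view/", "abc"))
# DataONE-style deployment (full URL to the dataset PURI space).
print(entry_url("https://arcticdata.io", "https://dataone.org/datasets", "doi:10.5063/F1RR1WFT"))
```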
Next steps:
Two notes to add from talking with @mbjones on Slack just now:
Feedback on those two bullet points would be greatly appreciated. When I'm back from traveling the next two weeks I can certainly take a deeper look at (1).
I think we would not want archived objects to be indexed by Google, since archiving an object is like saying "make this object less discoverable."
I'm torn on including obsoleted objects. What would happen if you did a Google search for a data set and the first result was an obsoleted version? We decided that within our own catalog search we would not show obsoleted versions.
> I'm torn on including obsoleted objects. What would happen if you did a Google search for a data set and the first result was an obsoleted version? We decided that within our own catalog search we would not show obsoleted versions.
I think this encapsulates my logic but I feel pretty torn too. If you look at a site like GitHub, which Google crawls, I imagine they refresh the crawled and searchable content every so often so that searches are essentially only searching the latest version of the README (and other info).
I Googled around to get a sense of this issue last week and didn't find a whole lot but your point about Google potentially returning old versions sounds like a situation we don't want. Sitemaps support relevance ranks, I believe, so I wonder if we could place a high rank on the latest version and a lower rank on all non-latest versions.
I did some soul searching (and talking to Mark Servilla), and I think it makes the most sense to restrict the records that go into sitemaps to:
This aligns the sitemap content to the content a user would see in the Metacat search catalog.
I also did some performance testing and I don't see any issues: Sitemaps generate in a second or two when there are ~50,000 documents. The SQL query is not great (See Query Plan below) but still isn't that bad. The default sitemap generation interval is a day so performance isn't that critical anyway. I am discussing with DataONE on doing a proper test for their use of the feature but haven't gotten set up to test it on a DataONE CN yet.
I'm going to PR this so we can finalize it.
Reviewed and merged.
To support features we're adding to MetacatUI, we need to take another look at the Sitemap implementation in Metacat. In particular, we want to support Member Nodes running Metacat w/ MetacatUI, and also the DataONE Coordinating Nodes running Metacat w/ MetacatUI. Together with the features we're adding to MetacatUI:
Updating the Sitemap implementation will allow Metacat users running MetacatUI to be indexed by Google, which would be awesome.
My plan is to:
We have an old ticket, https://github.com/NCEAS/metacat/issues/563, that I think I'll close as this ticket obviates that one.