fedwiki / wiki-server

Federated Wiki client and server in Node.js
Other
153 stars 35 forks source link

changes to sitemap generation strategy #85

Closed paul90 closed 9 years ago

paul90 commented 9 years ago

This pull request changes the way sitemap.json is served. Rather than create the sitemap when requested we create it when the server is started, and update it whenever a page is created or modified.

We also save the file to status/sitemap.json - so we can let the server handle requests from clients that will be caching.

WardCunningham commented 9 years ago

If I understand the logic correctly, the sitemap module will be instantiated for each server in a farm and that instance will keep an in-memory copy of the sitemap used only to simplify incremental update. Is this wise use of memory in the context of a large farm where only a small percentage of sites are actively updated? Perhaps this in-memory copy could age out and be reconstructed from the page db when editing resumes. Nothing would be waiting on the reconstruction since /system/sitemap.json requests would still be served from the filesystem.

WardCunningham commented 9 years ago

We're assuming that page updates come through the web. This isn't true for content importers that write new content directly to the pages directory. Typical workflow involves repeatedly running the importer script then refreshing RecentChanges to see if the created pages are as desired. Malformed pages are common during importer development so sitemap generation should be tolerant and quick to move on.

I notice node fs has several watch mechanisms that connect to platform notifications. This might be just the ticket here or it could be more per-site overhead incurred by farms.

Another strategy might be to trigger a reconstruction by having the importer delete the file system's copy of the sitemap.json. The server wound then just need to watch this file or stat for it on every sitemap request. This adds some new burden on importers but one I could live with.

WardCunningham commented 9 years ago

Please forgive the profusion of comments. I'm excited by the efficiency offered here especially knowing how we've come to depend on sitemaps and aspire to exploit them more from the client side.

paul90 commented 9 years ago

Please forgive the profusion of comments.

Well, it is just a first pass - some things like writing it to file are just an expedient so the code does not need to worry about client request and if the client already has an up to date version.

paul90 commented 9 years ago

Maybe a way around, would be to serve the sitemap from memory, if it exists there, otherwise from disk. This adds an extra level of complexity, as we will need to handle a number of things so that client caching works - no point in sending a copy if the client already has a copy.

Maybe the in-memory version is only kept while there are update requests in the queue, and the sitemap read-in when it needs updating. Probably over aggressive at freeing up memory, but still a lot better than creating it from scratch.

Not sure about importers, deleting the sitemap file probably creates work that could be saved. Not sure that watch would work, if only as the pages might be being saved to the file system. Maybe if the importer saved a list of pages it has imported, and the presence of that list triggers an update of the sitemap entries for those pages.

paul90 commented 9 years ago

Is there any point in avoiding duplicates in this queue?

Really not sure if there is - especially as it now only saves the sitemap all items in the queue have been updated.

WardCunningham commented 9 years ago

It just occurred to me that you might be waiting on me. Let's see this work.

paul90 commented 9 years ago

There is a couple more change I still want to make:

WardCunningham commented 9 years ago

Yes, both are long-term important. I'm still happy to see this released now.

paul90 commented 9 years ago

Published as part of + wiki-server@0.5.1