Geeklog-Core / geeklog

Geeklog - The Secure CMS.
https://www.geeklog.net
24 stars 19 forks source link

Creating the Entire Sitemap each time Content is Added is Slow on Large Sites #1050

Closed eSilverStrike closed 2 years ago

eSilverStrike commented 4 years ago

For larger sites like Geeklog.net the sitemap takes a while to write when information for about 1,000's of forum posts and articles need to be gathered and written to a file each time some new piece of content is added. For example this adds 3 to 5 seconds to every new forum post when saved.

We need to give the sitemap the ability to just add, update or delete individual entries as needed. This will need to work for all sitemaps: the regular sitemap, the mobile sitemap, and the news sitemap.

When a config option is changed remember the entire sitemap will have to be rewritten. We should also add an option in the admin to recreate the entire sitemap just in case something goes wrong.

To do this we will need to update Plugin API like

$result = PLG_collectSitemapItems($type, $uid, $limit);

or add new plugin api that plugins can support which would allow the sitemap plugin to add, update, or delete individual items.

For example right now if a forum post gets added or deleted the entire sitemap needs to get recreated which takes a long time for extremely large sites that may have thousands of posts. This way the sitemap plugin could just scan the sitemap and remove the one that needs deleted or add in the new entry.

Changes would also have to be made to the api like for plugin_itemsaved_foo, plugin_itemdeleted_foo, and XMLSMAP_update

function plugin_itemsaved_xmlsitemap($id, $type, $old_id)
{
    XMLSMAP_update($type);
}

function plugin_itemdeleted_xmlsitemap($id, $type)
{
.....
}

We would also need to pass in a subtype (like the Likes API already does) to support plugins that have more than one item type) along with an id (if it doesn't have it already) and an action variable. Here is an example of what the update API may look like:

function plugin_itemsaved_xmlsitemap($id, $type, $old_id, $sub_type = '')
function plugin_itemdeleted_xmlsitemap($id, $type, $id = '', $sub_type = '')
function  XMLSMAP_update($type, $sub_type = '', $id = '', $action = 0);

This is just a preliminary look at what needs to be done.

mystralkk commented 4 years ago

According to https://www.sitemaps.org/protocol.html , there are no such tags as "subtype" in defining sitemaps. Other than this, I agree with your suggestion.

eSilverStrike commented 4 years ago

I meant the subtype to be part of the Geeklog API functions only. Some plugins could have multiple items of different types and we need away to help plugins identify which type of item we want to update in the Sitemap plugin.

I had this issue when doing the Geeklog Likes System so I included the subtype variable in the likes API functions calls. See PLG_canUserLike for an example. I also added this for topic assignments at the same time. This is still a problem for other Geeklog items like comments and calls in lib-plugin which will need to be fixed in the future.

For example take the Media Gallery. The Media Gallery has 2 different items it stores information about: Albums and Media. With our Geeklog API calls (for example PLG_itemSaved which then calls plugin_itemsaved_sitemap), the type variable would be the plugin name itself “mediagallery” and then we have the id. This id passed could potentially be for either a album or media. If we start adding a sub_type in our functions, we then can allow plugins to differentiate between its own items. For example the media gallery could have sitemap items listed for both albums and media, or likes for both, or comments for both, etc... This is the same logic that is used by the Geeklog likes system and Geeklog topic assignments. At some point we should add this logic in for comments as well and for other plugin API like PLG_getItemInfo.

Most plugins will probably just have a empty sub_type since they only really deal with 1 type of item. But there are a few out their like Media Gallery, the Forum, etc.. that actually have a number of different items. Adding the sub_type now to the updated api calls would make it more future proof as well.

Does that explain things a better? 😀

The way I see this feature working (unless you have a better way and/or I missed something):

Something is saved in Geeklog.

function plugin_itemsaved_xmlsitemap($id, $type, $old_id, $sub_type = '')

would get called. Which then would call:

function XMLSMAP_update($type, $sub_type = '', $id = '', $action = 0);

At this point instead of XMLSMAP_update creating a whole new sitemap, it would take the $type, $sub_type, and $id information and request just the information for that id from the plugin (including the sub_type) which then would return the information needed to find (by URL I guess) and edit or delete the sitemap entry in the sitemap, or add the new entry (depending on what the action is). Either some new spamx plugin api would have to be created for this or we could update PLG_getItem function to return this information. If the info is not found/returned then we know the plugin does not support this new feature and the whole sitemap would have to be rebuilt.

mystralkk commented 4 years ago

Thanks for clarifying things much better. Now I have a better picture of how the XMLSitemap plugin should work.

eSilverStrike commented 4 years ago

Great. I looked a bit into this feature at one point but currently don't have the time to work on it. Adding this ability in will be very helpful for the larger Geeklog Sites and quicken the response times for adding and deleting.

When Geeklog.net gets a bunch of spam forum posts I end up disabling the sitemap plugin before deleting them as deleting more than 3 at once actually causes the server to exceed the PHP execution time due to rewriting the complete sitemap for each individual delete.

mystralkk commented 4 years ago

Implemented with change set 2da0a73. API changes involved are:

eSilverStrike commented 2 years ago

Just was doing some testing. Delete and item and patching the file looks to be working.

Add an item and patching the file not yet

All 3 sitemaps item entries are different (sitemap, mobile sitemap (just slightly different), and news sitemap)

Looks like additem and formatitem functions needs to be updated to take this into account as it only current adds in the xml for a regular sitemap and none of the others.

Also the news sitemap has extra restrictions which are news_sitemap_topics and news_sitemap_age config option. The item being added has to be an article AND if the article that is being updated/added doesn't meet these config options expectations then it shouldn't be added.

mystralkk commented 2 years ago

Fixed with change set f1a5edec489b32a80e3c95897d9e0ddf473934df. I was unaware of the news sitemap.

eSilverStrike commented 2 years ago

Ugh the sitemap is so tedious to test. Found issue with plugin_getiteminfo_story not filtering when story Id is specified (would only filter by date and topics if * passed). Hopefully this doesn't mess with anything else that relies on this function (I don't think it would)

Also fixed date passed to sql database for filter as it needed to be 'Y-m-d H:i:s' and the international date standard was used.

eSilverStrike commented 2 years ago

@mystralkk Can you retest the sitemap again? I think I got all the scenarios working now but might have missed something

I found a few issues which I fixed including:

1) fix for patching mobile sitemap where it was missing mobile indicator 2) fix for create all sitemap files running after everything is patched 3) items not getting deleted in sitemap when they should (like when post put into draft mode or permission change, etc...) 4) Patching file using default settings instead of what is stored in config

The filtering issue in the previous commit took a while to figure out as I was thinking it was the plugin issue when it was not.

Anyways hopefully that is it for this plugin except I guess issue #1105 where Bing has a new way to ping

mystralkk commented 2 years ago

@eSilverStrike , I am too busy this week, and I am afraid I will have no time to work on Geeklog till the first or second week of March.

eSilverStrike commented 2 years ago

Yup no problem I've been busy as well and haven't had as much time lately. I will probably give it another round of testing myself soon..