dsottimano / xmlsitemap-extractor-google-sheets

38 stars 22 forks source link

Having troubles pulling multiple sitemaps due to loading limits / array issues? #2

Open jwest75674 opened 5 years ago

jwest75674 commented 5 years ago

Hey, firstly a thank you and preemptive apology for my lack of Apps/JS skills!

I have this custom function working, which is superb, but I am working on a sitemap that has ~50 sitemaps within it's index (by month). I am having issues getting it to play nice with some other scripts due to the loading time involved.

While I recognize that load time can never be eliminated, I am wondering if there is a simple way to use the sitemapindex function, have it quietly build the index of sitemaps in the background, and then pull all URLs from all sitemaps found, in a single function.

I tried essentially editing the code myself, but am still in the early learning stages, so couldn't get it to work.

I think the issue is related to the sitemap function not really being coded with arrays in mind?

Anyways, would love your thoughts (for my own learning) and any help you can offer.

Josh

Thanks for your time, and these custom functions either way!

dsottimano commented 5 years ago

I love apps scripts, but the script execution function limit absolutely kills it's functionality. We get 6 minutes for a script (back end) to fully execute, and 30 seconds for a custom function (formula) to execute.

You could do this "sitemapindex function, have it quietly build the index of sitemaps in the background, and then pull all URLs from all sitemaps found, in a single function" but it would depend on how many sitemaps you're pulling in, and more importantly how many URLs - remember the 2 million cell limit.

"I think the issue is related to the sitemap function not really being coded with arrays in mind?" - Not sure what you mean here, the XML url nodes are parsed and then extracted through a loop.

Do you have code to share?

jwest75674 commented 5 years ago

Hey, Thanks for the reply!

For sure, I am totally with you, Apps scripts are awesome. As a guy with technical nature, Apps scripts are proving to be a really great way for me to slowly pick up on Javascript in the absence of formal training/treaching. (I am working on making my life/job more efficient, so no one is in the office teaching me or anything, as no one knows this stuff, just me venturing into the wild to build spreadsheets, lol)

As such, do take my thoughts with a grain of salt, I really am in a learning stage, so if I suggest something that does not make sense, that's probably because it does not, lol ("I think the issue is related to the sitemap function not really being coded with arrays in mind?" --> This thought came from an apps script help page speaking about optimization for custom functions.)

Yeah, I am nowhere near the 200 cell limit. The issue is the site I am working on has been building one sitemap per month for a number of years now, but in the sitemap is only included the "new" content posted that month. Something like ~200 URLS total, but spread thinly over ~80 sitemaps. The issue is I am having to make a ton of calls for avg. 3 urls per sitemap.

Sure, I rewrote your code a 1/2 dozen times trying to ham-fist it into working (It's actually the loop function I think I need to read up on more). --> Actually, generally everything needs more learning, hahaha.

I will copy my current code iteration here when I get to work later.

jwest75674 commented 5 years ago

Here's my (failed) attempt at combining these two functions into a new one:

Updating this as I continue to try to get it working, ignore the edits ,lol.

// Copyright, unchanged, sitemap and sitmapIndex functions both unchanged above as well

/**
* Attempting to return all URLs in sitemap index file
*
* @param {"https://www.google.com/sitemap.xml"} sitemapIndexUrl REQUIRED The url of the sitemap
* @param {"http://www.sitemaps.org/schemas/sitemap/0.9"} namespace REQUIRED Look at the source of the xml sitemap, look for the xmlns value 
* @return Returns urls <loc> from an xml sitemap
* @customfunction
*/

function sitemapIndexAll(sitemapIndexUrl,namespace) {

  try {
    var xml = UrlFetchApp.fetch(sitemapIndexUrl).getContentText();
    var document = XmlService.parse(xml);
    var root = document.getRootElement()
    var sitemapNameSpace = XmlService.getNamespace(namespace);
    var xmlMap = UrlFetchApp.fetchAll(mapLocs).getContentText(); // Edit to fetchAll

    var maps = root.getChildren('sitemap', sitemapNameSpace)   //Edit to speak to maps, not URLS to save for myself confusion
    var mapLocs = []                                           //Edit to speak to maps, not URLS to save for myself confusion

    var urls = root.getChildren('url', sitemapNameSpace)
    var locs = []

    // Push all found sitemaps to mapLocs array

    for (var i=0;i <maps.length;i++) {
      mapLocs.push(maps[i].getChild('loc', sitemapNameSpace).getText()) 
    }

    // Push all found URLs across all sitemaps into locs array

    for (var x=0;x <urls.length;x++) {
      locs.push(urls[x].getChild('loc', sitemapNameSpace).getText())
    }

    return locs  
  } catch (e) {
    return e 
  }
}
RCheesley commented 5 years ago

Just stumbled across this - same issue here, in that it's a WP site which generates bazillions of sitemaps linked from the master sitemap. Would love to know if there's a way to work with this before I start manually processing them!

Most of the sitemaps only have like 1-2 resources on them, so it's rather infuriating!