algolia / algolia-sitemap

a node library allowing you to generate sitemaps from an Algolia index.
https://yarn.pm/algolia-sitemap
MIT License

Sitemap results rounding #145

Open bomjastik opened 5 years ago

bomjastik commented 5 years ago

Hello.

I'm trying to create the sitemap, and everything works well besides one thing.

When I generate the sitemap from an index with 29,248 records, I get exactly 29,000 rows in the sitemap.

I tried the same with another index of 3,291 records and got exactly 3,000 rows.

For smaller indexes with fewer than 100 records, the row count is correct.

Is there any reason why the results might be rounded down like this?

My index.js:

const algoliaSitemap = require('algolia-sitemap');

const algoliaConfig = {
    appId: '',
    apiKey: '',
    indexName: '',
};

function hitToParams({pretty_url, objectID, station_name}) {
    const url = pretty_url => `some_url`;

    const loc = url(pretty_url);

    const lastmod = new Date().toISOString();

    const priority = 0.8;

    return {
        loc,
        lastmod,
        priority,
    };
}

algoliaSitemap({
    algoliaConfig,
    sitemapLoc: 'https://mydomain.com/sitemaps',
    outputFolder: 'sitemaps',
    hitToParams,
}).then(() => {
    console.log('Done generating sitemaps'); // eslint-disable-line no-console
})
    .catch(console.error);

Regards.

Haroenv commented 5 years ago

Is this happening with v2.2.0? Did you check whether it also happens for you on 2.1.0? Thanks!

alexpchin commented 4 years ago

Hi @Haroenv, I've just started implementing this too. I currently have 2,159 records, but I'm only getting 2,000 URLs in sitemap.0.xml?

Haroenv commented 4 years ago

do you have multiple sitemap files @alexpchin? Does this also happen with appId: 'latency', indexName: 'instant_search', apiKey: '8baa3791b46c2a52259007747e743e07'?

alexpchin commented 4 years ago

Hi @Haroenv, I've just checked with:

algoliaSitemap({
    algoliaConfig: {
      appId: "latency",
      apiKey: "8baa3791b46c2a52259007747e743e07",
      indexName: "instant_search",
    },
    hitToParams: ({ objectID }) => {
      return {
        loc: `https://someurl.com/${objectID}`,
        lastmod: new Date().toISOString(),
        priority: 0.7,
      }
    },
  });

Only sitemaps/sitemap-index.xml and a single sitemaps/sitemap.0.xml are created. Inside sitemaps/sitemap.0.xml there are 21,000 URLs.

  const aggregator = async (args) => {
    let { hits, cursor } = args;
    do {
      if (!hits) {
        return;
      }
      batch = batch.concat(
        hits.reduce((entries, hit) => {
          const entry = hitToParams(hit);
          return entry ? entries.concat(entry) : entries;
        }, [])
      );
      if (batch.length > CHUNK_SIZE) {
        await flush();
      }
      ({ hits, cursor } = await index.browseFrom(cursor));
      console.log('hits -->', hits.length);
      // hits --> 1000   (printed 20 times, once per page)
      // hits --> 469
    } while (cursor);
    console.log('batch ---> is 21,000', batch.length);
    await handleSitemap(batch);
    const sitemapIndex = createSitemapindex(sitemaps);
    await saveSiteMap({
      sitemap: sitemapIndex,
      root: outputFolder,
      filename: 'sitemap-index',
    });
  };

Aside: at some point we could potentially add a check for whether the output folder already exists, and create it if not?

Haroenv commented 4 years ago

Thanks for checking, that indeed doesn't look right! Maybe the last batch doesn't have a cursor, and therefore the loop stops. We should switch it to a regular while loop and do the fetching at the beginning of the loop to avoid that, I think. The program flow will be much simpler.
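A regular while loop with the fetch at the top could look like the minimal sketch below. This is not the library's actual code: it uses a fake paginated source in place of the Algolia client to show why the current do/while drops the final page (whose response carries no cursor), and how fetching at the beginning of the loop keeps it.

```javascript
// Fake paginated index: three pages of hits; the FINAL response has no cursor.
const pages = [
  { hits: ['a', 'b'], cursor: 'c1' },
  { hits: ['c', 'd'], cursor: 'c2' },
  { hits: ['e', 'f'], cursor: undefined }, // last page: no cursor
];
function browseFrom(cursor) {
  if (cursor === undefined) return pages[0];
  return pages[Number(cursor.slice(1))];
}

// Current flow (as in the issue): process, fetch at the END, test cursor after.
// The page fetched together with an empty cursor is never processed.
function collectBuggy() {
  let { hits, cursor } = browseFrom(undefined);
  const batch = [];
  do {
    batch.push(...hits);
    ({ hits, cursor } = browseFrom(cursor));
  } while (cursor); // exits before the last page's hits are pushed
  return batch; // ['a', 'b', 'c', 'd'] — 'e' and 'f' are lost
}

// Suggested flow: fetch at the BEGINNING of a regular while loop.
function collectFixed() {
  const batch = [];
  let cursor;
  let done = false;
  while (!done) {
    const response = browseFrom(cursor); // fetch at the top of the loop
    batch.push(...response.hits);
    cursor = response.cursor;
    done = !cursor; // only stop AFTER the cursorless page is processed
  }
  return batch; // ['a', 'b', 'c', 'd', 'e', 'f'] — nothing dropped
}
```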

For your aside, that definitely makes sense. I think mkdir in Node now has a -p-style option we can use.

alexpchin commented 4 years ago

Hi @Haroenv, sorry, yes I was going to follow this up with a suggestion for a solution as the problem is with the cursor but I was eating some noodles...

We could solve this quickly with a run flag that is updated at the start of the while loop based on whether there is a cursor?

  const aggregator = async (args) => {
    let { hits, cursor } = args;
    let run = true;
    do {
      run = !!cursor;
      if (!hits) {
        return;
      }
      batch = batch.concat(
        hits.reduce((entries, hit) => {
          const entry = hitToParams(hit);
          return entry ? entries.concat(entry) : entries;
        }, [])
      );
      if (batch.length > CHUNK_SIZE) {
        await flush();
      }
      ({ hits, cursor } = await index.browseFrom(cursor));
    } while (run);
    await handleSitemap(batch);
    const sitemapIndex = createSitemapindex(sitemaps);
    await saveSiteMap({
      sitemap: sitemapIndex,
      root: outputFolder,
      filename: 'sitemap-index',
    });
  };

  return index.browse(params).then(aggregator);
}

This would work, but it is arguably a little more confusing in terms of flow.

Haroenv commented 4 years ago

I think that makes sense, but a regular while with the fetching first will be simpler probably :)

alexpchin commented 4 years ago

@Haroenv Did you want me to open a PR? I saw that there are a number of outstanding chores too?

Haroenv commented 4 years ago

All PRs that were open were for devDependencies, so not urgent for this project. A PR would be well appreciated if you confirm that this is the source of the bug. Thanks @alexpchin !

Pr00xxy commented 3 years ago

Hi @Haroenv, was there any follow-up on this issue? I've started prototyping an implementation using this library, but this bug is a blocker for now.

Haroenv commented 3 years ago

Does the fix suggested work for you @Pr00xxy?