Bashamega / WebDevTools

Web Dev Tools is a comprehensive online platform designed to empower web developers with a wide array of code samples and snippets.
https://wdt.adambashaahmednaji.com/
MIT License

#108 Sitemap XML Generator #441

Closed LenVavro closed 1 month ago

LenVavro commented 2 months ago

Description

Implements issue #108: a sitemap XML generator that recursively iterates through a website's pages, processes their HTML, and parses links to create a sitemap.xml file.

Key points

Lightweight

To keep it lightweight, I decided not to use Playwright, Puppeteer, or a similar package, just a simple fetch and regex. Furthermore, to minimize RAM usage, I process the HTML response as a stream, so only one chunk of the HTML is held in memory at a time and it is processed immediately. The generated sitemap is also streamed to the frontend, so the user can see the progress in real time.
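
A minimal sketch of the chunk-by-chunk link extraction described above. It is not the PR's actual code: `createLinkExtractor` and its regex are hypothetical, and the sketch only handles quoted `href` attributes on `<a>` tags. It keeps a small tail of text between chunks so that a tag split across two chunks is still matched.

```javascript
// Hypothetical helper (not from the PR): extracts href values from HTML
// that arrives in chunks, holding only a short unprocessed tail in memory.
function createLinkExtractor() {
  // Matches <a ... href="value"> with single or double quotes.
  const hrefPattern = /<a\b[^>]*?\shref\s*=\s*["']([^"']+)["']/gi;
  let carry = ""; // unprocessed tail of the previous chunk

  return function push(chunk) {
    const text = carry + chunk;
    const links = [];
    let consumed = 0;
    let match;

    hrefPattern.lastIndex = 0;
    while ((match = hrefPattern.exec(text)) !== null) {
      links.push(match[1]);
      consumed = hrefPattern.lastIndex;
    }

    // Keep everything from the last "<" after the final match, because
    // that tag may be completed by the next chunk.
    const rest = text.slice(consumed);
    const lastOpen = rest.lastIndexOf("<");
    carry = lastOpen === -1 ? "" : rest.slice(lastOpen);

    return links;
  };
}

// Usage: a tag split across two chunks is still found.
const push = createLinkExtractor();
console.log(push('<p>Hi</p><a href="/a">A</a><a hr'));
console.log(push('ef="/b" class="x">B</a>'));
```

In a fetch-based crawler, each chunk read from `response.body` would first be decoded with a `TextDecoder` (using `{ stream: true }`) and then passed to `push`.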

Limitations

Getting HTML content from a different site is almost impossible without a backend because of browsers' CORS policies, so I had to fetch the website's content on the server. However, the site is hosted on Vercel, which enforces a timeout on server/edge functions. I therefore set the runtime to edge, which should allow streaming the response beyond the 25s limit (source).
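
A rough sketch of how a streamed sitemap response could be assembled, under stated assumptions: `escapeXml`, `sitemapChunks`, and `toStreamingResponse` are hypothetical names, not the PR's functions, and the URLs are supplied as a ready-made list rather than discovered while crawling. In a Next.js route file, the edge runtime mentioned above is selected with the route segment config `export const runtime = "edge"`, which is left out here so the snippet runs on its own.

```javascript
// Entity-escape the characters that must not appear raw in sitemap URLs.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Yields the sitemap piece by piece, so each <url> entry can be sent
// as soon as it is known.
function* sitemapChunks(urls) {
  yield '<?xml version="1.0" encoding="UTF-8"?>\n';
  yield '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n';
  for (const url of urls) {
    yield `  <url><loc>${escapeXml(url)}</loc></url>\n`;
  }
  yield "</urlset>\n";
}

// Wraps the chunks in a ReadableStream so the client receives them
// incrementally instead of waiting for the whole document.
function toStreamingResponse(chunks) {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.iterator]();
  const stream = new ReadableStream({
    pull(controller) {
      const { value, done } = iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "application/xml; charset=utf-8" },
  });
}
```

A route handler could then return `toStreamingResponse(sitemapChunks(urls))`. A real crawler would more likely enqueue entries from an async loop as pages are fetched.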

Limit visited pages

A visited page is a URL whose content was fetched and processed. To cap the number of visited pages, I added a limit that can be set with the new env property NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT; if you leave it empty, the default limit is 100, meaning at most 100 pages will be in the final sitemap.xml. This limit is important to save hosting resources.
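
To illustrate how such a cap behaves, here is a simplified, synchronous sketch. `getLimit` and `crawl` are hypothetical names, and `getLinks` stands in for "fetch the page and parse its links", which in the real generator is asynchronous. How the PR treats invalid values is not stated in the thread, so falling back to 100 for them is an assumption.

```javascript
// Reads the limit from a raw env value. Falls back to 100 when the value
// is empty; doing the same for non-numeric or non-positive values is an
// assumption of this sketch.
function getLimit(rawValue, fallback = 100) {
  const parsed = Number.parseInt(rawValue ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Breadth-first crawl that stops once `limit` pages have been visited.
function crawl(startUrl, getLinks, limit) {
  const visited = new Set();
  const queue = [startUrl];

  while (queue.length > 0 && visited.size < limit) {
    const url = queue.shift();
    if (visited.has(url)) continue; // already processed via another link
    visited.add(url);
    for (const link of getLinks(url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return [...visited];
}

// Usage with an in-memory link graph instead of real HTTP requests.
const graph = { a: ["b", "c"], b: ["c", "d"], c: ["a"], d: [] };
const linksOf = (url) => graph[url] ?? [];
console.log(crawl("a", linksOf, getLimit(undefined))); // visits all four pages
console.log(crawl("a", linksOf, getLimit("3")));       // stops after three
```

With a limit of 3, page `d` never makes it into the result even though it is reachable, which is the clipping effect a low limit has on larger sites.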

vercel[bot] commented 2 months ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| web-dev-tools | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 14, 2024 7:09am |

coderabbitai[bot] commented 2 months ago

Walkthrough

The changes introduce a feature for generating XML sitemaps, including the addition of a new environment variable in the .env.example file, an API endpoint for sitemap generation, a React component for user interaction, and utility functions for URL validation and limit retrieval. A comprehensive test suite for the sitemap generation function has been established, and a new entry for a sitemap generator tool has been added to the tools JSON file.

Changes

| File Path | Change Summary |
| --- | --- |
| `.env.example` | Added environment variable NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT. |
| `__tests__/lib/generator/sitemapXml.test.js` | Introduced a test suite for generateSitemapXML with mock fetch, including three main tests. |
| `src/app/api/generator/sitemap-xml/route.js` | Added API endpoint for generating XML sitemaps with error handling for URL validation. |
| `src/app/generator/sitemap-xml/page.jsx` | Introduced a React component for generating sitemaps, including state management and error handling. |
| `src/db/tools.json` | Added new tool entry for "Sitemap XML Generator" with id 28 and link "/generator/sitemap-xml". |
| `src/lib/generator/sitemapXml.js` | Added functions for generating XML sitemaps, including generateSitemapXML and utility functions. |
| `src/lib/utils.js` | Added functions isUrlValid(url) and getSitemapXmlGeneratorLimit() for URL validation and limit retrieval. |
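
The thread does not show the body of `isUrlValid(url)`. One plausible way to write such a check, offered only as a sketch and not as the PR's implementation, is to lean on the standard `URL` constructor and accept only HTTP(S) addresses:

```javascript
// Hypothetical version of a URL check: the PR's real isUrlValid may differ.
function isUrlValid(url) {
  try {
    const parsed = new URL(url); // throws a TypeError on malformed input
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false;
  }
}

console.log(isUrlValid("https://example.com/docs"));
console.log(isUrlValid("ftp://example.com"));
console.log(isUrlValid("not a url"));
```

Restricting the protocol keeps the server from being asked to fetch `file:`, `javascript:`, or other non-web schemes.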

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ReactComponent
    participant API
    participant SitemapGenerator

    User->>ReactComponent: Input URL
    ReactComponent->>API: Fetch sitemap for URL
    API->>SitemapGenerator: Generate sitemap
    SitemapGenerator-->>API: Return sitemap XML
    API-->>ReactComponent: Return sitemap XML
    ReactComponent-->>User: Display sitemap

🐰 "In the garden where sitemaps bloom,
A new tool has come to dispel the gloom.
With URLs valid and limits in sight,
We generate sitemaps, oh what a delight!
So hop along, let’s fetch and create,
In the world of XML, we celebrate!" 🌼

LenVavro commented 2 months ago

Hi @Bashamega, could you please have a look at my PR? Thanks.

Bashamega commented 2 months ago

Screenshot from 2024-11-13 20-22-06 it gives an error

Hello @annuk123. This was an issue with a tool, and I forgot to fix it. It should be solved now.

annuk123 commented 2 months ago

> Screenshot from 2024-11-13 20-22-06 it gives an error
>
> Hello @annuk123. This was an issue with a tool, and I forgot to fix it. It should be solved now.

okay

LenVavro commented 2 months ago

@Bashamega

> The current sitemap API isn’t generating a full sitemap for my large site; it only captures part of it.

That's probably because of the limit (100), meaning only 100 pages will be processed (HTML fetched and links parsed). xml-sitemaps.com also has a limit of 500 in its free tier. But please share the page so I can verify.

> I’m already on Vercel’s free plan, so I’d prefer a solution that doesn’t require upgrading. Are there any prebuilt sitemap APIs we could use to handle this?
>
> I appreciate your enthusiasm for developing the API, but I’m concerned about the potential ongoing costs.

I understand; however, I don't think there is a need to upgrade hosting or worry about cost. You can change the limit (which will affect RAM usage and execution time) based on usage. For users who need more, this is an open-source project: they have access to the source code and can run it themselves, self-host, copy, modify it, etc.

LenVavro commented 2 months ago

> Screenshot from 2024-11-13 20-22-06 it gives an error

I've just added a better error message.

Bashamega commented 2 months ago

Thank you for the prompt reply @LenVavro

This is the website that I have tried it on: https://adambashaahmednaji.com/ I fear that the API will exceed the limit, but we can push it to prod and see what happens.

LenVavro commented 2 months ago

> Thank you for the prompt reply @LenVavro
>
> This is the website that I have tried it on: https://adambashaahmednaji.com/ I fear that the API will exceed the limit, but we can push it to prod and see what happens.

I've checked it, @Bashamega, and found no issue; the reason for the clipped sitemap is the default limit (100). You can increase it easily using the env variable NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT=.

From my side, everything's ready for the merge.

Bashamega commented 2 months ago

> > Thank you for the prompt reply @LenVavro
> >
> > This is the website that I have tried it on: https://adambashaahmednaji.com/ I fear that the API will exceed the limit, but we can push it to prod and see what happens.
>
> I've checked it, @Bashamega, and found no issue; the reason for the clipped sitemap is the default limit (100). You can increase it easily using the env variable NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT=.
>
> From my side, everything's ready for the merge.

Can we use another API so that it can scrape the whole website? I don't want the generated sitemaps to be incomplete. I don't have a problem with using a free third-party API.

LenVavro commented 1 month ago

@Bashamega you can set the limit to whatever you want using NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT, e.g. 10 million, and it will certainly generate the full sitemap. But once again, even xml-sitemaps.com has a limit of 500 in its free tier, as you can read on their homepage.

I didn't find any free API for this purpose, and I don't think anyone will provide their server resources for free or without limits. Vercel is already providing free hosting and I've implemented it, so in a way, this is the only available free and unlimited API to generate sitemaps 😄