[Feature Request]: Docs to be crawler friendly, and LLM discoverable

nikhil-swamix commented 6 months ago

Describe the problem

I tried loading the docs with the requests library and parsing the result, but because of the tabbed JS and Python examples, the page needs a browser to render. Example
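For anyone reproducing this, one rough way to tell whether a fetched page depends on client-side rendering is to measure how much visible text the raw HTML carries. This is a minimal sketch using only the standard library; the function names and the 200-character threshold are assumptions for illustration, not anything taken from the Chroma docs tooling.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, ignoring anything inside non-visible elements."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a reader would see without running any JavaScript."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_client_rendered(html, min_chars=200):
    """Heuristic (assumed threshold): very little static text suggests JS rendering."""
    return len(visible_text(html)) < min_chars
```

The HTML string can come from any plain HTTP fetch, for example `requests.get(url).text`. A `True` result only hints that a browser engine is needed; a page with a short but complete body would also trip the threshold.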

Describe the proposed solution

Static site generation. Drop Next and React once and for all! They are meant for really large projects with heavy components, not for documentation, so they are not the right tech stack here (my opinion; I prefer Svelte). I can copy the same content from the Markdown source files on GitHub, but that does not allow systematic discovery via crawlers, whose output can be indexed by an LLM for RAG purposes to prevent hallucinations.

In short: a prerender flag or server-side rendering in Next.

Alternatives considered

Hugo or some other well-known SSG

Importance

nice to have

Additional Information

Minor priority, but very useful.

tazarov commented 6 months ago

@nikhil-swamix, thanks for your explanation. There are several reasons why we use markdoc and next, and this is unlikely to change.

We value user experience over bot/crawler experience (as you pointed out, anyone that needs to index the docs can use the GH markdown files)
We want visual continuity between Chroma docs and the hosted platform (which will be coming out soon).
While we value your and the rest of the community's opinions, we do certain things a certain way 😀

Have you considered using a library other than requests? Have a look here for inspiration: https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/
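Indexing from the GitHub Markdown files, as suggested above, could look roughly like this once the repository is cloned locally. The docs directory path and the function name are assumptions; the sketch only walks a folder and reads `.md` files, leaving chunking and embedding to whatever pipeline consumes the result.

```python
from pathlib import Path


def collect_markdown(docs_root):
    """Map each Markdown file's path (relative to docs_root) to its text."""
    root = Path(docs_root)
    pages = {}
    for path in sorted(root.rglob("*.md")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            # errors="replace" keeps the crawl going if a file is not valid UTF-8
            pages[key] = path.read_text(encoding="utf-8", errors="replace")
    return pages
```

Calling `collect_markdown("path/to/clone/docs")` (a hypothetical path) returns a dictionary that can be fed to an indexer. This only works for projects that keep their docs as Markdown in the repository.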

nikhil-swamix commented 6 months ago

Thanks for the update. I understand it would mean changing many things. However, I accomplished my objective with a different architecture: a WebKit engine running behind a server, which returns the rendered page, with JavaScript support, for a given URL. requests was too basic for this. It is the same way Googlebot crawls single-page apps.

I considered navigating the README and docs folder directly on GitHub, but that was not scalable, since a project may or may not have a docs folder, and the folder may be unorganized. For that, I'm building an auto-documentation engine with an LLM: git clone the source code, run this layer, and it provides baseline documentation when a codebase is poorly documented. I'm doing this on hundreds of projects, so I needed a universal solution.

Also, check out https://github.com/nikhil-swamix/UniversalDB , a meta-library I'm creating that provides a uniform way to access different types of DBs, including SQL, NoSQL, and vector. Its aim is to be as Pythonic as possible and to query the DB naturally. Let me know if similar functionality would benefit the Chroma project. Just a thought. Regards.
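The rendering setup described above (a WebKit engine behind a server that returns the rendered page for a given URL) might be sketched as follows. This is not the actual implementation from the comment: the choice of Playwright, the `/render?url=...` route, and all function names are assumptions. Playwright and its WebKit build have to be installed separately, so the import is kept inside the renderer.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def render_with_webkit(url):
    """Return the page HTML after scripts have run (assumes Playwright + WebKit)."""
    from playwright.sync_api import sync_playwright  # optional dependency

    with sync_playwright() as p:
        browser = p.webkit.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def handle_render_request(path, render=render_with_webkit):
    """Map a request path such as /render?url=... to a (status, body) pair."""
    target = parse_qs(urlparse(path).query).get("url", [""])[0]
    if urlparse(target).scheme not in ("http", "https"):
        return 400, "expected ?url=http(s)://..."
    try:
        return 200, render(target)
    except Exception:
        return 502, "render failed"


class RenderHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_render_request(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Serve on localhost only; each request launches a fresh browser."""
    HTTPServer(("127.0.0.1", port), RenderHandler).serve_forever()
```

After calling `serve()`, a crawler can request `http://127.0.0.1:8000/render?url=<percent-encoded page URL>` with any plain HTTP client and receive the rendered HTML. Launching a browser per request keeps the sketch simple but is slow; a long-running service would more likely reuse one browser instance.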

(Sent by email in reply to Trayan Azarov's comment of Fri, May 17, 2024, 9:54 PM.)

chroma-core / chroma