Introduce search across all of HexDocs

josevalim commented 9 months ago

The goal of this feature is to provide search and autocompletion across packages. We will add a new configuration, called related_deps, which is a list of package names we find related. We will improve both autocomplete and search to use this, such that:

Autocompletion
- Without related_deps
  - Only autocompletes the current project (current behaviour)
- With related_deps
  - Autocompletes the current package and all related dependencies

Full-text search
- Without related_deps
  - By default searches the current project (current behaviour)
  - We will show radio buttons that allow you to customize the search. The options are "[ ] Current project" (default) and "[ ] HexDocs"
- With related_deps
  - By default searches the current project and all related deps
  - We will show radio buttons that allow you to customize the search. The options are "[ ] Current project", "[ ] Current project + Related packages" (default), and "[ ] HexDocs"
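
The scoping rules above can be sketched as a few small functions. This is only an illustration in Python, not ExDoc code: the option identifiers (`current`, `related`, `hexdocs`) and all function names are assumptions made up for the example; only the radio button labels and defaults come from the description.

```python
# Hypothetical option identifiers; the issue only defines the radio button labels.
CURRENT = "current"
RELATED = "related"
HEXDOCS = "hexdocs"


def search_options(related_deps):
    """Return (available options, default option) for full-text search."""
    if related_deps:
        return [CURRENT, RELATED, HEXDOCS], RELATED
    return [CURRENT, HEXDOCS], CURRENT


def packages_in_scope(current, related_deps, option):
    """Return the package names to search, or None for all of HexDocs."""
    if option == HEXDOCS:
        return None
    if option == RELATED:
        # Deduplicate while keeping the current package first.
        return list(dict.fromkeys([current, *related_deps]))
    return [current]


def autocomplete_scope(current, related_deps):
    """Autocomplete covers the current package plus any related deps."""
    return list(dict.fromkeys([current, *related_deps]))
```

Returning `None` for the HexDocs option is a design choice of the sketch: it lets a backend distinguish "no package filter" from "an explicit list of packages".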

To power this feature, we will build a new service that does both autocompletion and search based on a SQLite3 database. We have proofs of concept from:

@ruslandoga who shared his notes here: https://gist.github.com/ruslandoga/7f0f5b68d760ec5b3e650e7f73f694f2
@jeregrine who posted his code here: https://github.com/jeregrine/hex-search

The SQLite3 database can be built weekly and it currently takes about 1 hour. It should also include the entries for Elixir (and Erlang once they migrate to ExDoc). We can make it a live service later too (by keeping our own database, perhaps PG, and then dumping it daily). There is an open question of whether we want to host the SQLite3 builder on Hex.pm.
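
To make the idea of a SQLite-backed autocomplete concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `items` table, its columns, and the function names are assumptions for illustration; the linked proofs of concept may well use a different schema or full-text indexing, whereas this sketch swaps in a plain prefix `LIKE` query restricted to a set of packages.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory SQLite database of (package, title, ref) rows."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per searchable item.
    db.execute("CREATE TABLE items (package TEXT, title TEXT, ref TEXT)")
    db.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
    return db


def autocomplete(db, prefix, packages=None, limit=10):
    """Return (package, title, ref) rows whose title starts with prefix.

    packages=None searches every package; otherwise only the listed ones.
    """
    # Escape LIKE wildcards so user input is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = "SELECT package, title, ref FROM items WHERE title LIKE ? ESCAPE '\\'"
    params = [escaped + "%"]
    if packages is not None:
        placeholders = ", ".join("?" for _ in packages)
        sql += f" AND package IN ({placeholders})"
        params.extend(packages)
    sql += " ORDER BY title, package LIMIT ?"
    params.append(limit)
    return db.execute(sql, params).fetchall()
```

Passing the current package plus its related_deps as `packages` gives the scoped autocomplete; passing `None` gives the HexDocs-wide variant.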

josevalim commented 9 months ago

Btw, I have a dump of the database already, in case someone wants to use it for a proof of concept. Just ping me elsewhere and I will send a link. We should also skip any license.html and changelog.html files we find.
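
As a small sketch of the skip rule for an indexer, assuming pages are identified by their file path: the two file names come from the comment above, while the function name and the case-insensitive comparison are assumptions.

```python
from pathlib import PurePosixPath

# File names taken from the comment above; case-insensitive matching is an assumption.
SKIPPED_PAGES = {"license.html", "changelog.html"}


def should_index(path):
    """Return True unless the page is a license or changelog page."""
    return PurePosixPath(path).name.lower() not in SKIPPED_PAGES
```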

ruslandoga commented 9 months ago

@josevalim 🙋‍♂️ I'd like to compare the dump with the data I've scraped.

Also, would it be possible to get access to fastly logs to calculate co-occurrence metrics for packages? I think it can add one more useful dimension to order the search results by.

josevalim commented 9 months ago

Getting access to logs is probably difficult but the Hex team may accept a PR that adds this computation. I cannot answer for them though, so you will have to ask. :)

jeregrine commented 9 months ago

> Also, would it be possible to get access to fastly logs to calculate co-occurrence metrics for packages? I think it can add one more useful dimension to order the search results by.

You could look at the dependency graph and weigh by downloads and get a crude measurement of it.

> The SQLite3 database can be built weekly and it currently takes about 1 hour. It should also include the entries for Elixir (and Erlang once they migrate to ExDoc). We can make it a live service later too (by keeping our own database, perhaps PG, and then dumping it daily). There is an open question of whether we want to host the SQLite3 builder on Hex.pm.

The code actually only grabs new packages and re-indexes them since Hex can sort by updated_at. So you could run that daily and it would take seconds.
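
The incremental approach can be sketched as a filter over `updated_at` timestamps. The `(name, updated_at)` pair shape and the function name are hypothetical and do not reflect the actual Hex API response or the linked code.

```python
from datetime import datetime, timezone


def packages_to_reindex(packages, last_run):
    """Return names of packages updated after last_run, newest first.

    packages is an iterable of (name, updated_at) pairs with timezone-aware
    datetimes; this shape is hypothetical, not the Hex API response format.
    """
    changed = [(name, ts) for name, ts in packages if ts > last_run]
    changed.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in changed]


last_run = datetime(2024, 1, 10, tzinfo=timezone.utc)
packages = [
    ("ash", datetime(2024, 1, 12, tzinfo=timezone.utc)),
    ("ecto", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    ("phoenix", datetime(2024, 1, 11, tzinfo=timezone.utc)),
]
print(packages_to_reindex(packages, last_run))  # ['ash', 'phoenix']
```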

One of the reasons it's so slow is that the JSON containing the indexable items, sidebar_items-<rand_id>.js and search_items-<rand-id>.js, has a name that is always different, so I need to GET the HTML, find the script src, then GET the JS; then do the same for the search page. Changing the rand_id to a query string for cache busting, like search_items.js?vsn=<rand-id>, would mean I could make only 2 requests and skip parsing HTML.
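
The difference between the two flows can be sketched like this. The regular expression, the `dist/` directory for these files, and both function names are assumptions for illustration; the sketch also assumes that, under the proposed naming, the server would serve the file at the bare path without the `vsn` parameter.

```python
import re

# Matches e.g. dist/search_items-1A2B3C.js; the exact id alphabet is an assumption.
SCRIPT_SRC = re.compile(
    r'<script[^>]*\bsrc="([^"]*(?:sidebar|search)_items-[^"/]+\.js)"'
)


def hashed_script_paths(html):
    """Current flow: scan a fetched HTML page for the hashed script paths."""
    return SCRIPT_SRC.findall(html)


def stable_script_url(base_url, kind):
    """Proposed flow: a fixed path, so no HTML fetch or parsing is needed.

    The ?vsn= query string would only be for cache busting, so this sketch
    assumes a crawler could omit it.
    """
    return f"{base_url.rstrip('/')}/dist/{kind}_items.js"
```

With the hashed names, each package costs an HTML request plus a JS request per file; with stable names, the crawler can build both URLs directly.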

> @ruslandoga who shared his notes here: https://gist.github.com/ruslandoga/7f0f5b68d760ec5b3e650e7f73f694f2

@ruslandoga nice idea with the SQLite C function. I did it the lazy way with SQL and it's not too slow: https://github.com/jeregrine/hex-search/blob/main/lib/hex_docs_search/hex.ex#L50

ruslandoga commented 9 months ago

> You could look at the dependency graph and weigh by downloads and get a crude measurement of it.

Can you please elaborate? I'm not seeing how it would be able to estimate the co-occurrence metric.

josevalim commented 9 months ago

@jeregrine oh, so you skip downloading the whole docs tar?

jeregrine commented 9 months ago

> @jeregrine oh, so you skip downloading the whole docs tar?

Didn't even know it was downloadable. :-) But yeah, I don't do that. It might be faster, at the cost of more disk/memory usage. ¯_(ツ)_/¯

> Can you please elaborate? I'm not seeing how it would be able to estimate the co-occurrence metric.

Actually, the more I think about it, nvm. It's messy.

rhcarvalho commented 9 months ago

In the current design, would this require packages to update their ex_doc dependency and release a new version or would it work regardless of which version of ex_doc was used to generate the documentation?

ruslandoga commented 9 months ago

👋 @rhcarvalho

The new search functionality (assets/js) would only be present in the new ex_doc version, so I think it's more likely that the packages would need to upgrade to get global search from their documentation pages. But for a package to be indexable, they don't need to upgrade.

zachdaniel commented 7 months ago

👋 hey everyone, just checking in. Is this in progress? If so, any way I can assist? If not, I may be able to help get it off the ground :)

josevalim commented 7 months ago

There is a delay because we are also investigating if it makes sense to add embeddings to the docs, so we can also use it to provide context for LLMs (such as OpenAI). I will try to post more information soon. :)

zachdaniel commented 7 months ago

Sounds good! Thanks for your hard work. Not trying to hurry. I'm happy to wait, just want to assist if possible/warranted.

josevalim commented 7 months ago

That's really good to know. I will reach out once we have an action plan, unless you are also happy to get involved in the "figure it out" process and write some JS too? :)

zachdaniel commented 7 months ago

Yeah, I'd be very happy to be involved in any way. Cross package search is a major win for the Ash ecosystem, and is absolutely worth me spending my time on.

couhajjou commented 7 months ago

I see 4 search planes:

repo - current repo
deps - all deps (from mix.deps)
framework (set by framework author)
pinned repos (set by user)

Please empower the user

couhajjou commented 7 months ago

I am WIP-ing 'pinned repos' in ex_doc. It's the most versatile. The idea is that any repo should have a JSON file, search_data.json.

It's just the JSON version of this file: https://hexdocs.pm/ash_postgres/dist/search_data-C114CB12.js

both search_data.js and search_data.json will include the package info like this:

That would allow the UI to ingest the search_data.json files of the pinned repos

and display the info like this

And we need to change the UI a bit, but that idea was already sketched in this post. It's just a matter of a little UI design.

Pinned repos can be stored in:

- local storage
- or a Chrome extension
- or an account on hexdocs (so that hexdocs can have all our emails ;)

It's not a big change to ex_doc.

And ofc we need to keep caching and versioning as it is now in search_results_72517.js

josevalim commented 7 months ago

We explored this, but sometimes those files can be really large and building an index of all of them in real time would become very expensive. Often the resulting index was so large that it would blow up local storage, which would cause us to index them every time, making it worse.

couhajjou commented 7 months ago

@josevalim, I am not sure that you read my comment here: https://github.com/elixir-lang/ex_doc/issues/1811#issuecomment-1890037931

here it is again:

I see 4 search planes:

1- repo - current repo
2- deps - all deps (from mix.deps)
3- framework (set by framework author)
4- pinned repos (set by user)

I am addressing here the solution 4-pinned repos.

in the local storage we just store the list like this:

pinned-packages: [
  {
     package: 'ash',
     search-index-url: 'http://hexdocs/ash/searchIndex.json'
  },
  {
     package: 'ash_postgres',
     search-index-url: ....
  }
]

It's the user who decides which repos they want to 'pin'.

The Ash search index is 104KB and sits in the browser cache; 10 Ash repos would be around 1MB.

So for Ash framework users it will be a few bytes in local storage and 1MB in the cache.

Please correct me if I am missing something, as I am WIP-ing this.

couhajjou commented 7 months ago

Here is the architecture and the UI I propose for search

1- repo - current repo ====> ex_doc feature. Offline and online search.
2- deps - all deps ====> hexdocs search engine. Online only, not available offline.
3- framework (set by framework author) ====> ex_doc feature. Offline and online.
4- pinned repos (set by user) ====> ex_doc feature. Offline and online.

So ex_doc search for 1, 3 and 4, and hexdocs search for 2.

We have to have one UI. The search input in ex_doc will be able to make:

call to ex_doc internal (this how it works now)
call to hexdocs search API (to be implemented)
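
The split between the two backends can be sketched as a lookup table. The plane names follow the numbered list in this comment; the backend labels and function names are hypothetical, and nothing here is an existing ex_doc or hexdocs API.

```python
# Plane names follow the numbered list above; the backend labels are hypothetical.
PLANES = {
    "repo": {"backend": "ex_doc", "offline": True},
    "deps": {"backend": "hexdocs", "offline": False},
    "framework": {"backend": "ex_doc", "offline": True},
    "pinned": {"backend": "ex_doc", "offline": True},
}


def backend_for(plane):
    """Return which search backend serves the given plane."""
    return PLANES[plane]["backend"]


def available_planes(online):
    """Return the planes that can be searched given connectivity."""
    return [name for name, info in PLANES.items() if online or info["offline"]]
```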

I am just WIP-ing 3 and 4. 1 is working. 2 is a hexdocs project; it needs someone like Algolia.

So with 1, 3 and 4 I can do some Ash and Phoenix coding on the plane @zachdaniel ;)

couhajjou commented 7 months ago

A complication to discuss later: you can pin online repos and/or local repos (if they are on your HD).

Like mix.deps can have local and remote packages.

Sounds complex but can be simple ....

josevalim commented 7 months ago

Right. But you can imagine a new user would also want to pin Elixir itself, and we know for a fact Elixir was too big to cache (so we added compression). Ecto and Phoenix are on the larger side too. So I wonder if those three would not be enough to blow up session storage space?

couhajjou commented 7 months ago

Local cache is 10MB. The Elixir search index is 2MB. On the plane it's not a problem, as we are loading from disk. Online we might have a cache miss, that's life :) Then the browser hits the CDN. If you want to cap everything to 10MB you can, and make it like an Amazon Kindle and tell the user they don't have more storage, with a UI like this:

Pinned Repos	Size	Actions
ash	108 KB	[Unpin] [Download local doc] <- Keep coding while on the plane to ElixirConf
elixir	2.1 MB	[Unpin] [Download local doc] //sessionStorage.removeItem(elixir)
Total	2.2 MB

I am saying let's empower the user. The persona is a dev, so it's OK if the UX is a bit technical.

---This is a tangent and maybe crazy ----

This is a tangent, but we could also ideate a Chrome extension UX. Don't we need one for Phoenix? It could be the Phoenix Chrome extension and we could put other things in there.

A level of gamification is to track the most pinned repos. Like github (forks/stars). It can create another dynamic, with prizes in ElixirConf. It's a design technique used in Building Architecture. You take a technical limitation and make it useful and elegant. (Designer Trick)

josevalim commented 7 months ago

Interesting...

However, I think we have to be a bit less optimistic. We still need session storage for other indexes. For example, imagine you fill in your index with 9MB without Elixir. Now, without additional space, if you try to search Elixir docs, it will go through the slow path and rebuild the index every time. So maybe 7MB of custom search max.
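
The budget argument can be made concrete with a rough sketch. The 7MB figure and the package sizes are the ones quoted in this thread; the function name and the first-fit admission policy are assumptions, and real browser storage limits vary.

```python
KB = 1024
MB = 1024 * KB

# Figure quoted in the thread; real browser limits vary.
CUSTOM_SEARCH_BUDGET = 7 * MB


def fit_pinned_indexes(pinned, budget=CUSTOM_SEARCH_BUDGET):
    """Split pinned packages into those whose indexes fit the budget and the rest.

    pinned is a list of (package, index_size_in_bytes) in the user's pin order.
    Each package is admitted, in order, if it still fits in the remaining budget.
    """
    cached, skipped, used = [], [], 0
    for package, size in pinned:
        if used + size <= budget:
            cached.append(package)
            used += size
        else:
            skipped.append(package)
    return cached, skipped, used
```

Packages that end up in `skipped` would take the slow path described above, rebuilding their index on every search.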

And you are right about empowering the user... but should we realistically expect users to craft their own search engines? Projects like Elixir, Phoenix, and Ash need the search to work across several repos out of the box and I would focus on that instead. The good news is that I am quite sure your ideas could be fully explored as a separate project!

PS: in any case, I don't think this solves the airplane case either. You have the search contents but not the rendered pages themselves. You could try to rebuild them from the index but not all information is available.

couhajjou commented 7 months ago

For the plane use case, when I am working within a project all I need is within my_ash_project/deps.

Ex_doc should reach there. Think of my_ash_project/deps as a cache for hexdocs.

I understand you want to do 2. But until then, ex_doc or a fork of it can do 1, 3 and 4.

I would use it locally. My understanding is that you allow different documentation tools.

So for me it's not either ex_doc or hexdocs search. It's both of them.

If you decide to enforce a certain builder on hexdocs, I'll respect that.

And I can use ex_doc_mutirepo_search as a local book on my computer. I love having a physical copy on my disk. Ex_doc is great and we can make it better.

Thanks


josevalim commented 7 months ago

I see, that definitely feels out of scope for ExDoc then. :) I recommend exploring this on your own, something that builds the docs in the deps folder and creates a unified search interface. Bonus points if it works both online and offline. Meanwhile, let's please refocus this issue on its original description. Thank you!

couhajjou commented 7 months ago

@josevalim, in that case I would suggest moving the hexdocs search feature, as you envisioned it, to the hexdocs repo.

Here are my arguments:

ex_doc is not hexdocs.
ex_doc is an HTML eBook generator. The generated eBook is searchable and self contained. Search feature is part of the generated book.
hexdocs is a book library. The book library should have a search engine.
ex_doc search is client based
hexdocs search is server based
hexdocs search architecture is to be done within hexpm/hexdocs team/project effort
hexpm should publish a protocol that has to be satisfied by package authors who want their documentation to be searchable by hexdocs.
that protocol will be implemented by ex_doc version X, and the upgrade will be seamless: upgrade ex_doc, run mix docs
From a business perspective:
- ex_doc is a product (distributed through GitHub)
- hexdocs is a service (run by the hexpm organisation)

I suggest we figure out the technical design/architecture of the search functionality. We have 2 products/services (ex_doc, hexdocs).

For UX I would suggest the Apple approach: one UX across physically separate, complementary devices.

One search experience through ex_doc and hexdocs, the user will not notice the discontinuity.

josevalim commented 7 months ago

That's historically how we have implemented features in Hexdocs that are used by ExDoc and that's most likely how we plan to implement this one too: Hexdocs provides a generic interface for others to hook into and ExDoc simply acts as one of the clients.

josevalim commented 7 months ago

It all depends if the Hexdocs team wants to maintain a search service. If not, then a third service will consume Hexdocs packages (Hexdocs then works as "storage") and ExDoc then talks to said service. The feature is listed here because most of the work will be done by the ExDoc team anyway.

couhajjou commented 7 months ago

Interesting

An ex_doc with a plugin architecture would be cool (embedding search form and search results)

So that ex_doc wouldn't have a code dependency on hexdocs

And integrating different search engines (including Google) would be super easy and free

And one day AI search within ex_doc UX

You don't have to know ex_doc code base to implement a search plugin


ruslandoga commented 3 months ago

👋

I'm interested in working on this and would love to collaborate with anyone else currently involved! I'll start by revisiting the SQLite approaches and checking if there are better options available now (typesense, meilisearch, etc.).

josevalim commented 3 months ago

Hi @ruslandoga! At the moment, we are thinking about going with Postgres. We will compute our own text embeddings using machine learning models and store them with pgvector. What are your thoughts?

ruslandoga commented 3 months ago

👋 @josevalim oh right, I forgot about your comment above on wanting to add semantic search... Sorry! I should probably reread this thread.

With SQLite I kept the embeddings in a BLOB, loaded them all in memory on startup, and used https://github.com/elixir-nx/hnswlib as the index. That was too complicated and a bit resource-intensive; pgvector would likely make it much simpler and more efficient :)
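
The BLOB-based setup can be illustrated with a rough sketch, written in Python rather than Elixir. It is not the code from the experiment: the `docs` table, the float32 encoding, and all function names are assumptions, and an exact brute-force cosine search stands in for the HNSW index.

```python
import math
import sqlite3
from array import array


def to_blob(vector):
    """Pack a list of floats into a float32 BLOB."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Unpack a float32 BLOB back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def load_embeddings(db):
    """Load every (ref, vector) pair into memory, as would happen on startup."""
    rows = db.execute("SELECT ref, embedding FROM docs")
    return [(ref, from_blob(blob)) for ref, blob in rows]


def nearest(embeddings, query, k=3):
    """Exact nearest neighbours by cosine similarity (stand-in for an HNSW index)."""
    ranked = sorted(embeddings, key=lambda item: cosine(item[1], query), reverse=True)
    return [ref for ref, _ in ranked[:k]]
```

The brute-force scan is linear in the number of documents, which is the cost an approximate index such as HNSW, or a database extension such as pgvector, is meant to avoid.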

But I was rather wondering about the basic global search, like a global autocomplete, is that still planned? Would Postgres be used for that as well?

josevalim commented 3 months ago

Yes, the goal would be to use PG for that as well.

elixir-lang / ex_doc

Introduce search across all of HexDocs #1811