Automattic / jetpack

Security, performance, marketing, and design tools — Jetpack is made by WordPress experts to make WP sites safer and faster, and help you grow your traffic.
https://jetpack.com/

Search: Index pdfs for searching #9023

Open gibrown opened 6 years ago

gibrown commented 6 years ago

We have support for indexing PDFs implemented for VIP sites, and there have been a few requests for it in general Jetpack Search. The indexing is here: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-wp-post-field-builder.php#L671

Open questions:

  1. Can we lower the 10MB file-size limit enough? There have been some hiccups with large files, even with the restriction that the files are hosted on VIP infrastructure. What happens when we fetch them from external sites on demand? How do we make sure we don't DDoS a site?
  2. Should we support other doc formats? Which ones?
  3. How should search results get displayed? For VIP we added fields as stored fields:

We then return these fields when we run this sort of search, and it lets the client display some number of them beneath each result. So we can match the PDF as its own document; this is basically a media post search that includes the content of the PDF. But we cannot match a post/page that has a PDF attached to it.
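One possible client-side workaround for that last limitation, sketched in Python: fold each PDF attachment hit into the post it is attached to before rendering results. This is only an illustration, not how VIP or Jetpack Search behaves. The hit keys (`post_id`, `post_parent`, `score`) and the function name `group_hits_by_parent` are hypothetical; `post_parent` is assumed to carry the WordPress parent ID, with 0 meaning unattached.

```python
from collections import OrderedDict


def group_hits_by_parent(hits):
    """Fold PDF attachment hits into the post they are attached to.

    Each hit is a dict with hypothetical keys:
      post_id     - ID of the matched document
      post_parent - ID of the parent post (0 if unattached)
      score       - relevance score from the search backend
    """
    grouped = OrderedDict()
    # Visit the best hits first so each group keeps its highest score.
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        # An attached PDF is reported under its parent; anything else stands alone.
        key = hit.get("post_parent") or hit["post_id"]
        entry = grouped.setdefault(
            key, {"post_id": key, "score": hit["score"], "matched_pdfs": []}
        )
        if key != hit["post_id"]:
            entry["matched_pdfs"].append(hit["post_id"])
    return list(grouped.values())
```

A theme could then link to the parent post and list the matching PDFs beneath it, at the cost of the search backend still ranking the PDF and its parent separately.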

Integration with themes may get tricky because of this. VIPs seem not to have had too many issues, but not many have used it yet.
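For open question 1, a minimal Python sketch of two guards an on-demand fetcher could apply: a per-host throttle and a hard cap on bytes read. Everything here is an assumption for illustration (the names `HostThrottle` and `read_capped`, the 5-second interval); only the 10MB figure comes from the discussion above, and the right value is still open.

```python
import time
from urllib.parse import urlparse

# The 10MB limit mentioned above; the right value is an open question.
MAX_PDF_BYTES = 10 * 1024 * 1024


class HostThrottle:
    """Allow at most one fetch per host every `min_interval` seconds."""

    def __init__(self, min_interval=5.0, clock=time.monotonic):
        self.min_interval = min_interval  # arbitrary illustrative default
        self.clock = clock
        self._last_fetch = {}

    def allow(self, url):
        host = urlparse(url).netloc
        now = self.clock()
        last = self._last_fetch.get(host)
        if last is not None and now - last < self.min_interval:
            return False  # too soon for this host: defer the fetch
        self._last_fetch[host] = now
        return True


def read_capped(chunks, max_bytes=MAX_PDF_BYTES):
    """Join streamed chunks, returning None as soon as the cap is exceeded."""
    received = bytearray()
    for chunk in chunks:
        received.extend(chunk)
        if len(received) > max_bytes:
            return None  # too large: skip indexing instead of reading the rest
    return bytes(received)
```

`read_capped` takes any iterable of byte chunks, so it can wrap a streamed HTTP response body; stopping mid-stream avoids trusting a `Content-Length` header that an external site may omit or misreport.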

stale[bot] commented 6 years ago

This issue has been marked as stale. This happened because:

No further action is needed. But it's worth checking if this ticket has clear reproduction steps and it is still reproducible. Feel free to close this issue if you think it's not valid anymore — if you do, please add a brief explanation.

RCowles commented 5 years ago

Requested in 1874536-zen

cena commented 1 year ago

6203208-zen

I have Jetpack Search and JoomUnited WP File Download. I managed to make the JoomUnited results show up in the WordPress search, but it only searches the file titles.

I thought that with Jetpack it would also be possible to search inside the files. (If I use the JoomUnited search engine I manage to search inside the files, but it won't search articles and other post types.)

Is there something I can do to have one universal search engine?

JoomUnited WP File Download advertises:

  - Full-text search for documents with automatic index
  - Category file filtering
  - Tag filtering as checkbox or predictive search box
  - Date of creation & Update range filter
  - [Document preview](https://www.joomunited.com/wordpress-products/wp-file-download/wordpress-file-manager-document-preview) in search results
  - File ordering in search results on column title click
  - Compatible with WordPress native search engine

github-actions[bot] commented 1 year ago

Support References

This comment is automatically generated. Please do not edit it.

lizthefair commented 1 year ago

6664802-zen is another request for this feature.

gibrown commented 3 months ago

This may be a better approach: https://huggingface.co/blog/manu/colpali and https://blog.vespa.ai/retrieval-with-vision-language-models-colpali/
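
For readers skimming the links: as I understand those posts, ColPali embeds each page image as many patch vectors and ranks pages with ColBERT-style late interaction (for each query token vector, take its best-matching patch, then sum). A toy Python sketch of just that scoring step follows. The function names and the tiny hand-written vectors are made up for illustration; real embeddings would come from the model.

```python
def late_interaction_score(query_vecs, page_vecs):
    """Sum, over query token vectors, of the best dot product with any page patch vector."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return page ids by descending score; `pages` maps id -> list of patch vectors."""
    return sorted(
        pages,
        key=lambda pid: late_interaction_score(query_vecs, pages[pid]),
        reverse=True,
    )
```

The appeal for this issue is that pages are indexed as images, so no PDF text extraction step is needed; whether that fits the existing index is a separate question.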