etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License
1.2k stars, 129 forks

Question: What about the ability to index files on the file system? #121

philippeowagner closed 7 years ago

philippeowagner commented 9 years ago

We generate a lot of PDFs (orders, reports, ...) and have uploaded content that lives on the file system. We have been thinking about getting it into watson's search results page. Is anybody else interested in this use case?

etianen commented 9 years ago

Anything you can extract text from can be used to index a model, so the contents of file fields are a valid candidate. You'd have to write a custom search adapter to do so, and use some sort of PDF reader library, but it should be doable.
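
As a rough illustration of the custom-adapter route, here is a minimal sketch of a mixin that appends the text of one file field to the indexed content. `PlainTextFileMixin`, `file_field_name`, and the `attachment` field are hypothetical names, and the sketch assumes the file is plain UTF-8 text; a PDF would need a third-party reader in place of the decode step. It is meant to be combined with watson's `SearchAdapter`, whose `get_content()` it extends.

```python
class PlainTextFileMixin:
    """Hypothetical mixin: adds the text of one file field to the search content.

    Intended (as an assumption) to be used like:
        class OrderSearchAdapter(PlainTextFileMixin, watson.SearchAdapter):
            file_field_name = "invoice"
        watson.register(Order, OrderSearchAdapter)
    where Order and invoice are made-up names.
    """

    file_field_name = "attachment"  # assumed name of the model's file field
    encoding = "utf-8"              # assumed encoding of the stored file

    def extract_file_text(self, obj):
        field_file = getattr(obj, self.file_field_name, None)
        if not field_file:
            # No file attached to this object, so there is nothing to index.
            return ""
        field_file.open("rb")
        try:
            raw = field_file.read()
        finally:
            field_file.close()
        # Replace undecodable bytes instead of failing the whole index update.
        return raw.decode(self.encoding, errors="replace")

    def get_content(self, obj):
        # Keep whatever the base adapter already indexes, then add the file text.
        base = super().get_content(obj)
        extra = self.extract_file_text(obj)
        return f"{base} {extra}".strip() if extra else base
```

Reading the whole file on every save may be slow on remote storage backends, so this is probably only reasonable for small files.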

I'm not sure that Watson wants to include specific file parsing libraries by default. A sensible approach might be a plugin-based system whereby Watson provides the functionality to add support for indexing different file types. It could ship with txt file parsing, but more complex file types would need a third party dependency.

philippeowagner commented 9 years ago

Yep, it's absolutely doable from a technical point of view. I like the idea of a plugin-based approach... @etianen, do you have a suggestion for the implementation (where to hook in, ...)?

etianen commented 9 years ago

I think that the way to specify a file field to index would be to include it in the "fields" argument to watson.register().

If the fields argument is not given, then the existing behaviour of indexing all text and char fields would still apply. Indexing files would therefore be an explicit opt-in, since it could cause performance problems on some remote storage backends.

Currently, the consequence of specifying a file field for indexing is undefined, and probably crashes, so I don't think this would introduce backwards compatibility issues.

When a file field is specified, the search adapter would guess the type of the file based on its extension. It would then check a global plugin repository for a handler registered for that extension. If a handler is present, it would be called as a function with two arguments: the file extension and the file as a Django file object. It should return a Unicode string of indexable text.

The API for registering a new plugin would be:

watson.register_file_handler(extension, handler_func)

There would also be an unregister function, for completeness.
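
To make the proposal concrete, the registry could look roughly like the sketch below. This is only an illustration of the proposed design, not existing watson code: `register_file_handler` and `unregister_file_handler` follow the API proposed above, while `extract_text`, `_file_handlers`, and `handle_txt` are hypothetical names. Raising on a duplicate registration is also an assumption, made so that two plugins cannot silently claim the same extension.

```python
import os

# Global plugin repository: normalized extension -> handler function.
_file_handlers = {}


def _normalize(extension):
    # Treat ".TXT", "txt" and ".txt" as the same extension.
    return extension.lower().lstrip(".")


def register_file_handler(extension, handler_func):
    ext = _normalize(extension)
    if ext in _file_handlers:
        raise ValueError(f"A handler is already registered for {ext!r}")
    _file_handlers[ext] = handler_func


def unregister_file_handler(extension):
    # Removing an extension that was never registered is a no-op here.
    _file_handlers.pop(_normalize(extension), None)


def extract_text(file_obj):
    """Guess the type from the file name and delegate to the matching handler."""
    name = getattr(file_obj, "name", "") or ""
    ext = _normalize(os.path.splitext(name)[1])
    handler = _file_handlers.get(ext)
    if handler is None:
        # Unknown file type: contribute nothing to the index.
        return ""
    return handler(ext, file_obj)


def handle_txt(extension, file_obj):
    # Plain text needs no third-party dependency; UTF-8 is assumed.
    data = file_obj.read()
    if isinstance(data, bytes):
        data = data.decode("utf-8", errors="replace")
    return data


register_file_handler("txt", handle_txt)
```

A PDF plugin would then call `register_file_handler("pdf", ...)` with its own handler and bring its own reader dependency.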

Plugin apps would be included in INSTALLED_APPS, and use the Django app loading mechanism to register their extensions. A plugin app would be expected to handle a single extension or a small group of related extensions, and to avoid bundling unrelated external dependencies.

Sound correct? Any thoughts?