eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Match via regex #204

Open eresturo opened 4 years ago

eresturo commented 4 years ago

Hi, another suggestion: how about defining one (or more) regular expressions to automatically match correspondent / concerning or even tags? That would reduce the manual effort, at least in my workflow. Cheers, eresturo

eikek commented 4 years ago

Thank you! Yes, I know there is a lot of room here. I'm currently thinking about how to add "custom processing" to the mix. I don't want to "hardcode" a specific regex matching, though, but I realize that this is a major use-case.

For now, the least worst idea (still thinking about it…;-)) is to make it possible to add custom scripts that will be executed after the "initial" processing. There you'll have all the power you want (it is not defined yet how exactly this works, but it should be "convenient" to write these scripts). The scripts could be given as a github url (or a url to a zip file, etc.), so it would be possible to share them. They could accept a config file, which would make it possible to create a generic script/tool that adds things based on regex matches. The downside is that people need to make sure themselves that all job-executors have the required programs to run these scripts. But (currently) I think this is an ok thing, because the target audience are "computer people" anyway.
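To make the "generic script plus config file" idea concrete, here is a minimal sketch in Python. Nothing in it is docspell API: the JSON layout, the field names (`correspondent`, `tag`), the example company, and the functions `load_rules` and `apply_rules` are all assumptions made up for illustration. It only shows how a shared script could turn regex rules from a config file into suggested metadata for a document's extracted text.

```python
import json
import re

# Hypothetical config format, invented for this sketch only.
# The raw string keeps the JSON escapes (\\b) intact.
EXAMPLE_CONFIG = r"""
{
  "rules": [
    {"pattern": "(?i)\\bacme insurance\\b", "field": "correspondent", "value": "Acme Insurance"},
    {"pattern": "(?i)\\b(policy|premium)\\b", "field": "tag", "value": "insurance"},
    {"pattern": "(?i)\\binvoice\\b", "field": "tag", "value": "invoice"}
  ]
}
"""


def load_rules(config_text):
    """Parse the JSON config and compile each regex once."""
    config = json.loads(config_text)
    return [
        (re.compile(rule["pattern"]), rule["field"], rule["value"])
        for rule in config["rules"]
    ]


def apply_rules(rules, text):
    """Return {field: sorted values} for every rule whose pattern occurs in text."""
    matches = {}
    for pattern, field, value in rules:
        if pattern.search(text):
            matches.setdefault(field, set()).add(value)
    return {field: sorted(values) for field, values in matches.items()}


if __name__ == "__main__":
    rules = load_rules(EXAMPLE_CONFIG)
    print(apply_rules(rules, "Your ACME Insurance policy renewal"))
```

Keeping the rules in a separate file is what would make such a script shareable: the script stays generic and each user only edits their own patterns.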

There was something similar brought up here (fyi).

This is definitely on the roadmap, but rather to the end of the year (I hope).

eresturo commented 4 years ago

Nice, I like the idea of custom scripts that can be shared. However, this should be optional, as an expert setting for power users. IMHO, matching via custom keywords is general enough that it should be available to everyone without a custom module. Maybe you can take a look at how Paperless matches: https://paperless.readthedocs.io/en/latest/guesswork.html?highlight=regex#how-do-i-set-up-these-matching-algorithms

ministryofsillywalks commented 4 years ago

I would also really miss this feature moving over from paperless. It's so easy to set up tags which respond to some keyword. For example, I like to tag all documents coming from insurance companies, so I set up the tag insurance and have it match the names of the different insurance companies. It's pretty intuitive to use and I would love to see something similar in docspell.
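The keyword workflow described here can be sketched in a few lines of Python. This is only an illustration of "a tag matches if any of its keywords occurs in the text"; it is not code from Paperless or docspell, and the tag names, the placeholder company names, and the function `suggest_tags` are assumptions.

```python
import re

# Hypothetical tag definitions: tag name -> keywords that trigger it.
# The company names are placeholders, not real configuration.
TAG_KEYWORDS = {
    "insurance": ["Acme Insurance", "Globex Assurance"],
    "house": ["mortgage", "property tax"],
}


def suggest_tags(text, tag_keywords=TAG_KEYWORDS):
    """Return every tag that has at least one keyword occurring in text."""
    tags = []
    for tag, keywords in tag_keywords.items():
        for keyword in keywords:
            # Whole-word, case-insensitive match; re.escape keeps the
            # keyword literal even if it contains regex metacharacters.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                tags.append(tag)
                break  # one matching keyword is enough for this tag
    return sorted(tags)


if __name__ == "__main__":
    print(suggest_tags("Globex Assurance: annual statement for your mortgage"))
```

Because each tag is checked independently, a single document can end up with several tags at once, for example both insurance and house.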

eikek commented 4 years ago

Thank you a lot for your input and this example! I myself don't use this pattern, so it helps a lot to understand what is needed. I'm still hesitant to add this as a primary feature. I know a lot of people to whom I cannot sell the "make your own rule" thing. It's too hard and they are not interested in tinkering. It also requires updating all rules if a new company comes along. Docspell in general tries a different approach: all tags that can somehow be derived from the text content should be possible to learn from existing data. So it tries to find the rules itself. The downside is, of course, that you need to tag a few documents until this works well. But your use case could already work, though I've never tried this exact scenario. Currently it can only derive tags from one category, but this will change soon.
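As a rough illustration of "learning tags from existing data" for a single tag category, here is a toy Python sketch. The thread does not say which algorithm docspell uses, so a plain multinomial naive Bayes classifier is swapped in here purely as an example; the class `TagLearner`, its methods, and the training texts are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class TagLearner:
    """Toy multinomial naive Bayes with add-one smoothing for one tag category."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.doc_counts = Counter()              # tag -> number of documents
        self.vocabulary = set()

    def train(self, text, tag):
        """Record one already-tagged document."""
        words = tokenize(text)
        self.word_counts[tag].update(words)
        self.doc_counts[tag] += 1
        self.vocabulary.update(words)

    def suggest(self, text):
        """Return the most likely tag, or None if nothing has been learned yet."""
        if not self.doc_counts:
            return None
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen during training carry no information; skip them.
        words = [w for w in tokenize(text) if w in self.vocabulary]
        best_tag, best_score = None, -math.inf
        for tag, n_docs in self.doc_counts.items():
            counts = self.word_counts[tag]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


if __name__ == "__main__":
    learner = TagLearner()
    learner.train("Your insurance policy premium is due", "insurance")
    learner.train("Invoice for electricity consumption", "utilities")
    print(learner.suggest("Premium notice for your policy"))
```

The trade-off mentioned above shows up directly: with no training documents `suggest` has nothing to offer, and the suggestions only become useful once a few documents per tag have been labelled by hand.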

I do plan to add support for custom processing/rules in the future. There is no ETA, though :/.

ministryofsillywalks commented 4 years ago

In my workflow I use the tags as "folders" if I were to compare this to the physical world. However the nice thing is that documents can live in multiple folders. So for example the insurance documents for the house get the insurance tag and also the house tag. Can a document have more than one tag in docspell? I haven't deployed it yet but I'm hoping to have some time this weekend to tinker with it.

eikek commented 4 years ago

Ah yes, I'm using the same system. I also like this flexibility. Yes, documents can have multiple tags. You can also assign a group name to a tag, to group multiple tags together. Then docspell can learn from existing tagged documents and suggest a tag from a given category for new documents. This only works if the tag can somehow be derived from the text. I hope to provide this for multiple tag categories in the near future.

totti4ever commented 3 years ago

Having this for multiple categories would make a lot of sense! And maybe we can put the whole AI magic tagging process a bit more in focus. I thought (and still think) that regex matching was necessary because I couldn't really see the AI magic happen. Once that happens, I totally believe that additional regex matching is not necessarily needed!