GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

Populate corpus dialog should allow to specify a pathname pattern #113

Open johann-petrak opened 4 years ago

johann-petrak commented 4 years ago

A corpus is often available as a large number of documents in a directory or directory tree, where the filename and/or the path in the tree conveys important information for filtering or selecting documents, e.g. the filename may contain a year, a topic, a classification label etc.

It would then be extremely useful if we could specify a regexp to match the path names to import, where pathnames would maybe best be represented as URLs (so that subdirectory separators would always be slashes, even on Windows, and not backslashes which are very clumsy to use in regexps).
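To make the request concrete, here is a minimal sketch of the selection step in plain Java, using only the JDK and not the GATE API. The class name `PathPatternFilter` and its methods are hypothetical, and the sketch matches paths relative to the corpus root rather than full URLs. That is one possible reading of the proposal, not a description of how GATE would implement it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of GATE: selects files by a regexp
// applied to their slash-separated path relative to a root directory.
public class PathPatternFilter {

    // Join the name elements of the relative path with '/', so the same
    // regexp works on Windows and Linux without escaping backslashes.
    public static String toSlashPath(Path root, Path file) {
        StringBuilder sb = new StringBuilder();
        for (Path part : root.relativize(file)) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(part.toString());
        }
        return sb.toString();
    }

    // Walk the tree under root and keep the regular files whose
    // slash-separated relative path matches the pattern in full.
    public static List<Path> select(Path root, Pattern pattern) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> pattern.matcher(toSlashPath(root, p)).matches())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Usage: java PathPatternFilter <rootDir> <regexp>
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);
        Pattern pattern = Pattern.compile(args[1]);
        for (Path p : select(root, pattern)) {
            // Print each hit as a file: URL, the form a corpus loader could consume.
            System.out.println(p.toUri());
        }
    }
}
```

With a tree laid out as `<year>/<topic>/<file>`, a pattern such as `2020/.*\.txt` would select only the `.txt` files below the `2020` directory. The pattern has to match the whole relative path, which keeps it unambiguous but means a leading `.*` is needed to match at any depth.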

greenwoodma commented 4 years ago

Sounds like the kind of thing best done via the Groovy console, so you can do any matching you want and use the information in any way you want, e.g. naming the documents or the corpus from elements of the path. In fact I've done this a couple of times in previous projects, to match things like language codes in filenames and then set a document parameter.
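As a rough illustration of using the path information rather than only filtering on it, the sketch below pulls a year, a topic and a language code out of a slash-separated path with named capture groups. It is plain Java, not the Groovy script referred to above, and it does not touch the GATE API. The class `PathFeatures` and the directory layout `<year>/<topic>/<name>.<lang>.txt` are assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GATE: turns parts of a document's
// path into key/value pairs that could later become document features.
public class PathFeatures {

    // Assumed layout, for illustration only: <year>/<topic>/<name>.<lang>.txt
    static final Pattern LAYOUT = Pattern.compile(
        "(?<year>\\d{4})/(?<topic>[^/]+)/[^/]+\\.(?<lang>[a-z]{2})\\.txt");

    // Return the extracted values, or an empty map if the path does not
    // follow the assumed layout.
    public static Map<String, String> extract(String slashPath) {
        Map<String, String> features = new LinkedHashMap<>();
        Matcher m = LAYOUT.matcher(slashPath);
        if (m.matches()) {
            features.put("year", m.group("year"));
            features.put("topic", m.group("topic"));
            features.put("lang", m.group("lang"));
        }
        return features;
    }

    public static void main(String[] args) {
        // prints {year=2020, topic=politics, lang=de}
        System.out.println(extract("2020/politics/speech.de.txt"));
    }
}
```

In a real script the resulting entries would presumably be copied onto each loaded document, for example into its feature map. That step is left out here because it depends on how the documents are created.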

johann-petrak commented 4 years ago

I think the effort to write this in Groovy is a lot bigger than entering a pattern in a dialog, even for people who occasionally use Groovy, and even more so for people who enjoy GATE because they can use a GUI instead of a scripting language. I think it is reasonable to expect that advanced GATE users know regexps, or can easily learn as much as they need for this (especially if we have examples in the manual), but less reasonable to expect that they know not just Groovy but also the API necessary to do this.

greenwoodma commented 4 years ago

I guess my point was that usually you are going to want to do more than select via a regex, in that you are probably going to want to do something with the information as well, and trying to cover all those options in a GUI would be a nightmare.

If you just want to select files based on the folder or file names, surely you could do that easily outside of GATE. I think file managers on both Windows and Linux allow you to select files according to regex patterns, which you could then move into a new folder before loading into GATE. And of course you could easily do it on the command line.

I'm not saying it's a totally silly idea, I just think that in the majority of cases it won't take you far enough, and you'd still need to write a script to deal with setting document features etc. based on the file/path info.