Basic functionality to maintain and load corpora.
pip install corpusinterface
A number of corpora are available in the two following config files:
Here are some references with more corpora (some of which are also in the config files from above):
Provide a config file your-corpus.ini
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
and load it using init_config("your-corpus.ini").
Say, you packaged a number of files into a corpus
your-corpus
|- file_1.txt
|- file_2.txt
|- dir_1
   |- file_3.txt
   |- file_4.txt
and let's assume you made it available as a zip archive at http://your-website.com/your-corpus.zip (git repos and tar.gz files are also supported). Without a config file, this corpus can be added and accessed as follows:
from corpusinterface import config, load
# add your corpus
config.add_corpus("Your Corpus",
                  access="zip",
                  url="http://your-website.com/your-corpus.zip")
# load the corpus
corpus = load("Your Corpus", download=True)
# access the data (using a file_reader of your choice)
for file in corpus.data(file_reader=lambda file, **kwargs: f"reading: {file}"):
    print(file)
This will print
reading: ~/corpora/Your Corpus/file_1.txt
reading: ~/corpora/Your Corpus/file_2.txt
reading: ~/corpora/Your Corpus/dir_1/file_3.txt
reading: ~/corpora/Your Corpus/dir_1/file_4.txt
with ~ being replaced by your home directory (paths might be displayed differently, depending on your operating system).
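The lambda in the example only formats the file name. As a minimal sketch of a reader that actually opens each file, assuming the corpus consists of UTF-8 text files, one could write something like the following. The name `read_text` and its `encoding` parameter are hypothetical and not part of corpusinterface.

```python
from pathlib import Path


def read_text(file, encoding="utf-8", **kwargs):
    """Return the content of a single corpus file as a string.

    `file` is the path handed over by the corpus; **kwargs absorbs any
    further parameters that are passed along but not needed here.
    """
    return Path(file).read_text(encoding=encoding)


# Intended use (assumes the corpus from the example above was loaded):
# for text in corpus.data(file_reader=read_text):
#     print(text)
```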
Config files allow you to automate the procedure of adding a corpus and are a convenient way to provide more detailed information, in particular for other people who want to use your corpus.
Instead of specifying the necessary information from within Python, you can also put it in a config file:
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
If you put this file at the default location ~/corpora/corpora.ini in your home directory or a file corpora.ini in the current working directory, it is automatically loaded by init_config on package import. Otherwise, you can load any config file by either calling reset_config
config.reset_config("your-config-file.ini")
which clears the config and reinitialises it, adding your-config-file.ini (see init_config for more fine-grained control) or by loading it separately
config.load_config("your-config-file.ini")
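The config files use an INI-style layout with one section per corpus. To illustrate how such a file is structured, the same text can be parsed with Python's standard `configparser` module. This is only a sketch of the format; whether corpusinterface reads its files exactly this way is an assumption.

```python
from configparser import ConfigParser

CONFIG_TEXT = """
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
"""

parser = ConfigParser()
parser.read_string(CONFIG_TEXT)

# each section name is a corpus name, each key a parameter of that corpus
for name in parser.sections():
    print(name, dict(parser[name]))
```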
A default config file is shipped with the corpusinterface package and automatically loaded by init_config. It defines some useful defaults that are used for newly added corpora if no corpus-specific values are specified. You can see all the config information associated with your corpus by printing a summary:
print(config.summary(corpus="Your Corpus"))
[Your Corpus]
access: zip
url: http://your-website.com/your-corpus.zip
info: None
root: ~/corpora
path: ~/corpora/Your Corpus
parent: None
loader: FileCorpus
In particular, the default root directory ~/corpora was added and the corpus is stored in a path that is a subdirectory ~/corpora/Your Corpus according to its name (more on root and path below). Moreover, by default the corpus is assumed to be a FileCorpus consisting of a simple collection of files.
The parameters root, path, parent, download, loader, access, and url are special and their values are treated in a particular way.
root
Root directory to store the corpus in. This should be an absolute path; ~ is expanded to the user home. If a relative path is specified, a warning is issued and it is interpreted relative to the current working directory. If parent is non-empty, the value of root is ignored and instead the parent's path is used. A call to config.get(Name, 'root') returns the effective value.
path
Directory to store the corpus in. This can be an absolute path (~ is expanded to the user home), in which case root is ignored; a relative path, in which case it is appended to root; or empty, in which case the corpus name [Name] is appended to root. A call to config.get(Name, 'path') returns the effective value. Note that for sub-corpora (with non-empty parent) the parent's path is used instead of root.
parent
A parent corpus name or empty. If non-empty, the parent corpus should be defined separately and the value of root is ignored and replaced by the parent's path.
Initialisation (e.g. downloading from url with access method) is delegated to the parent corpus when loading a sub-corpus.
download, loader, access, url
See the section on [loading a corpus](#loading-a-corpus).
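The interplay of root, path, and parent can be sketched in plain Python. This is a hedged illustration of the rules described above, not the library's implementation: the function names are hypothetical, and treating a relative path as relative to root is an assumption of this sketch.

```python
import os
import warnings


def effective_root(root, parent_path=None):
    """Resolve the root directory following the rules described above."""
    if parent_path:
        # sub-corpus: the parent's path replaces root
        return parent_path
    root = os.path.expanduser(root)
    if not os.path.isabs(root):
        warnings.warn(f"relative root '{root}' is resolved against the working directory")
        root = os.path.abspath(root)
    return root


def effective_path(name, root, path=None, parent_path=None):
    """Resolve the directory a corpus is stored in."""
    base = effective_root(root, parent_path)
    if not path:
        # no path given: use the corpus name below root
        return os.path.join(base, name)
    path = os.path.expanduser(path)
    # os.path.join discards `base` when `path` is absolute (root is ignored)
    return os.path.join(base, path)
```

For the authoritative values, a call to config.get(Name, 'path') remains the reference; the sketch is only meant to make the precedence easier to follow.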
You can specify additional parameters that are handed over to the loader and (in case of the FileCorpus loader) further passed on to your file_reader function. For instance, you could specify
prefix: my prefix
in the config file or equivalently
config.add_corpus("Your Corpus",
                  ...,
                  prefix="my prefix")
from within Python. Your file reader can then make use of this parameter (provided as a keyword argument, so you have to refer to it by the correct name)
file_reader=lambda file, prefix, **kwargs: f"{prefix}: {file}"
my prefix: ~/corpora/Your Corpus/file_1.txt
...
This is also the reason why we always need **kwargs in a reader function to accept all keyword arguments that are provided, even if we decide not to use them.
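The effect of omitting **kwargs can be reproduced without the library. In the sketch below, `params` is a hypothetical stand-in for the keyword arguments a reader might receive; the two reader functions are illustrative only.

```python
def strict_reader(file, prefix):
    # no **kwargs: any extra keyword argument raises a TypeError
    return f"{prefix}: {file}"


def tolerant_reader(file, prefix, **kwargs):
    # extra keyword arguments are accepted and ignored
    return f"{prefix}: {file}"


# parameters as they might arrive from the config (hypothetical values)
params = {"prefix": "my prefix", "access": "zip"}

print(tolerant_reader("file_1.txt", **params))

try:
    strict_reader("file_1.txt", **params)
except TypeError as error:
    print("strict_reader failed:", error)
```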
The config values can be dynamically overwritten in the load function
corpus = load("Your Corpus",
              ...,
              prefix="other prefix")
other prefix: ~/corpora/Your Corpus/file_1.txt
...
or in the data function:
for file in corpus.data(..., prefix="still different"):
    ...
still different: ~/corpora/Your Corpus/file_1.txt
...
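The resulting precedence (config file, then load, then data) behaves like successive dictionary updates. The following is a minimal sketch of that idea, assuming nothing about how corpusinterface implements it internally; `merge_parameters` is a hypothetical helper.

```python
def merge_parameters(config_params, load_kwargs=None, data_kwargs=None):
    """Later sources overwrite earlier ones: config < load() < data()."""
    merged = dict(config_params)
    merged.update(load_kwargs or {})
    merged.update(data_kwargs or {})
    return merged


config_params = {"access": "zip", "prefix": "my prefix"}

print(merge_parameters(config_params)["prefix"])
print(merge_parameters(config_params, {"prefix": "other prefix"})["prefix"])
print(merge_parameters(config_params,
                       {"prefix": "other prefix"},
                       {"prefix": "still different"})["prefix"])
```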
You have full control over how the config is (re)initialised. A call to config.init_config() or config.reset_config() without any arguments will load the default config, look for corpora.ini in ~/corpora and the current working directory, and load them, too, if present. This is equivalent to calling
config.init_config(default=None, home=None, local=None)
or
config.reset_config(default=None, home=None, local=None)
respectively. For each of these parameters you may alternatively specify a value of True (meaning that you expect the respective config file to be loaded and otherwise an error is raised), or False (meaning that the respective config file is not loaded, even if it exists). You may also specify one or more additional files to be loaded
config.init_config("/path/to/file_1.ini", "/path/to/file_2.ini", ...)
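The three possible values of default, home, and local (None, True, False) can be summarised as a small decision function. This is a sketch of the described semantics only; `should_load` is a hypothetical name, and the use of FileNotFoundError is an assumption, since the text only states that an error is raised.

```python
import os


def should_load(file, flag=None):
    """Decide whether an optional config file is loaded.

    None:  load the file if it exists, skip it silently otherwise
    True:  the file must exist, otherwise an error is raised
    False: never load the file
    """
    if flag is False:
        return False
    exists = os.path.isfile(file)
    if flag is True and not exists:
        # error type is an assumption of this sketch
        raise FileNotFoundError(f"expected config file '{file}' not found")
    return exists
```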
Corpora are loaded with the load function
from corpusinterface import load
# load the corpus
corpus = load("Your Corpus", download=True)
Specifying download=True indicates that the corpus should be downloaded if it cannot be found on disk. The load function looks up the given corpus in the config, retrieving any parameters (including default parameters) specified there. If you provide additional keyword arguments, these will overwrite parameters from the config with the same name. So you could, for instance, specify a different URL for downloading
corpus = load("Your Corpus", url="some-other-url.com/corpus.zip", download=True)
or a custom path for looking for the corpus on disk and/or downloading it to
corpus = load("Your Corpus", path="/my/custom/path/for/corpus/", download=True)
Four parameters are processed by the load function itself (download, access, url, loader). download and url play the obvious role described above.
access specifies how the content should be accessed and, together with url, is handled by the download function (called by load if download=True is specified). access can be a string ("git", "zip", or "tar.gz"), resulting in the corpus being downloaded and unpacked accordingly. It can also be a callable provided as a keyword argument to load. In that case the corpus path is created on disk and the provided method is called with the corpus name and all keyword arguments, including any parameters specified in the config.
The loader parameter is handled in a special way. If it is a callable, the load function will ensure the corpus exists (potentially downloading it) and then call the specified method with all provided keyword arguments, including any parameters specified in the config. This means that you can simply specify any custom loader function you would like to use
corpus = load("Your Corpus", loader=my_special_loader_function)
If loader is a string, load tries to look up the appropriate function in the loaders dictionary. So you can also add it there and request it by providing the corresponding string in the load function
from corpusinterface import load, loaders
loaders["my custom loader"] = my_special_loader_function
corpus = load("Your Corpus", loader="my custom loader")
The advantage of this approach is that you can specify it in a config file so you don't need to pass it to load each time
loader: my custom loader
Adding the loader function can also be automated. For instance, if you have a special corpus type that you provide in a separate Python module, you can simply add the loader function there
from corpusinterface import loaders
class MySpecialCorpus:
    ...
loaders["my custom loader"] = MySpecialCorpus
Given your custom config file, your corpus can then be loaded simply as
corpus = load("Your Corpus")
without having to specify anything manually. Note that any loader function is provided with all keyword arguments, so it might be a good idea to use **kwargs to handle any unforeseen additional parameters, even if they are not used.
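The lookup by name can be sketched without the package installed. Here `registry` is a plain dictionary standing in for corpusinterface's loaders, and `LineCountCorpus` and `load_from` are hypothetical names used only for illustration.

```python
class LineCountCorpus:
    """Hypothetical corpus type that counts lines in a single text file."""

    def __init__(self, path, **kwargs):
        # **kwargs absorbs parameters such as access or url that are not needed
        self.path = path
        self.extra = kwargs

    def n_lines(self):
        with open(self.path, encoding="utf-8") as f:
            return sum(1 for _ in f)


# stand-in for corpusinterface.loaders, so the sketch runs on its own
registry = {"line count": LineCountCorpus}


def load_from(registry, loader, **params):
    """Look up a loader by name (or use a callable directly) and call it."""
    factory = registry[loader] if isinstance(loader, str) else loader
    return factory(**params)
```

Because the class accepts **kwargs, it can be called with every parameter from the config without failing on the ones it does not use.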
The default corpus type is defined by the FileCorpus class. In a config file, it is specified by
loader: FileCorpus
which is the default if this parameter is not explicitly specified for a corpus. When calling load, the keyword argument loader="FileCorpus" is looked up in loaders and the actual FileCorpus constructor is called. In fact, the static FileCorpus.init method is called to check for the mandatory path argument and provide an interpretable error message if it is missing. The FileCorpus class expects to find a collection of files at path and makes them available via the files and data methods. Additionally, it accepts four more parameters:
file_regex: a regular expression for file names; if provided, files whose name does not match are ignored
path_regex: a regular expression for paths; if provided, paths (including the file name) that do not match are ignored
file_exclude_regex: like file_regex, but matches are ignored
path_exclude_regex: like path_regex, but matches are ignored
All additional keyword arguments are stored and passed on to calls of data and metadata.
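How the four filters combine can be sketched with the standard library. This is not FileCorpus's implementation: `list_files` is a hypothetical function, and the use of `re.search` (rather than an anchored match) is an assumption, since the matching mode is not specified here.

```python
import os
import re


def list_files(root, file_regex=None, path_regex=None,
               file_exclude_regex=None, path_exclude_regex=None):
    """Yield absolute paths below `root` that pass all given filters."""
    for directory, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.abspath(os.path.join(directory, name))
            # inclusion filters: skip anything that does not match
            if file_regex and not re.search(file_regex, name):
                continue
            if path_regex and not re.search(path_regex, path):
                continue
            # exclusion filters: skip anything that does match
            if file_exclude_regex and re.search(file_exclude_regex, name):
                continue
            if path_exclude_regex and re.search(path_exclude_regex, path):
                continue
            yield path
```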
files
The files function returns an iterator over files (after applying the *_regex expressions, if provided). It returns their absolute paths.
data
The data function iterates over files and attempts to read them. If a file_reader function is provided as keyword argument upon initialisation or directly to data, it is called with the full path of the respective file as first argument and all keyword arguments. Otherwise (or if file_reader=None) data returns the absolute paths just like files.
metadata
The metadata function looks for metadata in the path location of the corpus. If a meta_reader function is provided as keyword argument upon initialisation or directly to metadata, it is called with the full path of the corpus as first argument and all keyword arguments. Otherwise (or if meta_reader=None) the full path is returned.
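A meta_reader could, for example, read a JSON file stored in the corpus directory. The sketch below assumes such a file exists; the function name, the `meta_file` parameter, and the file name metadata.json are all hypothetical.

```python
import json
import os


def read_json_metadata(path, meta_file="metadata.json", **kwargs):
    """Load metadata from a JSON file inside the corpus directory.

    `path` is the corpus directory; `meta_file` is a hypothetical
    parameter naming the metadata file within it.
    """
    with open(os.path.join(path, meta_file), encoding="utf-8") as f:
        return json.load(f)


# Possible use (assumes a loaded corpus with such a file):
# info = corpus.metadata(meta_reader=read_json_metadata)
```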