NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Refactor corpus classes #136

Closed (osma closed 6 years ago)

osma commented 6 years ago

Right now the annif.corpus classes are a bit of a mess. They are trying to support different kinds of corpora:

Some conversions / views are supported: for example, SubjectDirectory.from_documents converts a DocumentFile into a SubjectDirectory, and FastTextBackend knows how to convert a SubjectDirectory into the fastText train file format, which resembles DocumentFile. But this makes for messy code in AnnifProject.

Instead the classes should be seen as interfaces which provide access to the data in a uniform way regardless of the underlying storage. There could be a common abstract base class AnnifCorpus which provides methods/properties such as subjects (iterate through the available subjects as Subject objects, which are actually named tuples) and documents (iterate through the available documents, which could also be named tuples with text and subjects). These methods could perform the conversion behind the scenes, using temporary files or directories when necessary.
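A rough sketch of what such an interface might look like. The named tuple fields, the DocumentListCorpus class and the grouping logic are assumptions made up for illustration, not existing Annif code; only the AnnifCorpus name and the subjects / documents properties come from the proposal above.

```python
from abc import ABC, abstractmethod
from collections import defaultdict, namedtuple

# Field names are assumptions for illustration only.
Subject = namedtuple("Subject", ["uri", "text"])
Document = namedtuple("Document", ["text", "subjects"])


class AnnifCorpus(ABC):
    """Uniform read access to a corpus, whatever the underlying storage."""

    @property
    @abstractmethod
    def subjects(self):
        """Iterate through the available subjects as Subject tuples."""

    @property
    @abstractmethod
    def documents(self):
        """Iterate through the available documents as Document tuples."""


class DocumentListCorpus(AnnifCorpus):
    """Hypothetical document-oriented corpus held in memory."""

    def __init__(self, docs):
        self._docs = list(docs)

    @property
    def documents(self):
        return iter(self._docs)

    @property
    def subjects(self):
        # Conversion behind the scenes: collect the texts of all documents
        # tagged with each subject URI into one Subject per URI.
        texts = defaultdict(list)
        for doc in self._docs:
            for uri in doc.subjects:
                texts[uri].append(doc.text)
        for uri in sorted(texts):
            yield Subject(uri=uri, text="\n".join(texts[uri]))
```

With something like this, a backend could ask any corpus for corpus.subjects or corpus.documents without knowing which storage format it was loaded from. A file-backed implementation could do the same conversion through temporary files or directories instead of in memory.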

The index functionality (lookup by ID or URI) of SubjectIndex should be separated from the loading/saving part, which could be implemented by a class called SubjectFileTSV.
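The split could be sketched roughly like this. The TSV layout (one URI<TAB>label pair per line), the method names and the constructor arguments are assumptions for illustration; only the class names SubjectIndex and SubjectFileTSV come from the proposal.

```python
import os
import tempfile


class SubjectIndex:
    """Lookup only: maps between integer subject IDs and URIs."""

    def __init__(self, subjects):
        # subjects: any iterable of (uri, label) pairs
        self._uris = []
        self._labels = []
        self._uri_idx = {}
        for uri, label in subjects:
            self._uri_idx[uri] = len(self._uris)
            self._uris.append(uri)
            self._labels.append(label)

    def __len__(self):
        return len(self._uris)

    def __getitem__(self, subject_id):
        """Look up a subject by integer ID, returning (uri, label)."""
        return (self._uris[subject_id], self._labels[subject_id])

    def by_uri(self, uri):
        """Return the integer ID for a URI, or None if it is unknown."""
        return self._uri_idx.get(uri)


class SubjectFileTSV:
    """Loading/saving only: subjects stored as URI<TAB>label lines (assumed layout)."""

    def __init__(self, path):
        self.path = path

    @property
    def subjects(self):
        with open(self.path, encoding="utf-8") as tsvfile:
            for line in tsvfile:
                uri, label = line.rstrip("\n").split("\t", 1)
                yield uri, label

    def save(self, subjects):
        with open(self.path, "w", encoding="utf-8") as tsvfile:
            for uri, label in subjects:
                print(uri, label, sep="\t", file=tsvfile)


# Usage: save to a temporary TSV file, load it back and build an index.
with tempfile.TemporaryDirectory() as tmpdir:
    tsv = SubjectFileTSV(os.path.join(tmpdir, "subjects.tsv"))
    tsv.save([("http://example.org/cat", "cat"),
              ("http://example.org/dog", "dog")])
    index = SubjectIndex(tsv.subjects)
```

Because SubjectIndex only needs an iterable of pairs, it would not have to know anything about files, and other storage formats could feed the same index.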

osma commented 6 years ago

Next steps:

osma commented 6 years ago

This is good enough for now. Made new issues for the remaining follow-up tasks.