Right now the annif.corpus classes are a bit of a mess. They are trying to support different kinds of corpora:
SubjectIndex: subjects only as a TSV file + lookup and save functionality
SubjectIndexSKOS: subjects only as a SKOS file
SubjectDirectory: subjects with texts as a directory of TXT files
DocumentFile: documents with subjects as a TSV file
DocumentDirectory: documents with subjects as a directory of TXT + TSV files
Some conversions / views are supported. For example, SubjectDirectory.from_documents converts a DocumentFile into a SubjectDirectory, and FastTextBackend knows how to convert a SubjectDirectory into the fastText training file format, which resembles DocumentFile. But this makes for messy code in AnnifProject.
Instead the classes should be seen as interfaces which provide access to the data in a uniform way regardless of the underlying storage. There could be a common abstract base class AnnifCorpus which provides methods/properties such as subjects (iterate through the available subjects as Subject objects, which are actually named tuples) and documents (iterate through the available documents, which could also be named tuples with text and subjects). These methods could perform the conversion behind the scenes, using temporary files or directories when necessary.
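A rough sketch of what such a base class might look like. The field names of the Subject and Document tuples and the DocumentList class are assumptions made up for illustration, not existing code; only the AnnifCorpus name and the subjects/documents properties come from the proposal above.

```python
import abc
import collections

# Hypothetical named tuples; the field names are assumptions.
Subject = collections.namedtuple('Subject', 'uri label text')
Document = collections.namedtuple('Document', 'text uris')


class AnnifCorpus(abc.ABC):
    """Uniform read access to a corpus, regardless of underlying storage."""

    @property
    @abc.abstractmethod
    def subjects(self):
        """Iterate through the available subjects as Subject tuples."""

    @property
    @abc.abstractmethod
    def documents(self):
        """Iterate through the available documents as Document tuples."""


class DocumentList(AnnifCorpus):
    """Hypothetical in-memory corpus that stores documents only."""

    def __init__(self, documents):
        self._documents = list(documents)

    @property
    def documents(self):
        yield from self._documents

    @property
    def subjects(self):
        # Conversion behind the scenes: collect the text of every
        # document under each of the subject URIs assigned to it.
        texts = collections.defaultdict(list)
        for doc in self._documents:
            for uri in doc.uris:
                texts[uri].append(doc.text)
        for uri, parts in texts.items():
            # Labels are not known from documents alone, so left as None.
            yield Subject(uri=uri, label=None, text="\n".join(parts))
```

With something like this, a backend could ask any corpus for corpus.subjects or corpus.documents without caring which one is stored natively. A file-backed implementation could do the same conversion using temporary files or directories instead of memory.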
The index functionality (lookup by ID or URI) of SubjectIndex should be separated from the loading/saving part, which could be implemented by a class called SubjectFileTSV.
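The split could look roughly like the following. This is only a sketch: the method names (load, save, by_uri) and the assumed TSV layout of one `<uri>`, tab, label pair per line are illustrative guesses, not the current implementation.

```python
class SubjectFileTSV:
    """Hypothetical loader/saver; assumes one '<uri>\\tlabel' pair per line."""

    def __init__(self, path):
        self.path = path

    def load(self):
        """Yield (uri, label) pairs from the TSV file."""
        with open(self.path, encoding='utf-8') as tsvfile:
            for line in tsvfile:
                line = line.rstrip('\n')
                if not line:
                    continue  # skip blank lines
                uri, label = line.split('\t', 1)
                yield uri.strip('<>'), label

    def save(self, pairs):
        """Write (uri, label) pairs to the TSV file."""
        with open(self.path, 'w', encoding='utf-8') as tsvfile:
            for uri, label in pairs:
                tsvfile.write('<{}>\t{}\n'.format(uri, label))


class SubjectIndex:
    """In-memory index only: lookup by integer ID or by URI, no file I/O."""

    def __init__(self, pairs):
        self._uris = []
        self._labels = []
        self._uri_idx = {}
        for uri, label in pairs:
            self._uri_idx[uri] = len(self._uris)
            self._uris.append(uri)
            self._labels.append(label)

    def __len__(self):
        return len(self._uris)

    def __getitem__(self, subject_id):
        """Return the (uri, label) pair for an integer subject ID."""
        return (self._uris[subject_id], self._labels[subject_id])

    def by_uri(self, uri):
        """Return the subject ID for a URI, or None if it is unknown."""
        return self._uri_idx.get(uri)
```

The index then accepts any iterable of (uri, label) pairs, for example SubjectIndex(SubjectFileTSV(path).load()), so a SKOS-based loader could feed the same index without SubjectIndex knowing about file formats.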