Corpus Loading Features

I had some free time this week and wrote down some features I'm hoping we can include. These are:

A class for handling all forms of scraping. The API for this feature can be an interface that other scrapers can be built on. We can leverage either bs4 or scrapy. I'm thinking something like:

import scrapy

class BaseScrapper(scrapy.Spider):
    def __init__(self, name, urls, **kwargs):
        super(BaseScrapper, self).__init__(name, **kwargs)
        # Keep the URLs so parse_urls() has something to work on
        self.urls = urls

    def parse_urls(self):
        # Do something to the URLs before starting
        pass

    def parse(self, response):
        # Crawling logic
        pass

Then a scraper like the Bibeli scraper can use this class:

class BibeliScrapper(BaseScrapper):
    # Logic goes here
    pass

The major advantage here is reusability: anyone can build their own Yoruba scraper with a minimal amount of work.
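
To make the reuse idea a bit more concrete, here is a minimal sketch of the base-class pattern. It deliberately avoids scrapy and bs4 and uses only the standard library's html.parser so it runs anywhere; a real version would subclass scrapy.Spider as proposed above. The names BaseScraper, ParagraphScraper, and the sample HTML are hypothetical, purely for illustration.

```python
from html.parser import HTMLParser


class BaseScraper:
    """Hypothetical base class: subclasses only implement parse()."""

    def __init__(self, name, urls):
        self.name = name
        self.urls = self.parse_urls(urls)

    def parse_urls(self, urls):
        # Clean up the URLs before starting: strip whitespace, drop duplicates
        seen, cleaned = set(), []
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                cleaned.append(url)
        return cleaned

    def parse(self, html):
        raise NotImplementedError("Subclasses implement the crawling logic")


class _ParagraphParser(HTMLParser):
    """Collects the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


class ParagraphScraper(BaseScraper):
    """Hypothetical scraper that only has to say what to extract."""

    def parse(self, html):
        parser = _ParagraphParser()
        parser.feed(html)
        return [p.strip() for p in parser.paragraphs if p.strip()]


scraper = ParagraphScraper("demo", [" https://example.org/a ", "https://example.org/a"])
verses = scraper.parse("<html><body><p>Ní àtètèkọ́ṣe</p><div>skip</div><p> Ọlọ́run </p></body></html>")
```

The point of the split is that URL handling lives in one place, and a new scraper only overrides parse.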

Corpus class and DirectoryCorpus class (inspired by gensim): This would be a class that can load various formats of Yoruba corpora through a single API. It should support:

Streaming files
Reading various file formats: txt, gzip, csv.
Validating a file format. For example, if a user loads an Owe file, it should be able to validate that the content of the file conforms to that format.
Preprocessing while reading.
Generating random text
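
Streaming, reading more than one file format, and preprocessing while reading can be sketched together as a single generator. This is only an illustration under a few assumptions: stream_lines is a hypothetical helper, the fformat values "txt" and "gzip" are assumed, and NFC normalization from the standard library stands in for the real diacritics normalizer.

```python
import gzip
import unicodedata


def nfc(text):
    # Stand-in preprocessing step: compose base letters and combining marks
    return unicodedata.normalize("NFC", text)


def stream_lines(path, fformat="txt", preprocess=(nfc,)):
    """Yield preprocessed lines one at a time instead of loading the whole file."""
    openers = {"txt": open, "gzip": gzip.open}
    if fformat not in openers:
        raise ValueError("Unsupported file format: {}".format(fformat))
    with openers[fformat](path, "rt", encoding="utf-8") as fobj:
        for line in fobj:
            line = line.rstrip("\n")
            # Apply each preprocessing technique in order
            for technique in preprocess:
                line = technique(line)
            yield line
```

Because it is a generator, nothing is read until the caller iterates, so a large corpus never has to fit in memory.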

A commit for this is available here. The interface is described below:


class Corpus(interfaces.CorpusABC):
    def __init__(self, path=None, text=None, stream=False, fformat='txt', cformat=None, labels=False, preprocess=None):
        """

        Args:
            path:
            text:
        """
        self.path = path
        self.text = text
        self.labels = labels
        self.stream = stream
        self.fformat = fformat
        self.cformat = cformat
        self.preprocess = preprocess
        if not self.preprocess:
            self.preprocess = [normalize_diacritics_text]
        self.data = self.read_file_filename_or_text(text=text) if text else self.read_file_filename_or_text()
        self.validate_format()

    def __iter__(self):
        for line in self.data:
            yield line

    def __len__(self):
        return len(self.data)

    @staticmethod
    def save_corpus(fname, corpus, id2word=None, metadata=False):
        pass

    def streamfile(self, fobj):
        pass

    def read_file_filename_or_text(self, f=None, text=None):
        """

        Returns:

        """
        pass

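    def handle_preprocessing(self, text):
        # A single callable is applied directly
        if callable(self.preprocess):
            return self.preprocess(text)
        # A list of callables is applied in order
        if isinstance(self.preprocess, list):
            for technique in self.preprocess:
                text = technique(text)
        # Return the text unchanged if preprocess is neither, instead of None
        return text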

    def validate_format(self):
        """

        Returns:

        """

    def generate(self, size):
        """

        Args:
            size:

        Returns:

        """
        if not self.cformat:
            raise ValueError("You need to specify a format for generating random text")

class DirectoryCorpus(Corpus):
    def __init__(self, path, **kwargs):
        self.path_dir = path
        # walk presumably comes from gensim.utils, which yields (depth, dirpath, dirnames, filenames)
        walked = list(walk(self.path_dir))
        self.depth = walked[0][0]
        self.dirnames = walked[0][2]
        self.flist = walked[0][3]
        self.path = list(self.read_files())
        super(DirectoryCorpus, self).__init__(path=self.path, **kwargs)

    def read_files(self):
        for path in self.flist:
            yield os.path.join(self.path_dir, path)
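
For anyone reading this without gensim at hand, the directory step can be approximated with the standard library. This is a sketch under the assumption that only the top level of the directory is read, as the walked[0] indexing suggests. list_corpus_files is a hypothetical name, and note that os.walk yields 3-tuples (dirpath, dirnames, filenames), not the 4-tuples indexed above.

```python
import os


def list_corpus_files(path_dir):
    """Return full paths of the files directly inside path_dir, sorted by name."""
    # next() takes only the first os.walk entry, i.e. the top-level directory;
    # the default covers a missing directory, for which os.walk yields nothing
    dirpath, _dirnames, filenames = next(os.walk(path_dir), (path_dir, [], []))
    return [os.path.join(dirpath, name) for name in sorted(filenames)]
```

Files inside subdirectories are ignored here; walking deeper would be a separate design decision.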

Loaders: These would be responsible for loading the corpora made available by iranlowo. They should return a Corpus object.

class OweLoader(DirectoryCorpus):
    def __init__(self, path, **kwargs):
        # Call the direct parent, and pass the path argument (self.path is not set yet)
        super(OweLoader, self).__init__(path=path, **kwargs)

I imagine a downside of these features is that they might make the project bloated, but I think the benefits would outweigh this downside.

Niger-Volta-LTI / iranlowo

Corpus Loading Features #12