I had some free time this week and I was able to pen down some features I'm hoping we'll be able to include. These are:
A class for handling all forms of scraping. The API for this feature can act as an interface that other scrapers can be built on. We can leverage either bs4 or scrapy. I'm thinking of something like:
import scrapy


class BaseScrapper(scrapy.Spider):
    def __init__(self, name, urls, **kwargs):
        super(BaseScrapper, self).__init__(name, **kwargs)
        # Keep the URLs around so parse_urls can work on them
        self.urls = urls

    def parse_urls(self):
        # Do something to the URLs before starting
        pass

    def parse(self, response):
        # Crawling logic
        pass
Then a scraper like the Bibeli scraper can use this class:
class BibeliScrapper(BaseScrapper):
    # Logic goes here
    pass
The major advantage here is reusability: anyone can build their own Yoruba scraper with a minimal amount of work.
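To make the reuse idea a bit more concrete without pulling in scrapy or bs4, here is a rough standard-library-only sketch. The names `SketchScraper`, `ParagraphScraper` and `scrape_html` are hypothetical and only there for illustration; the real thing would build on scrapy or bs4 as proposed above.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class SketchScraper(ABC):
    """Hypothetical base class: shared URL handling, site-specific parsing."""

    def __init__(self, name, urls):
        self.name = name
        self.urls = self.parse_urls(urls)

    def parse_urls(self, urls):
        # Shared clean-up applied before crawling: strip whitespace, drop
        # empty entries and duplicates while keeping the original order.
        seen = []
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.append(url)
        return seen

    @abstractmethod
    def scrape_html(self, html):
        """Return the list of text items extracted from one HTML page."""


class _ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


class ParagraphScraper(SketchScraper):
    """Hypothetical subclass: only the page-specific logic lives here."""

    def scrape_html(self, html):
        collector = _ParagraphCollector()
        collector.feed(html)
        collector.close()
        return [p.strip() for p in collector.paragraphs if p.strip()]
```

The point is that a new scraper only has to supply `scrape_html`; the URL clean-up is inherited from the base class.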
Corpus class and DirectoryCorpus class (inspired by gensim)
This would be a class that can be used to load Yoruba corpora in various formats through a single API. It should support:
Streaming files
Reading various file formats, such as txt, gzip and csv.
Validating a file format. For example, if a user loads an Owe file, it should be able to validate that the content of the file conforms to that format.
Preprocessing while reading.
Generating random text
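As a rough idea of how streaming, multiple file formats and preprocessing-while-reading could fit together, here is a small standard-library sketch. The names `stream_lines` and `nfc_normalize` are hypothetical, and `nfc_normalize` is only a stand-in for whatever diacritics normalization the library ends up using.

```python
import csv
import gzip
import unicodedata


def nfc_normalize(text):
    # Stand-in preprocessing step: compose base letters and combining
    # diacritics into their canonical (NFC) form.
    return unicodedata.normalize("NFC", text)


def stream_lines(path, fformat="txt", preprocess=None):
    """Lazily yield preprocessed lines (txt, gzip) or rows of cells (csv)."""
    # Accept either a single callable or a list of callables.
    steps = preprocess or [nfc_normalize]
    if callable(steps):
        steps = [steps]

    def apply(text):
        for step in steps:
            text = step(text)
        return text

    if fformat == "gzip":
        handle = gzip.open(path, mode="rt", encoding="utf-8")
    elif fformat in ("txt", "csv"):
        # newline="" lets the csv module handle line endings itself.
        handle = open(path, mode="r", encoding="utf-8", newline="")
    else:
        raise ValueError("Unsupported file format: {}".format(fformat))

    with handle:
        if fformat == "csv":
            for row in csv.reader(handle):
                yield [apply(cell) for cell in row]
        else:
            for line in handle:
                yield apply(line.rstrip("\r\n"))
```

Because this is a generator, nothing is read until the caller iterates, so large files never have to fit in memory. Format validation and random text generation are left out here, since they depend on how each corpus format is defined.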
A commit for this is available here
The interface is described below:
from gensim import interfaces

# normalize_diacritics_text is assumed to be importable from iranlowo's preprocessing helpers.


class Corpus(interfaces.CorpusABC):
    def __init__(self, path=None, text=None, stream=False, fformat='txt', cformat=None, labels=False, preprocess=None):
        """
        Args:
            path:
            text:
        """
        self.path = path
        self.text = text
        self.labels = labels
        self.stream = stream
        self.fformat = fformat
        self.cformat = cformat
        self.preprocess = preprocess
        if not self.preprocess:
            self.preprocess = [normalize_diacritics_text]
        self.data = self.read_file_filename_or_text(text=text) if text else self.read_file_filename_or_text()
        self.validate_format()

    def __iter__(self):
        for line in self.data:
            yield line

    def __len__(self):
        return len(self.data)

    @staticmethod
    def save_corpus(fname, corpus, id2word=None, metadata=False):
        pass

    def streamfile(self, fobj):
        pass

    def read_file_filename_or_text(self, f=None, text=None):
        """
        Returns:
        """
        pass

    def handle_preprocessing(self, text):
        # A single callable is applied directly
        if callable(self.preprocess):
            return self.preprocess(text)
        # A list of callables is applied in order
        if isinstance(self.preprocess, list):
            for technique in self.preprocess:
                text = technique(text)
        return text

    def validate_format(self):
        """
        Returns:
        """

    def generate(self, size):
        """
        Args:
            size:
        Returns:
        """
        if not self.cformat:
            raise ValueError("You need to specify a format for generating random text")
import os

from gensim.utils import walk


class DirectoryCorpus(Corpus):
    def __init__(self, path, **kwargs):
        self.path_dir = path
        # gensim's walk yields (depth, dirpath, dirnames, filenames) tuples
        walked = list(walk(self.path_dir))
        self.depth = walked[0][0]
        self.dirnames = walked[0][2]
        self.flist = walked[0][3]
        self.path = list(self.read_files())
        super(DirectoryCorpus, self).__init__(path=self.path, **kwargs)

    def read_files(self):
        # Build the full path of every file found at the top level
        for path in self.flist:
            yield os.path.join(self.path_dir, path)
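For comparison, the top-level file listing that DirectoryCorpus relies on can also be sketched with only the standard library. The helper name `top_level_files` is hypothetical, and sorting the file names is my own addition for predictable ordering.

```python
import os


def top_level_files(path_dir):
    """Return sorted full paths of the files directly inside path_dir."""
    # os.walk yields (dirpath, dirnames, filenames); its first item is the
    # top-level directory, so subdirectories are never descended into.
    # The default covers a missing directory, for which os.walk yields nothing.
    dirpath, _dirnames, filenames = next(os.walk(path_dir), (path_dir, [], []))
    return [os.path.join(dirpath, name) for name in sorted(filenames)]
```

Note that `os.walk` yields 3-tuples, so the indices differ from the 4-tuple indexing used in the class above.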
Loaders: these would be responsible for loading the corpora made available by iranlowo. They should return a Corpus object.
class OweLoader(DirectoryCorpus):
    def __init__(self, path, **kwargs):
        # Call DirectoryCorpus.__init__ with the given path (self.path is not set yet at this point)
        super(OweLoader, self).__init__(path=path, **kwargs)
I imagine a downside of these features is that they might make the project bloated, but I think the benefits would outweigh this downside.