osahon-okungbowa closed 4 years ago
I have removed the old usage of the dataset model, since as of yesterday we started using the Scrapy pipeline. So no more dump() calls.
Here's a summary of my changes:
- `dataset_list` in the parsers is replaced by `yield dataset` whenever a dataset model is completed
- `slugify` is used to make slugs
- `Dataset` and `Resource` objects are now subclasses of Python dict (`scrapy.Item`, to be more specific) - this is mostly for compliance with the Scrapy way of doing things, but it also helps us a bit around the code. This time it made more sense to have it than on my first try.
- They are `Item` now, so no more need for `**kwargs`, it's built in :-)

I am going to run the code from this branch now (i.e. not merge yet), so you can have a look at the diff and maybe have suggestions.
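For anyone reading the diff, the yield-based flow summarized above can be sketched roughly as follows. This is a plain-Python stand-in, not the code in this branch: the `Dataset` class here only imitates the declared-fields behaviour of a Scrapy item, and the field names, the `slugify` helper, `parse`, and `process_item` are all assumptions made for illustration.

```python
import re


class Dataset(dict):
    """Plain-Python stand-in for a Scrapy item: a dict limited to declared fields."""

    # Hypothetical field names, chosen only for this sketch.
    fields = ("title", "slug", "source_url", "resources")

    def __init__(self, **kwargs):
        super().__init__()
        # Keyword arguments are accepted directly, so no extra **kwargs handling
        # is needed in the callers.
        for key, value in kwargs.items():
            self[key] = value

    def __setitem__(self, key, value):
        # Reject fields that were not declared on the class.
        if key not in self.fields:
            raise KeyError(f"Dataset does not support field: {key}")
        super().__setitem__(key, value)


def slugify(text):
    """Minimal slug helper; the project may well use a library for this."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def parse(page):
    """Yield each dataset as soon as it is complete instead of collecting a list."""
    for title in page["titles"]:
        yield Dataset(
            title=title,
            slug=slugify(title),
            source_url=page["url"],
            resources=[],
        )


def process_item(item):
    """Stand-in for a pipeline step that receives every yielded dataset."""
    return dict(item)


page = {"url": "http://example.com/data", "titles": ["Budget 2014", "Health Facilities"]}
results = [process_item(dataset) for dataset in parse(page)]
```

The point of the design is that the parser no longer owns a `dataset_list` or calls `dump()` itself; it only yields, and whatever consumes the generator decides how to store the result.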
ocr-parser complete. Completed the following:
I had to update the dump() for the Dataset model to allow for the creation of multiple unique json files from the same source_url. This is necessary to cater for pages that produce multiple datasets per page.
I also added kwargs to the Dataset & Resource model __init__() to allow me to instantiate objects with attributes not yet defined. I didn't do anything with the kwargs.
I added a 'parsers' package to the 'ocr' package and a 'parser' module to the 'base' package
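A rough sketch of the two model changes listed above, for illustration only. It assumes the file name is derived from a hash of the source_url combined with the title; the actual naming scheme in this branch may differ, and the attribute names and the `file_name` helper are hypothetical.

```python
import hashlib
import json
import os
import tempfile


class Dataset:
    """Hypothetical model; attribute names are illustrative only."""

    def __init__(self, source_url, title, **kwargs):
        self.source_url = source_url
        self.title = title
        # Extra keyword arguments are accepted but deliberately ignored,
        # so callers can pass attributes the model does not define yet.

    def file_name(self):
        # Hash the URL together with the title so that two datasets scraped
        # from the same page do not overwrite each other.
        key = f"{self.source_url}|{self.title}".encode("utf-8")
        return hashlib.sha1(key).hexdigest() + ".json"

    def dump(self, directory):
        # Write one json file per dataset and return its path.
        path = os.path.join(directory, self.file_name())
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"source_url": self.source_url, "title": self.title}, handle)
        return path


out_dir = tempfile.mkdtemp()
url = "http://example.com/page"
first = Dataset(url, "Table A", license="unknown").dump(out_dir)
second = Dataset(url, "Table B").dump(out_dir)
```

Both calls share the same source_url but produce different files, which is the behaviour needed for pages that contain more than one dataset.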