CormacCollins / product_crawler


nutricia crawler created #11

Closed Ryan-Ashton closed 4 years ago

Ryan-Ashton commented 4 years ago

Cheeky crawler created. (CCC) - Cormac C Collins @CormacCollins

CormacCollins commented 4 years ago

Looks good. It now needs to be split into a 'link loader' class and a 'crawler' class. I haven't seen the output, but if you haven't already, I would also recommend doing as much of the text cleaning as possible at this stage instead of at the 'workbook' stage. We can be confident that this step will stay automated and won't need to change much once it's running, whereas small changes in the 'workbook' later on might require a bigger refactoring of that workbook code. So we are essentially giving a good amount of work to the crawler to minimise having to do it at the 'workbook' level; the workbook level should, as much as possible, be reserved for modelling and setting up the data structure. Something simple like removing a bunch of Unicode characters or newline characters is easy to do in the crawler.
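As a rough sketch of what that could look like (not the repo's actual code): the names `LinkLoader`, `Crawler`, `clean_text` and the injected `fetch` callable below are all hypothetical, and the cleaning rules are only one reasonable guess at "removing Unicode and newline characters".

```python
import re
import unicodedata
from typing import Callable, Dict, Iterable, Iterator


def clean_text(raw: str) -> str:
    """Normalise scraped text so the workbook stage receives tidy strings."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces) into plain ones.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control/format characters (Cc, Cf) but keep whitespace for the next step.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


class LinkLoader:
    """Hypothetical loader: only knows which product links to visit."""

    def __init__(self, links: Iterable[str]) -> None:
        # Strip whitespace, skip blanks, and de-duplicate while keeping order.
        stripped = (link.strip() for link in links)
        self._links = list(dict.fromkeys(link for link in stripped if link))

    def __iter__(self) -> Iterator[str]:
        return iter(self._links)


class Crawler:
    """Hypothetical crawler: fetches each link and cleans the text it returns."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        # `fetch` is injected (assumption) so the HTTP layer can be swapped or faked.
        self._fetch = fetch

    def crawl(self, links: Iterable[str]) -> Dict[str, str]:
        return {link: clean_text(self._fetch(link)) for link in links}
```

With this split, the workbook code would only ever see already-cleaned strings keyed by link, so changes to the cleaning rules stay inside the crawler.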