MrDiggles2 / cru-scrape

Scraper of CRU sites

Update crawling script to replace consecutive instances of whitespace with a single space. #15

Closed by MrDiggles2 1 month ago

MrDiggles2 commented 1 month ago

For example, instead of

                Department\n      of Fishery and Wildlife Sciences  New\n      Mexico State University  P.O.\n      Box 30003, Campus Box 4901  Las\n      Cruces, New Mexico 88003-0003  phone\n    ...

We should preprocess the text so that it looks like

Department of Fishery and Wildlife Sciences New Mexico State University P.O. Box 30003, Campus Box 4901 Las Cruces, New Mexico 88003-0003 phone ...

The relevant code is in src/spider.py (probably in text_from_html).
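A minimal sketch of the preprocessing described above, assuming the extracted text is a plain Python string. The helper name `collapse_whitespace` is hypothetical and not taken from src/spider.py; trimming the leading and trailing whitespace is also an assumption, based on the desired output starting directly at "Department".

```python
import re

# One or more whitespace characters: spaces, tabs, newlines, and so on.
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with a single space.

    Hypothetical helper; the strip() at the end is an assumption so that
    the result does not begin or end with a stray space.
    """
    return _WHITESPACE_RUN.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "                Department\n      of Fishery and Wildlife Sciences  New\n      Mexico State University"
    print(collapse_whitespace(raw))
    # → Department of Fishery and Wildlife Sciences New Mexico State University
```

If the fix lands in text_from_html, one option would be to apply a helper like this to the string just before it is returned, so every caller gets normalized text.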

MrDiggles2 commented 1 month ago

Closed with https://github.com/MrDiggles2/cru-scrape/pull/16