century-arcade / xd

a futureproof crossword corpus toolset
MIT License

wapost: only on sundays #3

Closed saulpw closed 8 years ago

saulpw commented 8 years ago

The other 6 days will fail, so don't bother with those requests.

Also theglobeandmail_canadian.

vdraceil commented 8 years ago
[localhost][development]~/Workplace/Freelance/web_scraping/Saul_Pwanson/xd
>>>python main.py --download-xd -s theglobeandmail_canadian -o a.zip -f 2016-02-06 -t 2016-02-16                                                               
Processing Crossword for date - 2016-02-06
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-07
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-08
Processing Crossword for date - 2016-02-09
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-10
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-11
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-12
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-13
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-14
        ERR: No Crossword for date
Processing Crossword for date - 2016-02-15
Processing Crossword for date - 2016-02-16
        ERR: No Crossword for date

[localhost][development]~/Workplace/Freelance/web_scraping/Saul_Pwanson/xd
>>>unzip a.zip 
Archive:  a.zip
  inflating: crosswords-theglobeandmail_canadian/2016/theglobeandmail_canadian-2016-02-08.xd  
  inflating: crosswords-theglobeandmail_canadian/2016/theglobeandmail_canadian-2016-02-15.xd 

[localhost][development]~/Workplace/Freelance/web_scraping/Saul_Pwanson/xd
>>>

As you can see, the scraper fails on days for which there is no puzzle, reports the error, and simply moves on. In the run above, theglobeandmail_canadian has valid puzzles only on 8 Feb 2016 and 15 Feb 2016, and these are exactly the files that were created in the output zip.

Changing the generic code to special-case such websites is not a good idea, because:

  1. What if the website later decides to post puzzles on Fridays instead of Sundays? We would have to go in and modify the code, check in, re-test and deploy, which is a bit of an overhead.
  2. What if the website posts a special puzzle on some special day (like New Year)? We would definitely miss out on those extra puzzles if we hard-code hopping in intervals of 7 days from some base date (8 Feb).

With the code kept generic as it is now, both of the above scenarios are covered with ease. It's okay to let the scraper check the website for a puzzle and fail if one is not available; we're good as long as the failure doesn't abruptly stop execution.
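
To make the idea concrete, here is a rough sketch of the "try every date, report the miss, keep going" loop. It is not the actual code in main.py; `NoCrosswordError`, `daterange`, `download_range` and `fake_fetch` are names made up for this example, and `fake_fetch` only stands in for a real scraper by pretending that puzzles exist on the two dates seen in the run above.

```python
from datetime import date, timedelta


class NoCrosswordError(Exception):
    """Hypothetical error raised when a source has no puzzle for a date."""


def daterange(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def download_range(fetch, start, end):
    """Try every date; collect puzzles and skip dates that have none."""
    found = {}
    for day in daterange(start, end):
        print("Processing Crossword for date - %s" % day.isoformat())
        try:
            found[day] = fetch(day)
        except NoCrosswordError:
            # Report the miss and keep going instead of aborting the run.
            print("\tERR: No Crossword for date")
    return found


def fake_fetch(day):
    """Stand-in for a real scraper (assumption): two dates have puzzles."""
    if day in (date(2016, 2, 8), date(2016, 2, 15)):
        return "puzzle for %s" % day.isoformat()
    raise NoCrosswordError(day)


puzzles = download_range(fake_fetch, date(2016, 2, 6), date(2016, 2, 16))
print(sorted(d.isoformat() for d in puzzles))
```

Because no weekday is baked into the loop, a source that moves its puzzle to another day, or posts an extra one, would be picked up without any code change; the only cost is one failed request per empty day.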