HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

If download is interrupted, no intermediate results are stored. #18

Closed: widdowquinn closed this issue 3 years ago

widdowquinn commented 3 years ago

Downloading significant amounts of data may take some time. If the run is interrupted for any reason, the script stops and none of the gathered data is made available to the user. This could be extremely frustrating and discourage reuse.

Some options to provide kinder behaviour could include:

widdowquinn commented 3 years ago

Example download failure:

$ cazy_webscraper.py -g me@my.domain -l test.log -o outdir
cazy_webscraper: 2020-12-03 14:26:51,639 - Run initiated
cazy_webscraper: 2020-12-03 14:26:51,639 - Creating directory outdir
cazy_webscraper: 2020-12-03 14:26:51,640 - Finished program preparation
cazy_webscraper: 2020-12-03 14:26:51,640 - Starting retrieval of data from CAZy
cazy_webscraper: 2020-12-03 14:26:51,640 - Retrieving URLs to summary CAZy class pages
[...]
Retrieving proteins from GH13: 16000it [02:03, 130.03it/s]
Parsing CAZy families:   7%|█▍                  | 12/169 [18:58<4:08:14, 94.87s/it]
Parsing CAZy classes:   0%|                    | 0/6 [18:59<?, ?it/s]
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/cazy-test-env/bin/cazy_webscraper.py", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cazy_webscraper.py')())
[...]
  File "/Users/lpritc/opt/anaconda3/envs/cazy-test-env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt
$ ls outdir/
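
One way to make an interruption like the one above less punishing is to catch the `KeyboardInterrupt` (or a network error) and flush whatever has already been gathered before exiting. The sketch below is a minimal illustration of that idea, not code from this repository; `fetch_family` and the checkpoint filename are hypothetical stand-ins for the real per-family download step.

```python
import json
import sys

def scrape_with_checkpoint(families, fetch_family, checkpoint="partial_results.json"):
    """Collect records family by family; on interruption, dump whatever
    has been gathered so far instead of discarding it.

    `fetch_family` is a hypothetical callable returning the protein
    records for one CAZy family.
    """
    gathered = {}
    try:
        for family in families:
            gathered[family] = fetch_family(family)
    except KeyboardInterrupt:
        # Ctrl-C no longer throws away the families already completed.
        with open(checkpoint, "w") as fh:
            json.dump(gathered, fh)
        print(f"Interrupted: saved {len(gathered)} families to {checkpoint}",
              file=sys.stderr)
        raise SystemExit(1)
    return gathered
```

On the run shown above, a scheme like this would have preserved the 12 families that finished parsing rather than leaving `outdir/` empty.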

HobnobMancer commented 3 years ago

Added. All SQL interaction is handled by the `scraper.sql` module.
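
Writing each record to a local database as it is retrieved means an interrupted run keeps everything fetched so far, and a re-run can skip work already done. The sketch below shows the general pattern with the standard-library `sqlite3` module; the table layout, `fetch_proteins` callable, and database filename are illustrative assumptions, not the actual schema of `scraper.sql`.

```python
import sqlite3

def scrape_family(family_name, fetch_proteins, db_path="cazy_scrape.db"):
    """Persist each protein record the moment it is retrieved, so an
    interrupted run still leaves all previously fetched records on disk.

    `fetch_proteins` is a hypothetical iterable of (accession, family)
    tuples standing in for the scraper's per-family download loop.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proteins ("
        "accession TEXT PRIMARY KEY, family TEXT)"
    )
    try:
        for accession, family in fetch_proteins(family_name):
            # INSERT OR IGNORE makes re-runs after an interruption
            # idempotent: already-stored accessions are skipped.
            conn.execute(
                "INSERT OR IGNORE INTO proteins VALUES (?, ?)",
                (accession, family),
            )
            conn.commit()  # commit per record: nothing fetched is lost
    finally:
        conn.close()
```

Committing per record trades some throughput for durability; batching commits every few hundred rows would be a reasonable middle ground for large families.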