HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

If download is interrupted, no intermediate results are stored. #18

Closed: widdowquinn closed this issue 3 years ago

widdowquinn commented 3 years ago

Downloading significant amounts of data may take some time. If the run is interrupted for any reason, the script stops and none of the gathered data is made available to the user. This could be extremely frustrating and discourage reuse.

Some options to provide kinder behaviour could include:

widdowquinn commented 3 years ago

Example download failure:

$ cazy_webscraper.py -g me@my.domain -l test.log -o outdir
cazy_webscraper: 2020-12-03 14:26:51,639 - Run initiated
cazy_webscraper: 2020-12-03 14:26:51,639 - Creating directory outdir
cazy_webscraper: 2020-12-03 14:26:51,640 - Finished program preparation
cazy_webscraper: 2020-12-03 14:26:51,640 - Starting retrieval of data from CAZy
cazy_webscraper: 2020-12-03 14:26:51,640 - Retrieving URLs to summary CAZy class pages
[...]
Retrieving proteins from GH13: 16000it [02:03, 130.03it/s]
Parsing CAZy families:   7%|█▍                  | 12/169 [18:58<4:08:14, 94.87s/it]
Parsing CAZy classes:   0%|                    | 0/6 [18:59<?, ?it/s]
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/cazy-test-env/bin/cazy_webscraper.py", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cazy_webscraper.py')())
[...]
  File "/Users/lpritc/opt/anaconda3/envs/cazy-test-env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt
$ ls outdir/
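
One way to make an interruption like the one above less punishing is to catch the `KeyboardInterrupt` (or a network error) and flush whatever has already been gathered before exiting. The sketch below is a minimal illustration of that idea, not code from this repository; `fetch_family` and the checkpoint filename are hypothetical stand-ins for the real per-family download step.

```python
import json
import sys

def scrape_with_checkpoint(families, fetch_family, checkpoint="partial_results.json"):
    """Collect records family by family; on interruption, dump whatever
    has been gathered so far instead of discarding it.

    `fetch_family` is a hypothetical callable returning the protein
    records for one CAZy family.
    """
    gathered = {}
    try:
        for family in families:
            gathered[family] = fetch_family(family)
    except KeyboardInterrupt:
        # Ctrl-C no longer throws away the families already completed.
        with open(checkpoint, "w") as fh:
            json.dump(gathered, fh)
        print(f"Interrupted: saved {len(gathered)} families to {checkpoint}",
              file=sys.stderr)
        raise SystemExit(1)
    return gathered
```

On the run shown above, a scheme like this would have preserved the 12 families that finished parsing rather than leaving `outdir/` empty.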

HobnobMancer commented 3 years ago

Added. All SQL interaction is handled by the `scraper.sql` module.
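
Writing each record to a local database as it is retrieved means an interrupted run keeps everything fetched so far, and a re-run can skip work already done. The sketch below shows the general pattern with the standard-library `sqlite3` module; the table layout, `fetch_proteins` callable, and database filename are illustrative assumptions, not the actual schema of `scraper.sql`.

```python
import sqlite3

def scrape_family(family_name, fetch_proteins, db_path="cazy_scrape.db"):
    """Persist each protein record the moment it is retrieved, so an
    interrupted run still leaves all previously fetched records on disk.

    `fetch_proteins` is a hypothetical iterable of (accession, family)
    tuples standing in for the scraper's per-family download loop.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proteins ("
        "accession TEXT PRIMARY KEY, family TEXT)"
    )
    try:
        for accession, family in fetch_proteins(family_name):
            # INSERT OR IGNORE makes re-runs after an interruption
            # idempotent: already-stored accessions are skipped.
            conn.execute(
                "INSERT OR IGNORE INTO proteins VALUES (?, ?)",
                (accession, family),
            )
            conn.commit()  # commit per record: nothing fetched is lost
    finally:
        conn.close()
```

Committing per record trades some throughput for durability; batching commits every few hundred rows would be a reasonable middle ground for large families.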