HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

Do not make complete download of CAZy the default operation #49

Closed: widdowquinn closed this issue 2 years ago

widdowquinn commented 3 years ago

Is your feature request related to a problem? Please describe.

Because the default operation of cazy_webscraper is "download all of CAZy", it is very easy for users to overwhelm the service, denying it to others.

Describe the solution you'd like

Either restrict operation to downloading only specific classes/families, or make it more difficult to request a download of the complete CAZy database.

Whatever solution is used, downloading the entire database should be a conscious act for the user, not the default when running the tool with no arguments.
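
One way to make the complete download a conscious act is to have the argument parser refuse to run when nothing has been selected. The following is only a minimal sketch of that idea: the `--classes`, `--families` and `--complete_download` option names and the `build_parser`/`resolve_scope` helpers are assumptions made for illustration, not the tool's actual interface.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical options (not the tool's real CLI)."""
    parser = argparse.ArgumentParser(prog="cazy_webscraper")
    parser.add_argument("--classes", default=None,
                        help="comma-separated CAZy classes, e.g. GH,PL")
    parser.add_argument("--families", default=None,
                        help="comma-separated CAZy families, e.g. GH1,PL9")
    parser.add_argument("--complete_download", action="store_true",
                        help="explicitly request the whole database")
    return parser


def resolve_scope(argv):
    """Return what to download, refusing to default to everything."""
    parser = build_parser()
    args = parser.parse_args(argv)
    selection = {
        "classes": args.classes.split(",") if args.classes else [],
        "families": args.families.split(",") if args.families else [],
    }
    has_selection = bool(selection["classes"] or selection["families"])
    if args.complete_download and has_selection:
        # parser.error prints a usage message and exits with status 2
        parser.error("--complete_download cannot be combined with "
                     "--classes or --families")
    if not args.complete_download and not has_selection:
        parser.error("nothing selected: pass --classes/--families, "
                     "or --complete_download to fetch everything")
    selection["complete"] = args.complete_download
    return selection
```

With this arrangement, running the tool with no arguments exits with a usage error instead of silently starting a full download.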

widdowquinn commented 3 years ago

Additionally, I would suggest strongly restricting the rate of requests when a user asks to download the complete database. A warning should also be provided.

For instance, if downloading the entire CAZy database were only possible when a --complete_download flag is set, the output might look like:

$ cazy_webscraper --complete_download
[WARNING] Downloading the complete CAZy database uses a large amount of bandwidth and may cause the CAZy service to deteriorate for other users.
[WARNING] Please consider downloading only the sequences you need, if possible.
[WARNING] Due to the large size of this request, the rate of requests will be limited to <SOME VALUE> per second to help maintain CAZy service for other users.
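
The warnings and rate restriction suggested above could be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: `RateLimiter`, `warn_complete_download`, `fetch_all` and the 1 request per second placeholder are all hypothetical, and the actual limit is left open in this thread.

```python
import logging
import time

logger = logging.getLogger("cazy_webscraper_sketch")

# Placeholder only: the thread does not fix the actual value.
MAX_REQUESTS_PER_SECOND = 1.0


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval has elapsed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def warn_complete_download(max_per_second):
    """Log the warnings proposed for a complete download."""
    logger.warning("Downloading the complete CAZy database uses a large "
                   "amount of bandwidth and may cause the CAZy service to "
                   "deteriorate for other users.")
    logger.warning("Please consider downloading only the sequences you "
                   "need, if possible.")
    logger.warning("The rate of requests will be limited to %s per second.",
                   max_per_second)


def fetch_all(urls, fetch, limiter):
    """Call fetch(url) for each URL, pausing between requests."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

The clock and sleep functions are injectable so the pacing can be checked in tests without real delays; in normal use the defaults (`time.monotonic` and `time.sleep`) apply.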

HobnobMancer commented 2 years ago

This is no longer relevant for cazy_webscraper V2, because downloading all of CAZy now requires only a single call to CAZy to download the txt file. The default behaviour is therefore to download all of CAZy, especially as this takes only 10-15 minutes.