Add command-line options for class, family, species, etc.

HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.

https://hobnobmancer.github.io/cazy_webscraper/

MIT License

Add command-line options for class, family, species, etc. #16

Closed widdowquinn closed 3 years ago

widdowquinn commented 3 years ago

The config file approach is great if you want or need to preserve or repeat your query.

If you want a "quick" search (e.g. when testing) then it would be good to be able to specify options directly on the command line. It also lowers the barrier to entry for new users (not everyone knows how to create, or debug, a YAML file).

For instance:

cazy_webscraper -g me@my.domain -o outdir --class GH
cazy_webscraper -g me@my.domain -o outdir --family GH1
cazy_webscraper -g me@my.domain -o outdir --class GH --species Acinetobacter
cazy_webscraper -g me@my.domain -o outdir --family GH1 --species "Pectobacterium atrosepticum" -p pdb

HobnobMancer commented 3 years ago

That's a good point!

I'll add the command line option to specify families/classes to scrape, once I've updated a couple of the unit tests to add assertions and increase code coverage.

Is there a way to pass a list as an argument at the command line? For example, could the user specify GH1 and GH2 for scraping in a single command, or would they have to call the scraper separately for each family?

HobnobMancer commented 3 years ago

Will the scraping by species name or tax id involve interacting with the CAZy website to perform a search then retrieving the data that way?

widdowquinn commented 3 years ago

Is there a way to pass a list as an argument at the command line? For example, could the user specify GH1 and GH2 for scraping in a single command, or would they have to call the scraper separately for each family?

cazy_webscraper -g me@my.domain -o outdir --family GH1,GH2

Then parse the argument as:

arg.strip().split(",")
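
As a rough illustration of the comma-separated approach suggested above, the split can be attached to the option itself through argparse's type hook. This is only a sketch: the helper name comma_separated, the build_parser function and the dest names classes and families are made up for this example and are not taken from the cazy_webscraper code base.

```python
import argparse
from typing import List


def comma_separated(value: str) -> List[str]:
    """Split a comma-separated argument into a list, dropping empty items."""
    return [item.strip() for item in value.strip().split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser covering only the two options discussed here.
    parser = argparse.ArgumentParser(prog="cazy_webscraper")
    # "class" is a reserved word in Python, so the value is stored as "classes".
    parser.add_argument("--class", dest="classes", type=comma_separated, default=[])
    parser.add_argument("--family", dest="families", type=comma_separated, default=[])
    return parser


args = build_parser().parse_args(["--family", "GH1,GH2", "--class", "PL"])
print(args.families)  # ['GH1', 'GH2']
print(args.classes)   # ['PL']
```

With this in place a single call can name several families, and stray spaces or a trailing comma in the argument are tolerated.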

widdowquinn commented 3 years ago

Will the scraping by species name or tax id involve interacting with the CAZy website to perform a search then retrieving the data that way?

Yes.

We can get to class pages more or less directly. If a family is specified, then the class is easily inferred (capture the [A-Z]+ part of [A-Z]+[0-9]+). We can also check if families are already implied by a --class argument. We reduce the amount of checking/scraping we do - but we don't eliminate all searches to get to the appropriate page without intermediate checks.
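
A minimal sketch of the class inference described above, assuming family names follow the [A-Z]+[0-9]+ pattern exactly. The function names class_of and families_not_covered are hypothetical, and names that deviate from the pattern are rejected rather than handled.

```python
import re
from typing import Iterable, List, Set

# Letters are the class (e.g. GH), digits are the family number (e.g. 1).
FAMILY_PATTERN = re.compile(r"([A-Z]+)([0-9]+)")


def class_of(family: str) -> str:
    """Return the class prefix of a family name, e.g. 'GH1' -> 'GH'."""
    match = FAMILY_PATTERN.fullmatch(family)
    if match is None:
        raise ValueError(f"not a recognised family name: {family!r}")
    return match.group(1)


def families_not_covered(families: Iterable[str], classes: Set[str]) -> List[str]:
    """Drop families whose whole class was already requested via --class."""
    return [family for family in families if class_of(family) not in classes]


print(class_of("GH1"))                                # GH
print(families_not_covered(["GH1", "PL1"], {"GH"}))   # ['PL1']
```

Here GH1 is dropped because the whole GH class was already requested, so only PL1 needs its own family page.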

HobnobMancer commented 3 years ago

At the moment:

If specific classes/families are specified, only those classes/families are scraped from CAZy.
If a user wants to retrieve the CAZyome of a species then the entirety of CAZy needs to be scraped.

If a user wants to scrape specific class/families in their entirety and retrieve the CAZyome of a species then these two processes clash.

Do you think users would find it acceptable if scraping by a species' scientific name takes precedence over scraping entire classes/families? In practice this would mean that if a scientific name (or names) is given, then the CAZyome of that species is retrieved from CAZy. If the user specifies a species and classes/families, then only those proteins that belong to the species and are catalogued under the specified families/classes will be retrieved. This means that if any species are named, the scraping of CAZy will be restricted to those species, and if the user wants the complete class/family they will need to invoke the webscraper again.

widdowquinn commented 3 years ago

If I understand you correctly, I think we'd both expect the same behaviour:

If the user specifies --class only, all sequences of that class are recovered.
If the user specifies --family only, or in addition to --class, all sequences of that family are recovered.
If the user specifies --genus or --species (really meaning "<genus> <species>") only, then all sequences corresponding to that binomial name are recovered.
If the user specifies --genus or --species in addition to --class or --family, then the binomial is a filter on the results that would otherwise be returned. So, --class GH --family PL1 --species Dickeya would collect all GH class sequences and all PL1 family sequences that derive from a Dickeya.

I think this is consistent with always applying the union of --class and --family (why specify both otherwise?), and then taking the intersection of (--class union --family) with --species (or --kingdom).

Is my intent clearer, here?
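
The selection rule described in this comment (the union of --class and --family, intersected with --species) could be sketched as a filter over already-retrieved records. Everything here is illustrative: the Record type, the select function, the prefix match on the organism name and the example records are assumptions made for this sketch, not the scraper's actual data model or real CAZy entries.

```python
from typing import Iterable, List, NamedTuple, Sequence, Set


class Record(NamedTuple):
    # Hypothetical, simplified view of one catalogued protein.
    accession: str
    cazy_class: str
    family: str
    organism: str


def select(records: Iterable[Record], classes: Set[str], families: Set[str],
           species: Sequence[str]) -> List[Record]:
    """Keep records in (classes UNION families), then filter by species if given."""
    selected = []
    for record in records:
        # No class/family restriction means every record passes this stage.
        in_group = (not classes and not families) \
            or record.cazy_class in classes or record.family in families
        # Assumption: a species or genus matches by prefix of the organism name.
        in_species = not species or any(record.organism.startswith(name) for name in species)
        if in_group and in_species:
            selected.append(record)
    return selected


# Made-up example records, for illustration only.
records = [
    Record("a", "GH", "GH1", "Dickeya dadantii"),
    Record("b", "PL", "PL1", "Dickeya solani"),
    Record("c", "PL", "PL1", "Escherichia coli"),
    Record("d", "CE", "CE1", "Dickeya dadantii"),
]
hits = select(records, {"GH"}, {"PL1"}, ["Dickeya"])
print([record.accession for record in hits])  # ['a', 'b']
```

Record c is excluded by the species filter and record d by the class/family union, which matches the --class GH --family PL1 --species Dickeya example.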

HobnobMancer commented 3 years ago

Yep, that makes sense! I wanted to check we were on the same page before going ahead. Is retrieval by kingdom something you want me to look into afterwards?

widdowquinn commented 3 years ago

I think it makes sense if you aren't interested in e.g. eukaryotes or archaea to only want to retrieve bacterial sequences.