Closed widdowquinn closed 3 years ago
That's a good point!
I'll add the command line option to specify families/classes to scrape, once I've updated a couple of the unit tests to add assertions and increase code coverage.
Is there a way to pass a list as an argument at the command line? For example, could the user specify GH1 and GH2 for scraping in a single command, or would they have to call the scraper separately for each family?
Will the scraping by species name or tax id involve interacting with the CAZy website to perform a search then retrieving the data that way?
Is there a way to pass a list as an argument at the command line? For example, could the user specify GH1 and GH2 for scraping in a single command, or would they have to call the scraper separately for each family?
cazy_webscraper -g me@my.domain -o outdir --family GH1,GH2
Then parse the argument as:
arg.strip().split(",")
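As a minimal sketch of the comma-separated approach above, assuming the scraper uses argparse: the parser below models only the --family option from the example, and the helper names `comma_separated` and `build_parser` are hypothetical, not part of cazy_webscraper.

```python
import argparse


def comma_separated(value):
    """Split a comma-separated option value into a list of non-empty items."""
    return [item.strip() for item in value.strip().split(",") if item.strip()]


def build_parser():
    # Hypothetical parser: only the --family option from the example is modelled
    parser = argparse.ArgumentParser(prog="cazy_webscraper")
    parser.add_argument("--family", type=comma_separated, default=[])
    return parser


# Parse an explicit argument list so the sketch runs without a real command line
args = build_parser().parse_args(["--family", "GH1,GH2"])
print(args.family)  # ['GH1', 'GH2']
```

Doing the split in a `type=` callable means the rest of the code always sees a list, whether the user gave one family or several.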
Will the scraping by species name or tax id involve interacting with the CAZy website to perform a search then retrieving the data that way?
Yes.
We can get to class pages more or less directly. If a family is specified, then the class is easily inferred (capture the [A-Z]+ part of [A-Z]+[0-9]+). We can also check if families are already implied by a --class argument. We reduce the amount of checking/scraping we do - but we don't eliminate all searches to get to the appropriate page without intermediate checks.
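The class inference described above could look something like the sketch below. It uses only the [A-Z]+[0-9]+ pattern mentioned in the comment; the function names are hypothetical, and family names that do not fit that pattern are simply reported as having no inferable class.

```python
import re

# Capture the [A-Z]+ part of [A-Z]+[0-9]+, as described in the comment above
FAMILY_PATTERN = re.compile(r"([A-Z]+)[0-9]+")


def infer_class(family):
    """Return the class prefix of a family name, or None if it does not match."""
    match = FAMILY_PATTERN.fullmatch(family)
    return match.group(1) if match else None


def families_not_covered(families, classes):
    """Keep only the families whose class was not already requested via --class."""
    return [fam for fam in families if infer_class(fam) not in classes]


print(infer_class("GH1"))                            # GH
print(families_not_covered(["GH1", "PL1"], {"GH"}))  # ['PL1']
```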
At the moment:
If a user wants to scrape specific classes/families in their entirety and also retrieve the CAZyome of a species, then these two processes clash.
Do you think users would find it acceptable if scraping by a species' scientific name takes precedence over scraping entire classes/families? In practice this would mean that if a scientific name (or names) is given, then the CAZyome of the species is retrieved from CAZy. If the user specifies a species and classes/families, then only those CAZymes that belong to the species and are catalogued under the specified families/classes will be retrieved. This means that if any species are named, the scraping of CAZy will be restricted to these species, and if the user wants the complete class/family they will need to invoke the webscraper again.
If I understand you correctly, I think we'd both expect the same behaviour:
- --class only: all sequences of that class are recovered
- --family only, or in addition to --class: all sequences of that family are recovered
- --genus or --species (really meaning "<genus> <species>") only: all sequences corresponding to that filtered binomial nomenclature are recovered
- --genus or --species in addition to --class or --family: the binomial is a filter on the results that would otherwise be returned. So, --class GH --family PL1 --species Dickeya would collect:
I think this is consistent with always applying the union of --class and --family (why specify both otherwise), but the intersection of (--class union --family) and --species (or --kingdom).
Is my intent clearer, here?
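One way to sketch the union/intersection rule described above is shown below. It is a rough illustration under stated assumptions, not the scraper's implementation: the record layout (`class`, `family`, `organism` keys), the function name `is_selected`, the prefix matching on organism names, and the example records are all made up for this sketch.

```python
def is_selected(record, classes=(), families=(), species=()):
    """Apply (class UNION family) INTERSECTION species to one record.

    An empty filter places no restriction on the record.
    """
    if classes or families:
        # Union: the record passes if either its class or its family was requested
        in_group = record["class"] in classes or record["family"] in families
    else:
        in_group = True
    if species:
        # Assumption: a name matches if the organism string starts with it,
        # so a genus alone ("Dickeya") matches every species in that genus
        in_species = any(record["organism"].startswith(name) for name in species)
    else:
        in_species = True
    # Intersection of the two filters
    return in_group and in_species


# Illustrative records only, not data retrieved from CAZy
records = [
    {"class": "GH", "family": "GH1", "organism": "Dickeya dadantii"},
    {"class": "PL", "family": "PL1", "organism": "Dickeya solani"},
    {"class": "PL", "family": "PL3", "organism": "Dickeya dadantii"},
    {"class": "GH", "family": "GH1", "organism": "Escherichia coli"},
]
selected = [
    r for r in records
    if is_selected(r, classes={"GH"}, families={"PL1"}, species=["Dickeya"])
]
print([r["family"] for r in selected])  # ['GH1', 'PL1']
```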
Yep, that makes sense! I wanted to check we were on the same page before going ahead. Is retrieval by kingdom something you want me to look into afterwards?
I think it makes sense if you aren't interested in e.g. eukaryotes or archaea to only want to retrieve bacterial sequences.
The config file approach is great if you want or need to preserve or repeat your query.
If you want a "quick" search (e.g. when testing) then it would be good to have the option to specify options directly. It also lowers the barrier of entry for new users (not everyone knows how to create, or debug, a YAML file).
For instance: