Parallel extraction from various data sources

mamerisawesome commented 4 years ago

Since the intent is to make the library a central hub for various datasets, we might need to find a way to reduce extraction overhead. One thing I have in mind is to run extractions in parallel.

Issues and / or Suggestions

Consider how to avoid CPU failures caused by load spikes on different compute instances:
- We can define a parameter (through an environment variable or a function parameter) to turn off parallelization
- It is also possible to detect CPU capacity automatically and pick an execution strategy that suits the environment
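The two options above could be combined roughly as follows. This is only a sketch under assumptions: the `PHCOVID_PARALLEL` environment variable, `pick_worker_count`, and `extract_all` are hypothetical names, not part of phcovid, and threads are chosen on the assumption that extraction is mostly I/O-bound downloading.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical environment variable name; not part of phcovid.
PARALLEL_ENV_VAR = "PHCOVID_PARALLEL"


def pick_worker_count(n_sources, parallel=None):
    """Return how many workers to use for n_sources extractions."""
    if parallel is None:
        # No explicit argument: read the environment variable.
        # Any value other than "0" leaves parallelization on.
        parallel = os.environ.get(PARALLEL_ENV_VAR, "1") != "0"
    if not parallel or n_sources <= 1:
        return 1
    # os.cpu_count() may return None, so fall back to a single worker.
    return max(1, min(n_sources, os.cpu_count() or 1))


def extract_all(extractors, parallel=None):
    """Run every extractor callable and return results keyed by source name."""
    workers = pick_worker_count(len(extractors), parallel)
    if workers == 1:
        # Sequential path, used when parallelization is off or pointless.
        return {name: fn() for name, fn in extractors.items()}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in extractors.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

A caller would pass a dict of zero-argument callables, one per data source, and could force the sequential path with `extract_all(extractors, parallel=False)` or by setting the environment variable to `0`.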
spaCy and NLTK can download the datasets you need on demand; maybe we can mimic this so that programmers can extract columns from sources only when they need them.

e.g. When a programmer only needs the following columns: ["case_no", "lat", "long"], the library can give a hint to install / download the datasets that provide those columns.

import phcovid
from phcovid import get_cases

get_cases(columns=["case_no", "lat", "long"])
# raises an error whose message suggests running `phcovid.download("arcgis_dataset")`

# the message tells the user which step to run next
phcovid.download("arcgis_dataset")
get_cases(columns=["case_no", "lat", "long"])
# now returns the cases
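One way the hint could be produced is with a registry that maps each dataset to the columns it provides. The sketch below is an assumption about how this might work, not the library's implementation: `DATASET_COLUMNS`, `MissingDatasetError`, `datasets_for`, the `"default"` dataset, and its column names are all made up for illustration.

```python
# Hypothetical registry: dataset name -> columns it provides.
DATASET_COLUMNS = {
    "default": {"case_no", "age", "sex"},
    "arcgis_dataset": {"case_no", "lat", "long"},
}

# Datasets available locally; only "default" is assumed to ship with the library.
_downloaded = {"default"}


class MissingDatasetError(LookupError):
    """Raised when requested columns need a dataset that is not downloaded."""


def download(name):
    """Mark a dataset as available; a real version would fetch and cache it."""
    if name not in DATASET_COLUMNS:
        raise ValueError(f"unknown dataset: {name!r}")
    _downloaded.add(name)


def datasets_for(columns):
    """Return the downloaded datasets that cover the columns, or raise with a hint."""
    missing = set(columns)
    needed = []
    for name in sorted(_downloaded):
        if DATASET_COLUMNS[name] & missing:
            needed.append(name)
            missing -= DATASET_COLUMNS[name]
    if missing:
        # Suggest datasets that are not downloaded yet but provide the columns.
        hints = sorted(
            name
            for name, cols in DATASET_COLUMNS.items()
            if name not in _downloaded and cols & missing
        )
        if not hints:
            raise KeyError(f"no known dataset provides {sorted(missing)}")
        raise MissingDatasetError(
            f"columns {sorted(missing)} are unavailable; try download({hints[0]!r})"
        )
    return needed
```

A `get_cases(columns=...)` function could call `datasets_for` first and let the error message guide the user to the missing download.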

Note

This may be an issue that we don't need to address right now, as the current processes do not hinder analysis of the data.

enzoampil commented 4 years ago

Thanks for this @mamerisawesome ! I like the vision :) One perspective I have is that we can pattern these after specific applications of the datasets.

E.g. network analysis (which @andrewnyu is working on), simulation modelling, potentially even applying NLP to document data from social media.

For Twitter, I imagine a get_tweets function that returns a corpus of tweets with COVID-related hashtags / keywords. Then, we can apply more modelling on top of this :)
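The filtering half of such a function might look like the sketch below. It assumes the tweets have already been fetched as plain strings; `filter_covid_tweets` and the keyword list are illustrative only, and fetching from Twitter is left out entirely.

```python
import re

# Illustrative keywords only; the thread does not specify a list.
COVID_KEYWORDS = ("covid", "covid19", "coronavirus")


def filter_covid_tweets(tweets, keywords=COVID_KEYWORDS):
    """Keep tweets mentioning any keyword as a word or hashtag, ignoring case."""
    alternatives = "|".join(re.escape(word) for word in keywords)
    # Match the keyword on its own or with a leading '#', not inside a longer word.
    pattern = re.compile(r"(?<!\w)#?(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return [tweet for tweet in tweets if pattern.search(tweet)]
```

The returned list could then serve as the corpus for the NLP work mentioned above.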

mamerisawesome commented 4 years ago

I'd like to add that if we're going the parallel route, this Python module may be of use to us for running asynchronous Python code. The link is here; it comes from a PEP, so it is recommended by Python.
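The linked module is not preserved in this copy of the thread, so the sketch below simply assumes the standard library's `asyncio`; it may not be the module meant here. `extract_all_async` and the `max_concurrent` parameter are hypothetical names. The semaphore caps how many extractions run at once, which ties back to the concern about load spikes.

```python
import asyncio


async def extract_all_async(extractors, max_concurrent=4):
    """Await every extractor coroutine function, at most max_concurrent at a time."""
    limit = asyncio.Semaphore(max_concurrent)

    async def run_one(fn):
        # Hold a semaphore slot for the duration of one extraction.
        async with limit:
            return await fn()

    names = list(extractors)
    # gather() returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*(run_one(extractors[name]) for name in names))
    return dict(zip(names, results))
```

It would be driven from synchronous code with `asyncio.run(extract_all_async(extractors))`, where each value in `extractors` is an `async def` function taking no arguments.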

enzoampil / phcovid

Parallel extraction from various data sources #16
