Closed bstarling closed 6 years ago
I really like the idea. I think it casts a wide net in terms of capturing a variety of ways for researchers to get their hands on the data. And, since we'd enable several of these storage options (like above), we'll likely have the flexibility to add others if a contributor comes along and wants to add something else (e.g. Redshift or something).
+1 from me
Started playing around with this. Still in the very early stages but first step was to start moving all the config/logic to separate processes. Got it working on python3 but probably broke python2.
Eventually I'm thinking a process like get_backend will take the user settings/CLI arguments and initiate a backend from helpers in /backends
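For illustration, here is a rough sketch of how a get_backend dispatcher along these lines might look. Only the get_backend name comes from this thread. The backend classes, the settings keys ("backend", "path") and the BACKENDS registry are hypothetical placeholders, not code from the repo.

```python
import json


class LocalJsonBackend:
    """Hypothetical backend: appends tweets to a JSON-lines file on disk."""

    def __init__(self, path):
        self.path = path

    def save(self, tweets):
        # One JSON object per line, so repeated runs can simply append.
        count = 0
        with open(self.path, "a") as fh:
            for tweet in tweets:
                fh.write(json.dumps(tweet) + "\n")
                count += 1
        return count


class StdoutBackend:
    """Hypothetical backend: prints tweets, handy while debugging."""

    def save(self, tweets):
        count = 0
        for tweet in tweets:
            print(json.dumps(tweet))
            count += 1
        return count


# Assumed registry: backend name -> factory that takes the settings dict.
# In the layout discussed here, these factories would live under /backends.
BACKENDS = {
    "local": lambda settings: LocalJsonBackend(settings["path"]),
    "stdout": lambda settings: StdoutBackend(),
}


def get_backend(settings):
    """Build the backend named in settings (falls back to stdout)."""
    name = settings.get("backend", "stdout")
    if name not in BACKENDS:
        raise ValueError(
            "Unknown backend %r, expected one of %s" % (name, sorted(BACKENDS))
        )
    return BACKENDS[name](settings)
```

With a registry like this, adding a new backend would mean writing one class with a save method and adding one entry to BACKENDS. The calling code would only ever see get_backend(settings).save(tweets).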
Thoughts? @hadoopjax @ASRagab
Note: I temporarily switched back to hardcoded config just to make setup/debugging easier for me (new to using Kibana)
https://github.com/Data4Democracy/discursive/commit/3c0adaa1ecf302d38ed3f7a53ac7768228c408f2
On that note, what to do about this python 2 vs 3 question?
I am not terribly familiar with the python ecosystem. Are there still important data science, machine learning, or utility libraries that have not been updated to python3?
Are we worried about newcomers on systems where python2 is still the default (e.g. macOS) who don't want to add the complexity of using virtualenv or anaconda and such?
I like the approach so far in your commit. Because this ties in so closely with the CLI task, I think we should do things in the following order
config.py
and choose backend
--config /path/to/config/file --backend [s3, es, local] --output [csv, json, sqlite] --credentials /path/to/credentials/file
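As a rough illustration, the flags proposed above could be wired up with argparse roughly like this. The flag names and choices are taken from the comment above (with the SQLite option spelled sqlite). The defaults, help text and the build_parser name are assumptions made for the sketch.

```python
import argparse


def build_parser():
    """Build a parser for the CLI flags proposed in this thread."""
    parser = argparse.ArgumentParser(
        description="Collect tweets and store them in a configurable backend."
    )
    parser.add_argument(
        "--config", metavar="PATH", help="path to a config file"
    )
    parser.add_argument(
        "--backend",
        choices=["s3", "es", "local"],
        default="local",  # assumed default, not decided in the thread
        help="where to store tweets",
    )
    parser.add_argument(
        "--output",
        choices=["csv", "json", "sqlite"],
        default="json",  # assumed default, not decided in the thread
        help="output format",
    )
    parser.add_argument(
        "--credentials", metavar="PATH", help="path to a credentials file"
    )
    return parser
```

Calling build_parser().parse_args() at the entry point would then return a namespace with config, backend, output and credentials attributes. Using choices means argparse rejects an unsupported backend or format before any storage code runs.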
Spoke with @hadoopjax, I think we want to move forward with python3 conversion.
Order of tasks looks good to me. Will keep working in this direction.
To generalize the tool and make it useful to the widest audience, I think it would be beneficial to support multiple options for saving tweets, with the ability to configure one or more of them. To accomplish this, I think we should refactor the storage-related work into a separate process that uses config to determine how/where to save tweets. If we agree on this approach, we can open new issues for each back end we want to add.
For starters: