Data4Democracy / discursive

Twitter topic search and indexing with Elasticsearch

Configurable storage options #13

Closed bstarling closed 6 years ago

bstarling commented 7 years ago

In order to generalize the tool and make it useful to the widest audience, I think it would be beneficial to support multiple options for saving tweets, with the ability to configure one or more of them. To accomplish this, I think we should refactor the storage-related work into a separate process that uses config to determine how and where to save each tweet. If we agree on this approach, we can open new issues for each backend we want to add.
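
To make the idea concrete, here is a rough sketch of the kind of backend interface I have in mind. The class names and details below are placeholders, not code from the repo:

```python
# Rough sketch only -- class names and layout are placeholders, not code from the repo.
import json
from abc import ABC, abstractmethod

from elasticsearch import Elasticsearch  # assumed dependency (we already index to ES)


class TweetStore(ABC):
    """Interface each configurable storage backend would implement."""

    @abstractmethod
    def save(self, tweet):
        """Persist a single tweet (a dict decoded from the Twitter API)."""


class ESStore(TweetStore):
    """Write tweets to an Elasticsearch index."""

    def __init__(self, hosts, index="tweets"):
        self.client = Elasticsearch(hosts)
        self.index = index

    def save(self, tweet):
        self.client.index(index=self.index, body=tweet)


class LocalJSONStore(TweetStore):
    """Append tweets to a newline-delimited JSON file on disk."""

    def __init__(self, path):
        self.path = path

    def save(self, tweet):
        with open(self.path, "a") as f:
            f.write(json.dumps(tweet) + "\n")
```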

For starters:

hadoopjax commented 7 years ago

I really like the idea. I think it casts a wide net in terms of capturing a variety of ways for researchers to get their hands on the data. And since we'd enable several of these storage options (like the ones above), we'll likely have the flexibility to add others if a contributor comes along and wants something else (e.g., Redshift).

+1 from me

bstarling commented 7 years ago

Started playing around with this. It's still in the very early stages, but the first step was to start moving all of the config/logic into separate processes. I got it working on Python 3 but probably broke Python 2.

Eventually I'm thinking a process like get_backend will take the user settings/CLI arguments and initialize a backend from helpers in /backends.
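
Roughly what I'm picturing for the dispatch; aside from the get_backend name and the /backends layout, the module paths, class names, and config keys below are placeholders:

```python
# Rough sketch of the dispatch; everything except get_backend and /backends is a placeholder.
# Assumes helper modules under /backends expose storage classes like those sketched earlier.
from backends.es import ESStore
from backends.local import LocalJSONStore


def get_backend(config):
    """Build a storage backend from the merged user settings / CLI arguments."""
    name = config.get("backend", "es")
    if name == "es":
        return ESStore(hosts=config["es_hosts"], index=config.get("es_index", "tweets"))
    if name == "local":
        return LocalJSONStore(path=config.get("output_path", "tweets.json"))
    raise ValueError("Unsupported backend: {}".format(name))
```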

Thoughts? @hadoopjax @ASRagab

Note: I temporarily switched back to a hardcoded config just to make setup/debugging easier for me (I'm new to using Kibana).

https://github.com/Data4Democracy/discursive/commit/3c0adaa1ecf302d38ed3f7a53ac7768228c408f2

ASRagab commented 7 years ago

On that note, what should we do about the Python 2 vs. 3 question?

I am not terribly familiar with the Python ecosystem; are there important data science, machine learning, or utility libraries that still have not been updated to Python 3?

Are we worried about newcomers on systems where Python 2 is still the default (e.g., macOS) who don't want to add the complexity of using virtualenv or Anaconda?

I like the approach in your commit so far. Because this ties in so closely with the CLI task, I think we should do things in the following order (a rough sketch of the CLI plumbing follows the list):

  1. First have the index scripts automatically read config.py and choose a backend
  2. Create CLI plumbing with one argument: --config /path/to/config/file
  3. Create a more robust CLI with individual arguments: --backend [s3, es, local] --output [csv, json, sqlite] --credentials /path/to/credentials/file
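
For steps 2 and 3, the plumbing could be as simple as the sketch below; the flag names mirror the list above, while the defaults and wiring are just guesses:

```python
# Rough argparse sketch for steps 2 and 3; flag names mirror the list above,
# defaults and downstream wiring are placeholders.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Index tweets to a configurable storage backend")
    parser.add_argument("--config", help="path to a config file (step 2)")
    parser.add_argument("--backend", choices=["s3", "es", "local"],
                        help="where to store tweets (step 3)")
    parser.add_argument("--output", choices=["csv", "json", "sqlite"],
                        help="output format for file-based backends (step 3)")
    parser.add_argument("--credentials", help="path to a credentials file (step 3)")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Prints the parsed arguments, e.g. with --backend es:
    # {'config': None, 'backend': 'es', 'output': None, 'credentials': None}
    print(vars(parse_args()))
```
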
bstarling commented 7 years ago

Spoke with @hadoopjax; I think we want to move forward with the Python 3 conversion.

Order of tasks looks good to me. Will keep working in this direction.