Data4Democracy / discursive

Twitter topic search and indexing with Elasticsearch

Migrate to Kinesis & Lambda (serverless) #11

Closed by hadoopjax 7 years ago

hadoopjax commented 7 years ago

In conversation with @ASRagab, @bstarling and @nataliaking over the past several weeks, we've contemplated migrating the Discursive application to a 'serverless' architecture supported by Kinesis and Lambda. In terms of desired functionality (delivering Twitter data to researchers), the master branch works just fine, though it requires a level of infrastructure expertise our research colleagues may not possess. To that end, we would like to gather community feedback on whether this migration to a serverless architecture is something we should pursue. Please provide feedback if you have it; we'd love to hear from you!

nix-bohon commented 7 years ago

I like it. Is there a reason to prefer AWS Kinesis over Apache Kafka?

wwymak commented 7 years ago

I think it's a good idea -- being able to run it in a serverless architecture means that people who just want an easy way to get the data have less of a hurdle to overcome. As for @zacherybohon's comment about Kinesis vs Kafka -- if you use Kafka you'd have to manage your own infrastructure, whereas the point of Kinesis is that all of that faff of making sure everything scales etc. is managed for you ;)
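To make the "managed for you" point concrete, here is a minimal sketch of what the producer side might look like with Kinesis. It assumes a boto3 Kinesis client is passed in (`put_record` with `StreamName`, `Data` and `PartitionKey` is the boto3 call); the function name `send_tweet` and the use of the tweet's `id` as partition key are illustrative assumptions, not something Discursive does today.

```python
import json


def send_tweet(kinesis_client, stream_name, tweet):
    """Write one tweet (a dict) to a Kinesis stream and return the response.

    `kinesis_client` is assumed to be a boto3 Kinesis client
    (or anything else exposing the same put_record method).
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(tweet).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        PartitionKey=str(tweet["id"]),
    )
```

With boto3 installed this would be called as `send_tweet(boto3.client("kinesis"), "discursive-tweets", tweet)`, where the stream name is a placeholder. There are no brokers to run; scaling is a matter of the stream's shard settings.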

nataliaking commented 7 years ago

I have never used Lambda with AWS Elasticsearch service but taking a peek at docs, it is doable (http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html)
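As a rough idea of what the Lambda side could look like, the sketch below turns a Kinesis-triggered event into a newline-delimited body for the Elasticsearch bulk API. The event shape (`Records[i]["kinesis"]["data"]`, base64-encoded) is what Kinesis hands to Lambda; the index name `tweets`, the helper name `records_to_bulk_body`, and the assumption that each record is a JSON tweet with an `id` field are mine. Depending on the Elasticsearch version, the action line may also need a `_type`.

```python
import base64
import json


def records_to_bulk_body(event, index_name="tweets"):
    """Build an Elasticsearch bulk body from a Kinesis-triggered Lambda event."""
    lines = []
    for record in event.get("Records", []):
        # Kinesis delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        tweet = json.loads(payload.decode("utf-8"))
        # One action line, then one document line, per tweet.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(tweet["id"])}}))
        lines.append(json.dumps(tweet))
    if not lines:
        return ""
    # The bulk API expects newline-delimited JSON with a trailing newline.
    return "\n".join(lines) + "\n"


def handler(event, context):
    body = records_to_bulk_body(event)
    # Sending `body` to the domain's bulk endpoint (with request signing)
    # is deliberately left out of this sketch.
    return {"documents": len(event.get("Records", []))}
```

Keeping the decoding separate from the HTTP call means the record handling can be tried locally without any AWS resources.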

Also agree with Kinesis idea 👍

I just came back from vacation the other day but getting back into work and stuff now! Sorry for lack of activity recently. Will catch up.

nataliaking commented 7 years ago

One important thing I wanted to note now is in AWS Lambda, there are some limitations with using Python that will affect us.

bstarling commented 7 years ago

Does anyone have experience configuring this in AWS? I was able to get a proof of concept working on Lambda, but I was having issues accessing both the public internet (Twitter API) and S3 within the same Lambda process. Some googling turned up that this is probably related to VPC configuration, but I didn't get much past that.

I could have a Lambda function return the tweets as a JSON blob, or I could have a Lambda function save a random blob to S3, but when I tried to combine the two I always received a timeout error.

I'm very new to Lambda, but I found this little package pretty nice for bundling and shipping your Lambda functions.

bstarling commented 7 years ago

Ugh, I knew it was something simple. @wwymak helped me out: my function was running out of memory. Toy example now working. Note: I was experimenting to get familiar with Lambda, not trying to claim this issue :)
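Since the timeout here turned out to be a memory problem, a small diagnostic like the one below might save someone else the same detour. It is only a sketch: `context.memory_limit_in_mb` and `context.get_remaining_time_in_millis()` come from the Lambda Python context object, while `resource_report` is a made-up helper name and the placement of the call is an assumption.

```python
import resource  # Unix only; Lambda runs on Linux


def resource_report(context):
    """Summarize memory and time headroom for the current invocation."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "memory_limit_mb": int(context.memory_limit_in_mb),
        # On Linux, ru_maxrss is reported in kilobytes.
        "peak_rss_mb": usage.ru_maxrss / 1024.0,
        "remaining_ms": context.get_remaining_time_in_millis(),
    }


def handler(event, context):
    # ... fetch tweets and write them to S3 here ...
    report = resource_report(context)
    print(report)  # stdout ends up in the function's CloudWatch log
    return report
```

If `peak_rss_mb` sits close to `memory_limit_mb`, raising the function's memory setting is the first thing to try.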

ASRagab commented 7 years ago

@nataliaking thanks for sharing, and yeah, it's never too much! So, as noted this is a limitation in part due to Python, and part due to the target architecture that the interpreter is running on locally. Is there something unique about your setup that created this architecture mismatch, or is the general case that an EC2 instance is required?

Is there any advantage to moving to another language (e.g. Java or C#), or any appetite?

Also, I guess I want to understand the use cases a bit more that we think Kinesis/Serverless might solve, and which if any would we be interested in targeting:

  1. I as a backend/infra engineer can create new data pipelines for downstream consumption quickly and in a scalable way
  2. I as a curation specialist can kick off new data searches/streams without trifling with git repos, S3 buckets, etc.
  3. I as a data analyst can read available data streams and choose to deposit data locally or analyze the stream as such (a la PySpark)
  4. Others?

bstarling commented 7 years ago

Regarding the architecture: it has to do with the platform the C extensions are compiled for. If you use a library with a C extension, compile it on a Mac, and then try to run it on AWS Linux, that's what causes the issue mentioned. Currently our code does not require any C extensions. I was able to package and upload directly without issue.
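One rough way to check whether a deployment package is affected is to look for compiled files in it before uploading. The sketch below walks a build directory and lists anything that looks like a compiled extension; the helper name `find_compiled_extensions` and the suffix list are assumptions, and versioned shared libraries (for example `libfoo.so.1`) would slip past this simple check.

```python
import os

# File suffixes that usually indicate platform-specific compiled code.
COMPILED_SUFFIXES = (".so", ".pyd", ".dylib")


def find_compiled_extensions(package_dir):
    """Return relative paths of compiled files found under package_dir."""
    found = []
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            if name.endswith(COMPILED_SUFFIXES):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, package_dir))
    return sorted(found)
```

An empty result suggests the package is pure Python and can be zipped on any machine; anything listed would need to be built on a Linux environment matching Lambda's.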

RE: Language: I think most people who would be working with these tools (and potentially interested in contributing) are familiar with Python, which is why I would suggest we stick with it if possible. If there is a proposal to build pieces in another language that we simply cannot or would not want to implement in Python, we would just want to make sure that 1) we have people familiar enough with the chosen language to build it and 2) it does not impact our usability, since for the most part Python seems to be the least common denominator for this sort of thing.

RE: Infra, IMO: 1) Thumbs up! 2) Hopefully one day. As an incremental first step, I think we should shoot for "without requiring intimate knowledge of AWS", i.e. an easy process for adding streams to existing infrastructure with tutorials, deploy scripts, etc. 3) Sounds reasonable.

A little off topic from the actual issue, but in my mind there should be a range of engaging

Maybe these are completely separate but I think we can work together for now.

metame commented 7 years ago

So while poking around in the Lambda docs, it turns out that for now it only works with Python 2.7. Might be another reason to hold off on a Python 3 refactor for the project.

zachmueller commented 7 years ago

Just randomly jumping in here, but I've done a good amount of work with Lambda (even using libraries with C extensions, specifically pandas/numpy/scipy), so I could help answer questions on that front. I've also briefly dabbled in linking Kinesis and Lambda. That said, I don't have much bandwidth to build this out myself, so I'm limited to answering specific questions. Overall I think it'd be worth trying out.

hadoopjax commented 7 years ago

@metame - totally forgot I ran into that the other day while hacking together a prototype. Great point.

@zachmueller cool, thanks!

Later today I'm going to create a discursive-serverless branch and dump what I have out there for us all to poke around on (if nothing else, it's a starting point).

alexcasalboni commented 7 years ago

Hi everyone,

just adding a few thoughts to the conversation.

AWS Lambda:

Amazon Kinesis:

hadoopjax commented 7 years ago

Thanks everyone for reaching out on this issue; I think it generated a lot of great ideas. As we decompose Discursive components into other repos (Assemble, Collect-Social), we'll migrate our infrastructure discussions into those spaces.