@bstarling Can you clarify "Can handle volumes provided in background"?
@mkleinert Edited to be more clear. I was referring to this section
To get the conversation started: first, I want to say I would love to help regardless of the technology chosen.
That being said, it sounds like we need something with a very flexible or "schemaless" structure (i.e. schema-on-read), something that is used to being part of an analytics pipeline for both batch and streaming, and something that can deal with JSON natively or near natively. I don't think we need to worry too much about consistency or transactions; let me know if that sounds wrong. My thought at this point would be to avoid relational and graph databases for now and focus on document databases or other NoSQL stores, meaning not simply a filesystem like Hadoop or S3.
That means DynamoDB, HBase, or Mongo kinds of solutions, maybe even Cassandra (though it's more of a flexible columnar store). I believe all of these are offered by Amazon (except maybe Mongo).
Would be good to get some feedback from others who have worked with the data or have a vision for how people within our org (or outside) may use or want to use the data. Tagging in @gati @wwymak @hadoopjax @nataliaking @rachelanddata @zacherybohon to see if they have more to add.
I'm not a data engineer by trade, but here are some opinions from my work:
I think we are all leaning towards a NoSQL solution, but I don't want to put words in anyone's mouth; that would be my opinion given the requirements. As much as I am not a big fan of Mongo, it may be the best solution. I don't have any exposure to DynamoDB, but it sounds like an equally good option. My only hesitation with Dynamo is that you would be tying the solution to AWS.
Also, I am guessing we are looking for something that could possibly support an unknown number and type of use cases. Mongo seems to have become a bit of a standard, and most tools and systems support some type of Mongo connector or integration. But if we do have a few specific use cases, it may be good to outline them now to see whether the solution easily supports them.
My vote would be for MongoDB. I think it's important to also look at future use cases. Many use Hadoop for cheap storage/archiving as well as for any kind of ETL/batch processing, so if there's going to be any of that in the future, I would strongly recommend Hadoop alongside a MongoDB instance. Those familiar with SQL may also much prefer Hadoop's Pig to Mongo's query language (especially for joins), so that could come in handy.
Not a final decision, but who has experience installing/configuring MongoDB? If we can get a group together, I think it makes sense to do a trial run with a subset of the data. I can provide the AWS instance, just let me know what you need.
From this conversation I also gather we should keep the historic raw files somewhere for future use/loading into other tools. Given how cheap S3 (or even Glacier) is, I think this is definitely doable.
Great input everyone. Let's keep the discussion going.
FWIW, I played around with DynamoDB briefly. While it is nice that AWS manages everything, it does look like it could get expensive pretty fast. Probably overkill for our current requirements.
I've installed and configured MongoDB on an Ubuntu VPS before, and configured SSH from a local box into that Mongo instance to run MongoChef, which is a nice GUI client that makes querying easier for users. I'm not anywhere close to an expert, though, so I could definitely use some help, but I'm personally willing to give it a shot! :) Are you all planning to run this on a server, or what would the preferred platform be?
Great, thanks @rachelanddata! Anyone else willing to lend a hand? @wwymak perhaps? I had assumed we would run it on an AWS EC2 instance, but are there better options?
Yeah, I am more than happy to help set up and populate the relevant MongoDBs :) If our pipeline is mostly in AWS, then it makes sense to use AWS EC2. (We could do something like use a Lambda function trigger to write to the database every time new data arrives; a rough sketch is below.) There are other options, e.g. mongolabs, if we really don't want to use AWS ;)
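A minimal sketch of what that Lambda trigger could look like, assuming an S3 `ObjectCreated` trigger, a `MONGO_URI` environment variable, and placeholder database/collection names (pymongo would have to be packaged with the function):

```python
import json
import os

import boto3
from pymongo import MongoClient

# MONGO_URI is an assumed environment variable pointing at our Mongo instance.
MONGO_URI = os.environ["MONGO_URI"]

s3 = boto3.client("s3")
collection = MongoClient(MONGO_URI)["far_right"]["articles"]  # placeholder names


def handler(event, context):
    """Fires on an S3 ObjectCreated event and loads the new JSON file into Mongo."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        docs = json.loads(body)
        # Files may hold either a single object or a list of objects.
        collection.insert_many(docs if isinstance(docs, list) else [docs])
```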
While I'm not a big fan of Mongo for production apps (although I happen to be a MongoDB Certified Developer :smile: ), I think this is a pretty good use case for it. It would also allow us to switch hosts if we ever get some donated infrastructure or something like that. Not sure what my other time commitments will allow, but I'd like to help as much as possible.
I'm at Disney on vacation until Tuesday so please don't let me hold anything up if you want to get this started right away, I'll catch up!
How do we want to go about this? Maybe @bstarling can set up an EC2 instance, we set up MongoDB on it, and we build a parser that saves one of the #far-right JSON files to Mongo? And also set up e.g. authentication? Then once we have a proof of concept, we build a pipeline that automatically streams new data into the DB?
I did a very rough proof of concept over in this branch in the discursive repo. It runs index_twitter_stream, which writes the file to S3/ES, and also dumps it into a MongoDB. My particular instance was a sandboxed env hosted in AWS but managed by mLab; they offer free 0.5 GB sandboxes. As you can see below (hopefully), it looks like it parses the tweet fairly well, even getting the types right, I think (double check). I think maybe the next step is to write some kind of ETL out of S3 and into Mongo?
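That S3-to-Mongo step could start as a simple batch loader along these lines; the bucket, prefix, connection string, and database/collection names are placeholders, and it assumes each S3 object is a JSON file holding either one object or a list of objects:

```python
import json

import boto3
from pymongo import MongoClient

BUCKET = "d4d-far-right"                 # placeholder bucket name
PREFIX = "tweets/"                       # placeholder key prefix
MONGO_URI = "mongodb://localhost:27017"  # point at the EC2 / mLab instance

s3 = boto3.resource("s3")
collection = MongoClient(MONGO_URI)["far_right"]["tweets"]  # placeholder names


def load_prefix(bucket_name, prefix):
    """Batch-load every JSON object under an S3 prefix into MongoDB."""
    for obj in s3.Bucket(bucket_name).objects.filter(Prefix=prefix):
        docs = json.loads(obj.get()["Body"].read())
        collection.insert_many(docs if isinstance(docs, list) else [docs])


if __name__ == "__main__":
    load_prefix(BUCKET, PREFIX)
```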
Are we going with just dumping everything from the tweet model into Mongo? If so, then I think what we should be saving to S3 is the whole tweet object rather than the model?
Or are we doing any extra parsing on top of the tweet model before saving to Mongo?
In any case, @bstarling has set up an EC2 instance and I have got Mongo set up on the default port on it. Let @bstarling know so he can give SSH access and we can test out various processes?
Re the tweet model, I still think it's best to let the user define the data they want to save. For most purposes the main fields are fine; I think the only projects that would be using the d4d data store are the defined long-term collection efforts. Ad hoc exploratory gathering can still go to CSV/JSON/SQLite etc.
Separately, as part of the master plan (evil laugh) we are working on a configurable option which will dump the entire tweet to a local/s3 file.
@bstarling should we add pipeline code for Mongo in the scraping code? Also, should the solution be something like Docker container(s) or Puppet/Chef scripts? That way it can take some of the work off of any users who need to get this up and running.
For the spiders, we'll need more central infrastructure. We should get to a point where we have deploy scripts, but I don't know if we need to make it so friendly that anyone can create an instance. We definitely have the option of piping stuff straight into Mongo using pipelines (I've already done it with DynamoDB); see the sketch below. Whether we go that route or batch/file load later is still TBD.
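Assuming the spiders are Scrapy-based (the mention of pipelines suggests item pipelines), a minimal MongoDB pipeline might look like the following; the `MONGO_URI`/`MONGO_DATABASE` setting names and the collection name are placeholders:

```python
import pymongo


class MongoPipeline:
    """Item pipeline that writes each scraped item straight into MongoDB."""

    collection_name = "articles"  # placeholder collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are assumed settings defined in settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "far_right"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

It would then just need to be enabled via `ITEM_PIPELINES` in the project settings.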
Hey guys, @gati pinged me on this, so I thought I would chime in a bit; hopefully it's helpful.
First and foremost I am excited to help out with the project in any capacity. Good work this.
My company (http://www.eventador.io/) is happy to donate a production-quality cluster to the cause. I suspect it will more than handle archiving/querying as well as provide the easy real-time capabilities mentioned above. We have notebooks built in and a SQL backend (or MongoDB or Elastic if you want).
To be quite honest, we would love to partner so we can get more real-world use cases and you guys can stretch our thinking and make us better at what we do. We are passionate about data, and this project is awesome. Hopefully win/win.
We are a real-time data processing platform, which, by design, is a superset of simply storing data. To start, you can immediately begin putting data in the front door in JSON format, and it will flow into PostgreSQL/PipelineDB. If you really wanted MongoDB, we could make that happen; we have a history with it ;-). We have built-in Jupyter notebooks as well. Also, it's secured via IP whitelist, so no open-door MongoDB mistakes. We are currently adding Apache Storm compatibility to our platform, so you guys can use that too.
That said, if the relationship didn't work out, the data is in MongoDB format (or whatever) and you can simply move it somewhere else. The project owns the data, as any customer would.
So if you guys think it's a fit then we are game. If not, no worries, I still want to help!
Taking @kgorman up on this awesome offer. Join us in #eventador if you'd like to participate.
Closing this as it looks like Eventador is our path forward.
Goal:
Consolidate the article and text data gathered from various websites into a single data store. This supports work being done by the #far-right team.
First Step:
Agree on an appropriate data store. If you are familiar with a specific tool AND are willing to help us get started, please post pros/cons of how that tool may handle our requirements.
Background:
Community members have collected or donated article text data from various online communities and news sources. The chosen storage should be flexible enough to allow the data model to change over time but structured enough to enable analysts to search data across multiple sources. New data is uploaded to S3 daily.
Examples of data we need to store:
Some basic cleaning/standardization has already been done. Current data is in the format below and stored in JSON files on S3.
CURRENT data model
Required
- language: language of the text
- url: the url of an article
- text_blob: body of article/text
- source: source website

Optional/If exists

- authors: the authors of a particular article
- pub_date: date the article was published. (Format: YYYY-MM-DD)
- pub_time: time the article was published. Time should be stored in UTC if possible. (Format: HH:MM:SSZ)
- title: the headline of a page/article/news item
- lead: opening paragraph, initial bolded text or summary
- hrefs: list of hrefs extracted from article or text
- meta: Non-standard field. This field contains data specific to the source. May contain embedded json objects. Analysts should make sure they understand the data model used before relying on this field, as it may differ across sources.
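For illustration only, a hypothetical record in this model (every value below is made up) could be inserted like so:

```python
from pymongo import MongoClient

# Hypothetical example record; all values are invented for illustration.
article = {
    "language": "en",
    "url": "http://example.com/some-article",
    "text_blob": "Full body of the article or text...",
    "source": "example.com",
    "authors": ["Jane Doe"],
    "pub_date": "2017-01-15",
    "pub_time": "14:30:00Z",
    "title": "Example headline",
    "lead": "Opening paragraph or summary.",
    "hrefs": ["http://example.com/related-article"],
    "meta": {"section": "politics"},  # source-specific; shape may vary by source
}

# Placeholder connection string and db/collection names.
MongoClient("mongodb://localhost:27017")["far_right"]["articles"].insert_one(article)
```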