@bstarling Can you clarify "Can handle volumes provided in background"?
@mkleinert Edited to be more clear. I was referring to this section
To get the conversation started: first, I want to say I would love to help regardless of the technology chosen.
That being said, it sounds like we need something with a very flexible or "schemaless" structure (i.e. schema-on-read), something that is used to being part of an analytics pipeline for both batch and streaming, and something that can deal with JSON natively or near natively. I don't think we need to worry too much about consistency or transactions; let me know if that sounds wrong. My thought at this point would be to avoid relational and graph databases for now and focus on document databases or other NoSQL stores, meaning not simply a filesystem like Hadoop or S3.
That means DynamoDB, HBase, or Mongo kinds of solutions, maybe even Cassandra (though it's more of a flexible columnar store). I believe all of these are offered by Amazon (except maybe Mongo).
Would be good to get some feedback from others who have worked with the data or have a vision for how people within our org (or outside) may use or want to use the data. Tagging in @gati @wwymak @hadoopjax @nataliaking @rachelanddata @zacherybohon to see if they have more to add.
I'm not a data engineer by trade, but here are some opinions from my work:
I think we are all leaning towards a NoSQL solution, but I don't want to put words in anyone's mouth; that would be my opinion given the requirements. As much as I am not a big fan of Mongo, it may be the best solution. I don't have any exposure to DynamoDB, but it sounds like an equally good option. My only hesitation with Dynamo is that you would be tying the solution to AWS.
Also, I am guessing we are looking for something that could possibly support an unknown number and type of use cases. Mongo seems to have become a bit of a standard, and most tools and systems support some type of Mongo connector or integration. But if we do have a few specific use cases, it may be good to outline them now to see whether the solution easily supports them.
My vote would be for MongoDB. I think it's important to also look at future use cases. Many use Hadoop for cheap storage/archiving as well as for any kind of ETL/batch processing, so if there's going to be any of that in the future, I would strongly recommend Hadoop alongside a MongoDB instance. Those familiar with SQL may also much prefer Hadoop's Pig to Mongo's query language (especially for joins), so that could come in handy.
Not a final decision, but who has experience installing/configuring MongoDB? If we can get a group together, I think it makes sense to do a trial run with a subset of the data. I can provide the AWS instance, just let me know what you need.
From this conversation I also gather we should keep the historic raw files somewhere for future use/loading into other tools. Given how cheap S3 (or even Glacier) is, I think this is definitely doable.
Great input everyone. Let's keep the discussion going.
FWIW, I played around with DynamoDB briefly. While it is nice that AWS manages everything, it does look like it could get expensive pretty fast. Probably overkill for our current requirements.
I've installed and configured MongoDB on an Ubuntu VPS before, and configured SSH from a local box into that Mongo instance to run MongoChef, which is a nice GUI client that makes querying easier for users. I'm not anywhere close to an expert, though, so I could definitely use some help, but I'm personally willing to give it a shot! :) Are you all planning to run this on a server, or what would the preferred platform be?
Great, thanks @rachelanddata! Anyone else willing to lend a hand? @wwymak perhaps? I had assumed we would run it on an AWS EC2 instance, but are there better options?
Yeah, I am more than happy to help set up and populate the relevant MongoDBs :) If our pipeline is mostly in AWS, then it makes sense to use AWS EC2. (We could do something like use a Lambda function trigger to write to the database every time new data arrives; a rough sketch is below.) There are other options, e.g. mongolabs, if we really don't want to use AWS ;)
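A minimal sketch of what that Lambda trigger could look like, assuming an S3 `ObjectCreated` trigger, a `MONGO_URI` environment variable, and placeholder database/collection names (pymongo would have to be packaged with the function):

```python
import json
import os

import boto3
from pymongo import MongoClient

# MONGO_URI is an assumed environment variable pointing at our Mongo instance.
MONGO_URI = os.environ["MONGO_URI"]

s3 = boto3.client("s3")
collection = MongoClient(MONGO_URI)["far_right"]["articles"]  # placeholder names


def handler(event, context):
    """Fires on an S3 ObjectCreated event and loads the new JSON file into Mongo."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        docs = json.loads(body)
        # Files may hold either a single object or a list of objects.
        collection.insert_many(docs if isinstance(docs, list) else [docs])
```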
While I'm not a big fan of Mongo for production apps (although I happen to be a MongoDB Certified Developer :smile: ), I think this is a pretty good use case for it. It would also allow us to switch hosts if we ever get some donated infrastructure or something like that. Not sure what my other time commitments will allow, but I'd like to help as much as possible.
I'm at Disney on vacation until Tuesday so please don't let me hold anything up if you want to get this started right away, I'll catch up!
How do we want to go about this? Maybe @bstarling can set up an EC2 instance, we set up MongoDB on it, and we build a parser that saves one of the #far-right JSON files to Mongo? And also set up e.g. authentication? Then once we have a proof of concept, we build a pipeline that automatically streams new data into the DB?
I did a very rough proof of concept over in this branch in the discursive repo. It runs index_twitter_stream, which writes the file to S3/ES, and also dumps it into a MongoDB. My particular instance was a sandboxed env hosted in AWS but managed by mLab; they offer free 0.5 GB sandboxes. As you can see below (hopefully), it looks like it parses the tweet fairly well, even getting the types right, I think (double check). I think maybe the next step is to write some kind of ETL out of S3 and into Mongo?
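That S3-to-Mongo step could start as a simple batch loader along these lines; the bucket, prefix, connection string, and database/collection names are placeholders, and it assumes each S3 object is a JSON file holding either one object or a list of objects:

```python
import json

import boto3
from pymongo import MongoClient

BUCKET = "d4d-far-right"                 # placeholder bucket name
PREFIX = "tweets/"                       # placeholder key prefix
MONGO_URI = "mongodb://localhost:27017"  # point at the EC2 / mLab instance

s3 = boto3.resource("s3")
collection = MongoClient(MONGO_URI)["far_right"]["tweets"]  # placeholder names


def load_prefix(bucket_name, prefix):
    """Batch-load every JSON object under an S3 prefix into MongoDB."""
    for obj in s3.Bucket(bucket_name).objects.filter(Prefix=prefix):
        docs = json.loads(obj.get()["Body"].read())
        collection.insert_many(docs if isinstance(docs, list) else [docs])


if __name__ == "__main__":
    load_prefix(BUCKET, PREFIX)
```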
Are we going with just dumping everything from the tweet model into Mongo? If so, then I think what we should be saving to S3 is the whole tweet object rather than the model?
Or are we doing any extra parsing on top of the tweet model before saving to Mongo?
In any case, @bstarling has set up an EC2 instance and I have got Mongo set up on the default port on it. Let @bstarling know so he can give SSH access and we can test out various processes?
Re the tweet model, I still think it's best to let the user define the data they want to save. For most purposes the main fields are fine; I think the only projects that would be using the d4d data store are the defined long-term collection efforts. Ad hoc exploratory gathering can still go to CSV/JSON/SQLite etc.
Separately, as part of the master plan (evil laugh) we are working on a configurable option which will dump the entire tweet to a local/s3 file.
@bstarling should we add pipeline code for Mongo in the scraping code? Also, should the solution be something like Docker container(s) or Puppet/Chef scripts? That way it can take some of the work off of any users who need to get this up and running.
For the spiders, we'll need more central infrastructure. We should get to a point where we have deploy scripts, but I don't know if we need to make it so friendly that anyone can create an instance. We definitely have the option of piping stuff straight into Mongo using pipelines (I've already done it with DynamoDB); see the sketch below. Whether we go that route or batch/file load later is still TBD.
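Assuming the spiders are Scrapy-based (the mention of pipelines suggests item pipelines), a minimal MongoDB pipeline might look like the following; the `MONGO_URI`/`MONGO_DATABASE` setting names and the collection name are placeholders:

```python
import pymongo


class MongoPipeline:
    """Item pipeline that writes each scraped item straight into MongoDB."""

    collection_name = "articles"  # placeholder collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are assumed settings defined in settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "far_right"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

It would then just need to be enabled via `ITEM_PIPELINES` in the project settings.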
Hey guys, @gati pinged me on this, so I thought I would chime in a bit; hopefully it's helpful.
First and foremost I am excited to help out with the project in any capacity. Good work this.
My company (http://www.eventador.io/) is happy to donate a production-quality cluster to the cause. I suspect it will more than handle archiving/querying as well as provide the easy real-time capabilities mentioned above. We have notebooks built in and a SQL backend (or MongoDB or Elastic if you want).
To be quite honest, we would love to partner so we can get more real-world use cases and you guys can stretch our thinking and make us better at what we do. We are passionate about data, and this project is awesome. Hopefully win/win.
We are a real-time data processing platform, which, by design, is a superset of simply storing data. To start, you can immediately begin putting data in the front door in JSON format, and it will flow into PostgreSQL/PipelineDB. If you really wanted MongoDB, we could make that happen; we have a history with it ;-). We have built-in Jupyter notebooks as well. Also, it's secured via IP whitelist, so no open-door MongoDB mistakes. We are currently adding Apache Storm compatibility to our platform, so you guys can use that too.
That said, if the relationship didn't work out, the data is in MongoDB format (or whatever) and you can simply move it somewhere else. The project owns the data, as any customer would.
So if you guys think it's a fit then we are game. If not, no worries, I still want to help!
Taking @kgorman up on this awesome offer. Join us in #eventador if you'd like to participate.
Closing this as it looks like Eventador is our path forward.
Goal:
Consolidate the article and text data gathered from various websites into a single data store. This supports work being done by the #far-right team.
First Step:
Agree on an appropriate data store. If you are familiar with a specific tool AND are willing to help us get started, please post pros/cons of how that tool may handle our requirements.
Background:
Community members have collected or donated article text data from various online communities and news sources. The chosen storage should be flexible enough to allow the data model to change over time but structured enough to enable analysts to search data across multiple sources. New data is uploaded to S3 daily.
Examples of data we need to store:
Some basic cleaning/standardization has already been done. Current data is in the format below and stored in JSON files on S3.
CURRENT data model
Required
- language: language of the text
- url: the url of an article
- text_blob: body of article/text
- source: source website

Optional/If exists

- authors: the authors of a particular article
- pub_date: date the article was published. (Format: YYYY-MM-DD)
- pub_time: time the article was published. Time should be stored in UTC if possible. (Format: HH:MM:SSZ)
- title: the headline of a page/article/news item
- lead: opening paragraph, initial bolded text or summary
- hrefs: list of hrefs extracted from article or text
- meta: Non-standard field. This field contains data specific to the source. May contain embedded json objects. Analysts should make sure they understand the data model used before relying on this field, as it may differ across sources.
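For illustration only, a hypothetical record in this model (every value below is made up) could be inserted like so:

```python
from pymongo import MongoClient

# Hypothetical example record; all values are invented for illustration.
article = {
    "language": "en",
    "url": "http://example.com/some-article",
    "text_blob": "Full body of the article or text...",
    "source": "example.com",
    "authors": ["Jane Doe"],
    "pub_date": "2017-01-15",
    "pub_time": "14:30:00Z",
    "title": "Example headline",
    "lead": "Opening paragraph or summary.",
    "hrefs": ["http://example.com/related-article"],
    "meta": {"section": "politics"},  # source-specific; shape may vary by source
}

# Placeholder connection string and db/collection names.
MongoClient("mongodb://localhost:27017")["far_right"]["articles"].insert_one(article)
```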