Closed adamdecaf closed 8 years ago
From previous projects
* Data Topics
** Incoming
*** projects
**** metrics lib
***** timing metrics
****** Timing("some.key.thing") { body }
**** graphite, grafana deployable
***** store dashboards
**** logstash, kibana, elasticsearch
**** http layer
***** store results in s3 and postgres
***** batches req/resp
***** non blocking
***** hash http request / response ids when stored as s3 keys
**** s3 interface
**** extraction layer(s)
***** xpath, regex
***** email, phone numbers, addresses, names, etc
***** postgres tables for scraped data (scraped_emails)
***** cleaner tables for post-processed / reconciled data
****** graph problem? to find connected data
**** actors lib
***** w/ common patterns
**** pre-populating url layer
**** api for viewing some basic stats / search?
***** will need SSL in front of it, if it's public in any way.
**** encryption lib
***** health check on app start to ensure we can crypt
**** compression lib
**** basebox setup for aws nodes
**** aws nodes, RDS setup
***** put it all inside a VPC?
**** deploy project
***** w/ scripts to set up and start dbs and apps
***** check that the correct number are running
***** holds rsa keys etc
***** holds the "production.conf" files that are the primary config for each app.
**** shared logging volumes
**** private docker hub
*** http://www.whitehouse.gov/open
** People
*** Names (name_id)
*** Aliases (alias_id, name_id)
*** Stats (gender: m, f, o, sex: m, f, o, age)
*** Locations (street, street 2, city, state, zip, country, territory, etc.)
*** Relatives
*** Friends
*** Work History
** Places
** Crimes
*** Locations (place_id)
** Writings
** Animals
** Logins / Profile
*** profile_id, Username, email, password
** Images
*** profile_id (optional)
** Companies
** Meta-Project
*** codahale metrics setup
*** health checks setup
*** layers
**** http
**** encryption
**** compression
*** deployables
**** website crawler
***** canonical urls
***** versioning of request / response
***** storage of request / response in S3
**** api crawler
***** versioning of request / response
***** storage of request / response in S3
**** html extraction (xpath, regex)
***** js, css file extraction
***** link extraction
****** feeds back into api / web crawlers
* Services / Tools
** https://github.com/begriffs/postgrest
* Deployment
** Java
*** Roll gc logs: https://jyates.github.io/2012/11/05/rolling-java-gc-logs.html
* Databases
** CANCELLED sort out differences between
   CLOSED: [2015-01-17 Sat 12:27]
*** foundationdb
**** https://foundationdb.com/
*** postgres
**** http://www.postgresql.org/
*** neo4j
**** http://neo4j.com/
*** titan
**** https://github.com/thinkaurelius/titan
**** https://thinkaurelius.github.io/titan/
*** rocksdb
**** https://github.com/facebook/rocksdb/wiki
*** redis
**** http://redis.io
*** flockdb
**** https://github.com/twitter/flockdb
*** MapDB
**** https://github.com/jankotek/MapDB
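
The `Timing("some.key.thing") { body }` note under the metrics lib could look roughly like the sketch below. This is only an illustration: the in-memory `samples` map and the `recorded` accessor are hypothetical stand-ins, and a real version would presumably report to whatever metrics registry the project settles on (the outline mentions codahale metrics).

```scala
import scala.collection.mutable

// Hypothetical helper: runs `body`, records how long it took under `key`,
// and returns the body's result unchanged.
object Timing {
  // key -> recorded durations in nanoseconds.
  // Assumption: an in-memory map standing in for a real metrics registry.
  private val samples = mutable.Map.empty[String, Vector[Long]]

  def apply[A](key: String)(body: => A): A = {
    val start = System.nanoTime()
    try body
    finally {
      // Recorded even when `body` throws, so failures are timed too.
      val elapsed = System.nanoTime() - start
      samples.synchronized {
        samples.update(key, samples.getOrElse(key, Vector.empty) :+ elapsed)
      }
    }
  }

  // Hypothetical accessor used to inspect what was recorded.
  def recorded(key: String): Vector[Long] =
    samples.synchronized(samples.getOrElse(key, Vector.empty))
}
```

Usage would then match the note as written, for example `val total = Timing("some.key.thing") { 21 * 2 }`. The by-name parameter (`body: => A`) is what lets the call site use the block syntax.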
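
For the "hash http request / response ids when stored as s3 keys" item, one possible approach is sketched below. The choice of SHA-256, the `S3Keys` object, and the `kind/` prefix layout are all assumptions made for illustration; the notes do not say which hash or key layout was intended.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper for deriving S3 object keys from request / response ids.
object S3Keys {
  // Hex-encoded SHA-256 of the id (assumption: the notes name no hash function).
  def sha256Hex(id: String): String = {
    val digest = MessageDigest
      .getInstance("SHA-256")
      .digest(id.getBytes(StandardCharsets.UTF_8))
    digest.map(b => "%02x".format(b)).mkString
  }

  // Assumed layout: "<kind>/<hash>", e.g. "requests/ba78...".
  def keyFor(kind: String, id: String): String =
    s"$kind/${sha256Hex(id)}"
}
```

Hashing gives fixed-length keys that contain only hex characters, whatever the original id looked like, and spreads keys evenly across prefixes.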
Useful links:
https://github.com/begriffs/postgrest
https://jyates.github.io/2012/11/05/rolling-java-gc-logs.html
http://neo4j.com/
https://github.com/thinkaurelius/titan
https://github.com/facebook/rocksdb
https://github.com/twitter/flockdb
https://github.com/jankotek/MapDB