Rillus / walthamstuff-wp

Walthamstuff Wordpress repo
Other
0 stars 0 forks source link

Scrape / Ingest events from known websites #1

Open JohnField opened 6 years ago

JohnField commented 6 years ago

Thanks again for the Waltham Forest Hackathon last night! Here's some notes / code form our team for going forward:

Obviously this is hacky (clue's in the name) and security would be an issue. Improved ways to do this include making a page in wordpress that handles the logic (e.g. https://wordpress.org/plugins/insert-php/ ) or an "event create" form that Flask posts to ( https://www.inkthemes.com/how-you-can-easily-create-customized-form-in-wordpress/ ). In this way, WP can also check if the request comes from a known user and perform other application logic rather than a direct SQL insertion.

JohnField commented 6 years ago

This is the docker-compose.yml we used to spin up a local instance of wordpress (as well as python / mongoDB, but don't think that was used).

This was useful for working on a local copy - if you're happy to test this directly against the live site, it's not needed.

See https://docs.docker.com/compose/wordpress/ for inspiration and Docker docs. Once you've started the containers, visiting localhost:8000 takes you to a new WP, which will run through the install (creating a new DB) and then, because you are using the walthamstuff codebase, install needed plugins.

version: "3.1"

services:
   wordpress:
     depends_on:
       - mysql
     image: wordpress:latest
     container_name: "wordpress"
     volumes:
       # symlink wordpress_files to a local checkout of https://github.com/Rillus/walthamstuff-wp
       # docker-compose exec wordpress bash
       - .wordpress_files:/var/www/html
     ports:
       - "8000:80"
     restart: always
     environment:
       WORDPRESS_DB_HOST: mysql:3306
       WORDPRESS_DB_USER: wordpress
       WORDPRESS_DB_PASSWORD: wordpress

   mysql:
     image: mysql:5.7
     container_name: "mysql"
     volumes:
       - mysql_data:/var/lib/mysql
     ports:
     - 33306:3306
     restart: always
     environment:
       MYSQL_ROOT_PASSWORD: wordpress
       MYSQL_DATABASE: wordpress
       MYSQL_USER: wordpress
       MYSQL_PASSWORD: wordpress

   web:
     image: python:3.5
     container_name: "web"
     #build: .
     command: python -u app.py
     ports:
       - "5000:5000"
     volumes:
       - .:/todo
     links:
       - mongodb

   mongodb:
    image: mongo:3.5
    container_name: "mongodb"
    environment:
     - MONGO_DATA_DIR=/data/db
     - MONGO_LOG_DIR=/dev/null
    volumes:
     - .mongo_data:/data/db
    ports:
     - 27017:27017
    command: mongod --smallfiles --logpath=/dev/null # --quiet

volumes:
    mysql_data:
    mongo_data:
JohnField commented 6 years ago

These are real examples of the SQL tables that are used. The events plugin creates a WP blog post first, then info about the event in a custom table that is related to the post by its ID.

One useful trick here is to set the post to draft and in the future - WP's auto-publishing function then picks it up and publishes it, sends it to Buffer, etc. This can be useful for someone (Riley?) to review / delete draft events before they go live. There's also a GUID field that could be set by the importer for duplicate post detection.

# Inject a WP post. Note:
# the hardcoded post ID
# the post type is event
# has a draft status and post_date in the future (so when WP publishes it, it handles sending to Buffer)
INSERT INTO `wordpress`.`wp_posts`
(`ID`,
`post_author`,
`post_date`,
`post_date_gmt`,
`post_content`,
`post_title`,
`post_excerpt`,
`post_status`,
`comment_status`,
`ping_status`,
`post_password`,
`post_name`,
`to_ping`,
`pinged`,
`post_modified`,
`post_modified_gmt`,
`post_content_filtered`,
`post_parent`,
`guid`,
`menu_order`,
`post_type`,
`post_mime_type`,
`comment_count`)
VALUES (100,1,'2017-12-01 21:03:24','2017-12-01 21:03:24','injected_item','injected_item','','draft','open','closed','','baz','','','2017-12-17 21:03:24','2017-11-17 21:03:24','',0,'http://localhost:8000/?post_type=event&p=11',0,'event','',0);

# Inject the event that builds on the post
# Event Attributes can be wp_postmeta values, e.g. a:1:{s:22:"al2fb_facebook_exclude";s:1:"1";}
INSERT INTO `wordpress`.`wp_em_events`
(`event_id`,
`post_id`,
`event_slug`,
`event_owner`,
`event_status`,
`event_name`,
`event_start_time`,
`event_end_time`,
`event_all_day`,
`event_start_date`,
`event_end_date`,
`post_content`,
`event_rsvp`,
`event_rsvp_date`,
`event_rsvp_time`,
`event_rsvp_spaces`,
`event_spaces`,
`event_private`,
`location_id`,
`recurrence_id`,
`event_category_id`,
`event_attributes`,
`event_date_created`,
`event_date_modified`,
`recurrence`,
`recurrence_interval`,
`recurrence_freq`,
`recurrence_byday`,
`recurrence_byweekno`,
`recurrence_days`,
`recurrence_rsvp_days`,
`blog_id`,
`group_id`)
 VALUES (5,100,'injected_item',1,NULL,'injected_item','00:00:00','00:00:00',0,NULL,NULL,'quux',0,NULL,'00:00:00',NULL,0,0,NULL,NULL,NULL,'a:0:{}','2017-11-17 21:03:24',NULL,0,NULL,NULL,NULL,NULL,0,NULL,NULL,0);
JohnField commented 6 years ago

Events can also have a location. These seem to be custom to the events plugin and are created with a one-to-one relationship to the event ID / blog ID.

# Set the location_id to 0 if no location
# Or, create a location in wp_em_locations and then relate to it one-to-one, e.g.

INSERT INTO `wordpress`.`wp_em_locations`
(`location_id`,
`post_id`,
`blog_id`,
`location_slug`,
`location_name`,
`location_owner`,
`location_address`,
`location_town`,
`location_state`,
`location_postcode`,
`location_region`,
`location_country`,
`location_latitude`,
`location_longitude`,
`post_content`,
`location_status`,
`location_private`)
VALUES
('1', '101', '0', NULL, 'walthamstow', '1', NULL, NULL, NULL, 'E17', NULL, 'GB', '51.590176', '-0.017344', NULL, NULL, '0');
JohnField commented 6 years ago

And, here's the PHP script written about midnight - call this wp-hackathon.php, put it in the root of your directory, and make a call against it. Again, not good security.

<?php
/**
 * WF Hackathon
 * Accept a JSON payload and insert a post based on it! Mmm, secure.
 */

if (!empty($_POST)) {
    die('Invalid request');
}

// This could be a header.
if ($_GET['md5-hash']!=='foo') {
    die('Invalid request');
}

//Get our body.
$entityBody = json_decode(file_get_contents('php://input'));
if (!$entityBody) {
    die('Invalid request');
}

// Well, after our rigorous validation, post the item.
// https://codex.wordpress.org/Class_Reference/wpdb
require_once('wp-config.php');
require_once('wp-load.php');
global $wpdb;

// First, dupe detection.

$post_id = $wpdb->get_var( "SELECT id FROM {$wpdb->prefix}posts WHERE guid = ".  $_POST['guid']);
if ($post_id) {
    die('Post already exists');
}

print_r('Making post');
//print_r($entityBody);

$postBody = array(
    'post_author' => 1,//$_POST['post_author,
    'post_date'=> $entityBody->post_date,
    'post_date_gmt'=> $entityBody->post_date_gmt,
    'post_content'=> $entityBody->post_content,
    'post_title'=> $entityBody->post_title,
    'post_status'=> $entityBody->post_status,
    'post_modified'=> $entityBody->post_modified,
    'post_modified_gmt'=> $entityBody->post_modified_gmt,
    'post_excerpt'=> $entityBody->post_excerpt,
    'post_content_filtered'=> $entityBody->post_content_filtered,
    'to_ping'=> $entityBody->to_ping,
    'pinged'=> $entityBody->pinged,
    'guid'=> $entityBody->guid,
    'post_type'=> 'event'
);

//print_r($postBody);
$affected_rows = $wpdb->insert( $wpdb->prefix . "posts",$postBody);

if ($affected_rows !=1) {
    die('Error creating post - '.  $wpdb->last_error);
}
$post_id = $wpdb->insert_id;
print_r("Created post $post_id");
// Event has very few required fields, suppress errors if not exist.
$postBody = array(
    'post_id' => $post_id,
    'event_slug'=> $entityBody->event_slug,
    'event_name' => $entityBody->event_name,
    'event_start_date' => @$entityBody->event_start_date,
    'event_end_date' => @$entityBody->event_end_date
);
//print_r($postBody);
$affected_rows = $wpdb->insert( $wpdb->prefix . "em_events",$postBody);

if ($affected_rows !=1) {
    die('Error creating event - '.  $wpdb->last_error);
}
$event_id = $wpdb->insert_id;
print_r("Created event $event_id");

//We're done!
die("Created event - post ID  $post_id");

You can then make a super-secure POST to it with e.g.

curl -X POST \
  'http://localhost:8000/wp-hackathon.php?md5-hash=foo' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
            "post_date": "2017-11-17",
            "post_date_gmt": "2017-11-17",
            "post_content": "foo",
            "post_title": "foo",
            "post_status": "foo",
            "post_modified": "2017-11-17",
            "post_modified_gmt": "2017-11-17 00:00:01",
            "post_excerpt": "foo",
            "post_content_filtered": "foo",
            "guid": "foobarbaz",
            "to_ping": "foo",
            "pinged": "foo",
            "event_slug": "baz",
            "event_name": "baz",
            "event_start_date": "2017-11-17",
            "event_end_date":"2017-11-17"
          }'

If there's a post with the same GUID, it assumes it's a duplicate and ignores (GUID prepared by the scraper and is a VARCHAR(255) field)

Obvious improvements include: