StephenOTT commented 7 years ago

https://travel.gc.ca/returning/border-times

https://travel.gc.ca/travelling/border-times-us

StephenOTT commented 7 years ago

Screenshot captures

[Screenshot: Canada to U.S. border wait times, https://travel.gc.ca/travelling/border-times-us]

[Screenshot: U.S. to Canada border wait times, https://travel.gc.ca/returning/border-times]

StephenOTT commented 7 years ago

Considerations:

The text in the boxes is likely freeform.
Where does the content come from? Do the border crossings add it directly, or does it flow through a central office that reads it from another system or calls the border crossings?

StephenOTT commented 7 years ago

Open Data source

http://open.canada.ca/data/en/dataset/d4a716f5-a2fc-4c3c-88ed-451fe05900e4

See the comments there, which note that the US to Canada wait times come from the US-controlled website.

[ ] Look into how the US to Canada wait times are actually updated on the Canada.ca-controlled website.

CSV parsing: http://papaparse.com

CSV Parsing issues (updated: Nov 8 2017)

The CSV files published in Open Data are horrible:
They use semicolons as delimiters, with no quotes around the headers or values.
Every row has an extra semicolon at the end.
All rows except the first have an extra space at the end, before the line break.
There are double semicolons, apparently because whoever creates the semicolon file added extra columns between each column...
🔥 The times use non-standard timezone abbreviations, for example 2017-11-08 17:48 AST. AST is not commonly accepted by parsing libraries because it is ambiguous (does AST stand for Atlantic or Alaska?).
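One possible way around the timezone issue is to map each abbreviation to a fixed UTC offset before handing the value to a date parser. This is only a sketch: the lookup table and the function names are hypothetical, it assumes AST means Atlantic Standard Time in this dataset, and the other abbreviations are guesses at what the file might contain.

```javascript
// Hypothetical lookup table: abbreviation -> UTC offset in minutes.
// Assumption: in this dataset AST means Atlantic Standard Time (UTC-4).
var TZ_OFFSET_MINUTES = {
  NST: -210, AST: -240, EST: -300, CST: -360, MST: -420, PST: -480,
  NDT: -150, ADT: -180, EDT: -240, CDT: -300, MDT: -360, PDT: -420
};

function pad2(n) {
  return (n < 10 ? '0' : '') + n;
}

// Converts "2017-11-08 17:48 AST" to "2017-11-08T17:48:00-04:00".
// Returns null when the string or the abbreviation is not recognized.
function toIsoWithOffset(stamp) {
  var m = /^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}) ([A-Z]{3,4})$/.exec(String(stamp).trim());
  if (!m || !TZ_OFFSET_MINUTES.hasOwnProperty(m[3])) {
    return null;
  }
  var offset = TZ_OFFSET_MINUTES[m[3]];
  var sign = offset < 0 ? '-' : '+';
  var abs = Math.abs(offset);
  return m[1] + 'T' + m[2] + ':00' + sign + pad2(Math.floor(abs / 60)) + ':' + pad2(abs % 60);
}
```

The result is an ISO 8601 string with an explicit offset, which standard date parsers accept. Returning null for unknown abbreviations keeps bad rows visible instead of silently guessing.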

Solutions

Used the following regex and JS to clean the CSV:

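// Load Papa Parse from the CDN into the script scope
load('https://cdnjs.cloudflare.com/ajax/libs/PapaParse/4.3.6/papaparse.min.js');

// Collapse each ";;" (plus an optional following space) into a single ";"
function removeDoubleSemicolons(csvString){
  return csvString.replace(/;;[ ]?/g, ';');
}

// Drop the extra ";" at the end of every line, along with any spaces after it
function removeLineEndingSemicolons(csvString){
  return csvString.replace(/;[ ]*$/gm, '');
}

// response: the HTTP response body (here, the raw CSV text)
var csvString = response;
var pass1 = removeDoubleSemicolons(csvString);
var pass2 = removeLineEndingSemicolons(pass1);

// Parse the cleaned CSV; the first row supplies the property names
var json = Papa.parse(pass2, {
  "header": true,
  "delimiter": ";",
  "skipEmptyLines": true
});

// Store only the parsed rows as a Spin JSON variable
connector.setVariable('borderWaitTimes', S(JSON.stringify(json.data)));

// Last expression: the full parse result (data, errors, meta) as Spin JSON
S(JSON.stringify(json));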
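To see what the two cleanup passes do, here is a self-contained check that runs without Papa Parse or the connector. The sample text is invented to mimic the problems listed above (it is not real data), the function names are hypothetical, and a plain split stands in for the CSV parser. The line-ending pattern here also tolerates spaces after the final semicolon.

```javascript
// Invented sample mimicking the problems listed above (not real data).
var sample =
  'Office;; Location;; Updated;\n' +
  'Office A;; City A;; 2017-11-08 17:48 AST; \n';

// Collapses ";;" (plus an optional space) to ";" and drops the trailing
// ";" and any spaces after it on every line.
function cleanSemicolonCsv(csvString) {
  return csvString
    .replace(/;;[ ]?/g, ';')
    .replace(/;[ ]*$/gm, '');
}

// Naive split standing in for Papa Parse; only safe because the
// file has no quoted fields.
function toRows(cleaned) {
  return cleaned.split('\n')
    .filter(function (line) { return line.length > 0; })
    .map(function (line) { return line.split(';'); });
}

var rows = toRows(cleanSemicolonCsv(sample));
// rows[0] is the header row, rows[1] the first data row
```

A naive split like this would break on quoted fields containing semicolons, which is one reason to keep a real CSV parser in the actual script.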

StephenOTT commented 7 years ago

HTML Parsing

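Using jsoup:

// Read the HTML string from the 'html-response' property of the 'htmlResponse' process variable
var html = execution.getVariable('htmlResponse').prop('html-response').value();

with (new JavaImporter(org.jsoup)) {
  // Parse the HTML into a jsoup Document
  var htmlJsoup = Jsoup.parse(html);

  // Page title, evaluated as the last expression of the script
  htmlJsoup.title();

}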

StephenOTT commented 7 years ago

Duration parsing

https://github.com/domchristie/juration

https://cdn.rawgit.com/domchristie/juration/master/juration.js
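For reference, the kind of conversion needed here looks roughly like the sketch below. It is a hand-rolled stand-in, not juration's API, and it returns minutes. The function name is hypothetical, and the input wordings ("No delay", "10 minutes", "1 hour 20 minutes") are assumptions about what the wait-time fields contain.

```javascript
// Hand-rolled stand-in for a duration parser (not juration's API).
// Returns the wait in minutes, or null when nothing recognizable is found.
function parseWaitToMinutes(text) {
  var s = String(text).toLowerCase().trim();
  if (s === 'no delay') {
    return 0; // assumed wording for "no wait"
  }
  // A number followed by an hour or minute unit, e.g. "1 hour", "20 mins"
  var re = /(\d+)\s*(hours?|hrs?|h|minutes?|mins?|m)\b/g;
  var total = 0;
  var found = false;
  var m;
  while ((m = re.exec(s)) !== null) {
    found = true;
    var n = parseInt(m[1], 10);
    // Units starting with "h" are hours; everything else is minutes
    total += m[2].charAt(0) === 'h' ? n * 60 : n;
  }
  return found ? total : null;
}
```

Returning null for unrecognized text matters because the boxes are likely freeform, so some values will not be durations at all.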

StephenOTT commented 7 years ago

BPMN

[Diagram: v0.1 canada-usa-border-wait-times]

StephenOTT commented 7 years ago

HTML parsing

used Jsoup

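// Fetch the page with jsoup and return it serialized as XHTML
function getUrlAsXhtmlString(url)
{
  with (new JavaImporter(org.jsoup))
  {
    var doc = Jsoup.connect(url).get();
    // Switch the output syntax to XML so the serialized document is XHTML
    doc.outputSettings().syntax(Java.type("org.jsoup.nodes.Document.OutputSettings.Syntax").xml);
    var docString = doc.toString();

    return docString;
  }
}

// Wrap the XHTML string as a Spin value and store it as the 'html' process variable
function generateSpinVariables(xHtmlString)
{
  // Use the parameter; docString only exists inside getUrlAsXhtmlString
  var htmlSpin = S(xHtmlString);
  execution.setVariable('html', htmlSpin);
}

function scrape(url)
{
  var xHtmlString = getUrlAsXhtmlString(url);
  generateSpinVariables(xHtmlString);
}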

scrape('http://www2.nrcan-rncan.gc.ca/dc-dpm/index.cfm?fuseaction=r.q&lang=eng');

Special note:

doc.outputSettings().syntax(Java.type("org.jsoup.nodes.Document.OutputSettings.Syntax").xml);

The syntax() setter takes a Java enum value, and the enum type has to be obtained explicitly in the script with Java.type: Java.type("org.jsoup.nodes.Document.OutputSettings.Syntax").xml. Based on https://stackoverflow.com/a/29039163, https://jsoup.org/apidocs/org/jsoup/nodes/Document.OutputSettings.Syntax.html, and https://stackoverflow.com/a/29087437

StephenOTT commented 7 years ago

xPath

To review: https://github.com/camunda/camunda-spin/issues/16#issuecomment-319944327

Xpath query with Camunda SPIN: https://docs.camunda.org/manual/7.7/reference/spin/xml/04-querying-xml/

DigitalState / WhatsTheWait

Canada/USA Border Wait Times #9
