Closed. RexYuan closed this issue 8 years ago.
Following the NuSphere tutorial on the tidy library. Preparations: brew install php56-tidy
Following the NuSphere tutorial on the SimpleXML extension. Preparations: none; the SimpleXML extension is pre-installed and enabled by default.
Following the W3Schools tutorial on XPath. Preparations: basic knowledge of HTML, XML [, XSL family [, DOM [, history of the web, e.g., W3C]]]
I've figured out the technique. Some sample code:
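<?php
// Fetch a page, clean it up with tidy, then query it with SimpleXML + XPath.
$source = "http://rexyuan.github.io/Rex-one-page-DIY/"; // or a local copy: "./source.html"
$config = ["indent" => true,
           "output-xhtml" => true,      // emit XHTML so SimpleXML can parse it
           "numeric-entities" => true]; // write entities in numeric form, which XML accepts
$encoding = "utf8";
$tidy = new tidy;
$tidy->parseFile($source, $config, $encoding);
$tidy->cleanRepair();
// Cast explicitly: SimpleXMLElement expects the document as a string.
$dom = new SimpleXMLElement((string)$tidy);
// XHTML elements live in a default namespace, so XPath needs a registered prefix.
$dom->registerXPathNamespace("xhtml", "http://www.w3.org/1999/xhtml");
//file_put_contents("./source.html", $dom->asXML()); // uncomment to save a local copy
$r = $dom->xpath("/xhtml:html/xhtml:head/xhtml:title");
print(trim((string)$r[0]));
?>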
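A minimal sketch of why the registerXPathNamespace call is there, assuming the XHTML carries the usual default namespace declaration on its html element. The inline document and the variable names are made up for illustration; no network access or tidy is needed to run it.

```php
<?php
// Hypothetical inline document standing in for tidy's XHTML output.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml">'
       . '<head><title> Demo </title></head><body/></html>';
$dom = new SimpleXMLElement($xhtml);

// Unprefixed names in XPath 1.0 mean "no namespace", so this matches nothing.
$unprefixed = $dom->xpath("/html/head/title");

// After binding a prefix to the XHTML namespace URI, the same path matches.
$dom->registerXPathNamespace("xhtml", "http://www.w3.org/1999/xhtml");
$prefixed = $dom->xpath("/xhtml:html/xhtml:head/xhtml:title");

echo count($unprefixed), " match(es) without prefix\n";
echo trim((string)$prefixed[0]), "\n";
```

The prefix itself ("xhtml") is arbitrary; only the namespace URI has to match the one declared in the document.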
Found two bugs causing errors in SimpleXMLElement namespace parsing:
Trying to fix bug 2 first by stripping it out with regex, following RegexOne's tutorial.
Regex solution to bug 2. See: Regex101 test
/\<o:p\>(.*)<\/o:p>/s
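One thing to watch with a pattern of this shape: (.*) is greedy, and with the s flag it also crosses newlines, so on a page with several o:p elements it can swallow everything between the first opening tag and the last closing tag. The sketch below compares it with a lazy variant on a made-up two-paragraph snippet; the sample string is an assumption, not taken from a real course page.

```php
<?php
// Hypothetical Word-style fragment with two <o:p> elements.
$html = "<p>Intro<o:p></o:p></p>\n<p>Body<o:p>&nbsp;</o:p></p>";

// Greedy: one match spanning from the first <o:p> to the last </o:p>.
$greedy = preg_replace('/<o:p>(.*)<\/o:p>/s', '', $html);

// Lazy: each <o:p>...</o:p> pair is matched and removed on its own.
$lazy = preg_replace('/<o:p>.*?<\/o:p>/s', '', $html);

echo "greedy: ", $greedy, "\n";
echo "lazy:   ", $lazy, "\n";
```

If a page only ever contains one o:p element the two behave the same, so whether this matters depends on the input.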
As an aside, found some information regarding the
Regex solution to bug 1. See: Regex101 test
/(mso(.*?);)/sg
Note that the g (global) flag is not available in PHP regex, so the pattern becomes /(mso(.*?);)/s. No loop around preg_match is needed: preg_replace replaces every match by default.
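For reference, preg_replace takes an optional limit argument that defaults to -1 (no limit), so a single call already behaves like a global replace. A small sketch on a made-up style string (the sample value is hypothetical):

```php
<?php
// Hypothetical inline style as Word might emit it.
$style = 'margin-bottom:0cm;mso-pagination:none;mso-layout-grid-align:none;font-size:12pt;';

// One call, no g flag, no loop; $count receives the number of replacements made.
$cleaned = preg_replace('/mso.*?;/s', '', $style, -1, $count);

echo $cleaned, "\n";
echo $count, " declaration(s) removed\n";
```

Note that the pattern matches "mso" anywhere, not only at the start of a property name, so it is a blunt tool.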
New namespace bugs popping up; reasons unknown. A short excerpt of the warning messages:
Warning: SimpleXMLElement::__construct(): ^ in /Users/Rex/Desktop/scrap.php on line 28
PHP Warning: SimpleXMLElement::__construct(): namespace error : Failed to parse QName 'margin-bottom:' in /Users/Rex/Desktop/scrap.php on line 28
Warning: SimpleXMLElement::__construct(): namespace error : Failed to parse QName 'margin-bottom:' in /Users/Rex/Desktop/scrap.php on line 28
PHP Warning: SimpleXMLElement::__construct(): margin-bottom:=""> in /Users/Rex/Desktop/scrap.php on line 28
Warning: SimpleXMLElement::__construct(): margin-bottom:=""> in /Users/Rex/Desktop/scrap.php on line 28
PHP Warning: SimpleXMLElement::__construct(): ^ in /Users/Rex/Desktop/scrap.php on line 28
Further investigation is required
I've fixed it! Fuck yeah! I'll include details later
So basically I didn't fix the mess Word makes. I used regex to completely strip out the part that may contain the Word mess, the 「二、教學大綱」 ("II. Syllabus") section. The regex I used is:
/(\s*)<table width="900">\n(\s*)<tr>\n(\s*)<td\salign="left">\n(\s*)<b>二、教學大綱<\/b>\n(\s*)<\/td>\n(\s*)<\/tr>\n(\s*)<\/table>\n(.*)\n(\s*)<\/table><br\s\/>\n(\s*)<br\s\/>/s
The 30 course pages that errored during the last run of the scraper all passed testing without any problem. There seems to be no major or obvious bug at the moment.
Here's the code for testing:
$error_classes = [1 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=01UG004&courseGroup=B&deptCode=GU&formS=&classes1=&deptGroup",
2 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=02UG001&courseGroup=F&deptCode=GU&formS=&classes1=&deptGroup",
3 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=03UG017&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
4 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=03UG024&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
5 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=04UG007&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
6 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=04UG017&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
7 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG001&courseGroup=C&deptCode=GU&formS=&classes1=&deptGroup",
8 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG007&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
9 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG011&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
10 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG012&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
11 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0AUG418&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
12 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0AUG426&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
13 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0HUG223&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
14 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0HUG502&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
15 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0HUG640&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
16 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0NUG246&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
17 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0SUG514&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
18 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0SUG523&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
19 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=BIU0177&courseGroup=A&deptCode=SU43&formS=&classes1=&deptGroup",
20 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=BIU0178&courseGroup=A&deptCode=SU43&formS=3&classes1=&deptGroup",
21 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0002&courseGroup=&deptCode=EU07&formS=1&classes1=1&deptGroup",
22 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0006&courseGroup=&deptCode=EU07&formS=1&classes1=1&deptGroup",
23 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0364&courseGroup=&deptCode=EU07&formS=3&classes1=1&deptGroup",
24 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0366&courseGroup=&deptCode=EU07&formS=1&classes1=1&deptGroup",
25 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CHU0183&courseGroup=&deptCode=LU20&formS=1&classes1=&deptGroup",
26 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CHU0250&courseGroup=&deptCode=LU20&formS=1&classes1=&deptGroup",
27 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CHU0286&courseGroup=&deptCode=LU20&formS=1&classes1=&deptGroup",
28 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CLU0026&courseGroup=&deptCode=IU84&formS=4&classes1=&deptGroup",
29 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=EAU0184&courseGroup=&deptCode=IU83&formS=1&classes1=&deptGroup",
30 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=EDU0002&courseGroup=&deptCode=EU00&formS=2&classes1=1&deptGroup"];
foreach ($error_classes as $index => $test_class)
{
    $source = $test_class;
    $config = ["output-xhtml" => true,
               "indent" => true,
               "indent-attributes" => true,
               "numeric-entities" => true,
               "bare" => true,       // strip Microsoft-specific HTML
               "clean" => true,
               "word-2000" => true,  // strip the surplus markup Word 2000 inserts
               "wrap" => 0];         // disable line wrapping
    $encoding = "utf8";
    $tidy = new tidy;
    $tidy->parseFile($source, $config, $encoding);
    $tidy->cleanRepair();
    // Strip the whole 「二、教學大綱」 section, which may contain the Word markup that breaks parsing.
    $xhtml = preg_replace('/(\s*)<table width="900">\n(\s*)<tr>\n(\s*)<td\salign="left">\n(\s*)<b>二、教學大綱<\/b>\n(\s*)<\/td>\n(\s*)<\/tr>\n(\s*)<\/table>\n(.*)\n(\s*)<\/table><br\s\/>\n(\s*)<br\s\/>/s', '', (string)$tidy);
    $dom = new SimpleXMLElement($xhtml);
    $dom->registerXPathNamespace("xhtml", "http://www.w3.org/1999/xhtml");
    $r = $dom->xpath("/xhtml:html/xhtml:body/xhtml:div/xhtml:table[4]/xhtml:tr[1]/xhtml:td[4]");
    // Print the index and the extracted cell, one page per line.
    echo $index . " " . trim((string)$r[0]) . "\n";
}
I'm closing this issue because the bug is solved (albeit by duct-taping). Further optimization of the communication with the database is still required for this run of the update.
1: The number of combinations is intractable if the approach is to check every possibility (brute force over the whole search space): 9999 (code) * 27 (course group) * 5 (grade) * 12 (class) * 27 (dept group) = 437356260. Without changing the fundamental method (that is, without changing the asymptotic running time), I tried to optimize the solution by ignoring rare cases, such as course groups above 5, classes above 2, or dropping the dept group entirely.
2: For unknown reasons, converting some pages to XHTML produces warnings or errors. The warnings are a series of "Warning: simplexml_load_string(): ..." messages, sometimes followed by a fatal error: "Fatal error: Call to a member function registerXPathNamespace() on a non-object". The fatal error occurs because simplexml_load_string() returns false when parsing fails, and the method is then called on that false value.
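For note 2, one way to avoid the fatal error is to check the return value of simplexml_load_string() before using it and to collect libxml's messages instead of letting them surface as warnings. This is only a sketch: the helper name load_xhtml_or_null and the sample input are hypothetical, and it does not address why the pages fail to parse in the first place.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null if the input cannot be parsed.
// Parser messages are appended to $messages rather than printed as warnings.
function load_xhtml_or_null($xhtml, &$messages = array())
{
    $previous = libxml_use_internal_errors(true); // buffer errors internally
    $dom = simplexml_load_string($xhtml);
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting
    return ($dom === false) ? null : $dom;
}

// Made-up malformed input: the <p> element is never closed.
$messages = array();
$dom = load_xhtml_or_null('<html><body><p>unclosed</body></html>', $messages);
if ($dom === null) {
    echo "skipped page: ", implode("; ", $messages), "\n";
} else {
    $dom->registerXPathNamespace("xhtml", "http://www.w3.org/1999/xhtml");
}
```

In a scraper loop this lets a bad page be logged and skipped while the remaining pages are still processed.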