RexYuan / courseNTNU

(Discontinued) NTNU course rating catalog.
The Unlicense

Scraper Problems #5

Closed RexYuan closed 8 years ago

RexYuan commented 9 years ago

1: Checking every possible combination is infeasible, even though the approach is linear in the number of combinations (O(n)): 9999 (code) * 27 (course group) * 5 (grade) * 12 (class) * 27 (dept group) = 437356260. Without changing the fundamental method (that is, without changing the asymptotic running time of the solution), I tried to optimize it by ignoring rare cases (such as a course group of more than 5, a class of more than 2, or ignoring the dept group completely).

2: For an unknown reason, some pages produce warnings or errors when they are converted to XHTML. The warnings may be a series of "Warning: simplexml_load_string(): ..." messages, sometimes followed by a fatal error: "Fatal error: Call to a member function registerXPathNamespace() on a non-object".
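The estimate in item 1 can be reproduced with a few lines of PHP. This is only a sketch: the array keys are illustrative names, and the pruned ranges (course group capped at 5, class capped at 2, dept group dropped) are one possible reading of the "rare cases" mentioned above, not the scraper's actual settings.

```php
<?php
// Number of candidate values per request parameter, as quoted in item 1.
$ranges = [
    'courseCode'  => 9999,
    'courseGroup' => 27,
    'grade'       => 5,
    'class'       => 12,
    'deptGroup'   => 27,
];

// Hypothetical helper: the search space is the product of all range sizes.
function searchSpaceSize(array $ranges)
{
    return (int) array_product($ranges);
}

$full = searchSpaceSize($ranges);

// Assumed pruning: cap course group at 5 and class at 2, ignore dept group.
$pruned = searchSpaceSize([
    'courseCode'  => 9999,
    'courseGroup' => 5,
    'grade'       => 5,
    'class'       => 2,
]);

printf("full: %d\npruned: %d\nreduction: %.1fx\n", $full, $pruned, $full / $pruned);
```

Pruning shrinks the constant factor considerably, but the number of requests still grows with the product of the remaining ranges, so the method itself is unchanged.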

RexYuan commented 8 years ago

Following the NuSphere tutorial on the tidy library. Preparations: brew install php56-tidy

RexYuan commented 8 years ago

Following the NuSphere tutorial on the SimpleXML extension. Preparations: none; the SimpleXML extension is pre-installed and enabled by default.

RexYuan commented 8 years ago

Following the W3Schools tutorial on XPath. Preparations: basic knowledge of HTML, XML [, XSL family[, DOM[, history of the web, e.g., W3C]]]

RexYuan commented 8 years ago

I've figured out the technology. Per some sample code:

<?php

  // Page to scrape; swap in "./source.html" to work from a local copy.
  $source   = "http://rexyuan.github.io/Rex-one-page-DIY/";//"./source.html";
  // Have tidy emit indented XHTML with numeric entities so SimpleXML can parse it.
  $config   = ["indent"           => true,
               "output-xhtml"     => true,
               "numeric-entities" => true];
  $encoding = "utf8";
  $tidy = new tidy;
  $tidy->parseFile($source, $config, $encoding);
  $tidy->cleanRepair();

  // XHTML elements live in the XHTML namespace, so XPath needs a registered prefix.
  $dom = new SimpleXMLElement((string)$tidy);
  $dom->registerXPathNamespace("xhtml", "http://www.w3.org/1999/xhtml");
  //file_put_contents("./source.html", $dom->asXML());
  $r = $dom->xpath("/xhtml:html/xhtml:head/xhtml:title");
  print(trim((string)$r[0]));

?>

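Why the sample registers an "xhtml" prefix before querying may not be obvious. A minimal sketch, using an inline XHTML string in place of tidy output (the markup here is made up for illustration): XPath 1.0 treats unprefixed names as belonging to no namespace, so a query without the prefix finds nothing in a document that declares the XHTML default namespace.

```php
<?php
// Stand-in for tidy's XHTML output; the content is invented for this example.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml">'
       . '<head><title> Demo page </title></head><body/></html>';

$dom = new SimpleXMLElement($xhtml);

// Unprefixed names mean "no namespace" in XPath 1.0, so nothing matches.
$unprefixed = $dom->xpath("/html/head/title");

// Bind a prefix to the XHTML namespace and use it in every step of the path.
$dom->registerXPathNamespace("xhtml", "http://www.w3.org/1999/xhtml");
$prefixed = $dom->xpath("/xhtml:html/xhtml:head/xhtml:title");

echo "unprefixed matches: ", count($unprefixed), "\n";
echo "prefixed title: ", trim((string) $prefixed[0]), "\n";
```

The prefix name itself is arbitrary; only the namespace URI has to match the one declared in the document.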
RexYuan commented 8 years ago

Error example: 音樂鑑賞 (Music Appreciation). Debugging.

RexYuan commented 8 years ago

Found two bugs causing errors in SimpleXMLElement namespace parsing:

  1. MSO (Microsoft Word) font tags
  2. "<o:p>" tag of unknown origin

RexYuan commented 8 years ago

Trying to fix bug 2 first by stripping the tag with a regex. Following RegexOne's tutorial.

RexYuan commented 8 years ago

Regex solution to bug 2. See: Regex101 test

/\<o:p\>(.*)<\/o:p>/s

As an aside, I found some information regarding the <o:p> tag. It seems to be another Microsoft Word tag.
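One thing to watch with this pattern: `(.*)` is greedy, and with the `s` flag it also crosses newlines, so a single match runs from the first `<o:p>` to the last `</o:p>` on the page. The sketch below compares it with a lazy variant on a made-up string. The lazy pattern is an assumption offered for comparison, not what the scraper used.

```php
<?php
// Invented sample with two <o:p> pairs and ordinary content between them.
$html = '<p>A<o:p>x</o:p></p><p>KEEP</p><p>B<o:p>y</o:p></p>';

// Pattern from the comment above: greedy, so everything between the
// first <o:p> and the last </o:p> is removed in one match.
$greedy = preg_replace('/\<o:p\>(.*)<\/o:p>/s', '', $html);

// Lazy variant (assumption): each <o:p>...</o:p> pair is removed separately.
$lazy = preg_replace('/<o:p>.*?<\/o:p>/s', '', $html);

echo "greedy: ", $greedy, "\n";
echo "lazy:   ", $lazy, "\n";
```

On this sample the greedy version also removes the `KEEP` paragraph, while the lazy one leaves it in place. Whether that matters depends on how many `<o:p>` tags a page contains.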

RexYuan commented 8 years ago

Regex solution to bug 1. See: Regex101 test

/(mso(.*?);)/sg

Note that the g (global) flag is not available in PHP regex. The solution is to loop while preg_match still finds a match, calling preg_replace on each pass without the g flag.
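For what it's worth, preg_replace replaces every match unless a `$limit` argument is passed, so a single call should give the same result as the loop. A small sketch on an invented style string, comparing the two:

```php
<?php
// Made-up inline style containing two mso-* declarations.
$style = 'mso-bidi-font-size:12.0pt;color:red;mso-fareast-language:ZH-TW;';

// Single call: the default $limit of -1 means "replace all matches".
$single = preg_replace('/(mso(.*?);)/s', '', $style);

// The loop described above: remove one match per pass until none is left.
$looped = $style;
while (preg_match('/(mso(.*?);)/s', $looped)) {
    $looped = preg_replace('/(mso(.*?);)/s', '', $looped, 1);
}

echo $single, "\n";
var_dump($single === $looped);
```

Both approaches leave only `color:red;` here, so the loop is safe but probably not needed.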

RexYuan commented 8 years ago

New namespace bugs are popping up, for reasons unknown. A short excerpt of the warning messages:

Warning: SimpleXMLElement::__construct():                                   ^ in /Users/Rex/Desktop/scrap.php on line 28
PHP Warning:  SimpleXMLElement::__construct(): namespace error : Failed to parse QName 'margin-bottom:' in /Users/Rex/Desktop/scrap.php on line 28

Warning: SimpleXMLElement::__construct(): namespace error : Failed to parse QName 'margin-bottom:' in /Users/Rex/Desktop/scrap.php on line 28
PHP Warning:  SimpleXMLElement::__construct():                        margin-bottom:=""> in /Users/Rex/Desktop/scrap.php on line 28

Warning: SimpleXMLElement::__construct():                        margin-bottom:=""> in /Users/Rex/Desktop/scrap.php on line 28
PHP Warning:  SimpleXMLElement::__construct():                                      ^ in /Users/Rex/Desktop/scrap.php on line 28

Further investigation is required
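While investigating, it can help to collect the libxml messages instead of letting them scroll past as PHP warnings, and to check for a failed parse before calling any method on the result. A minimal sketch under those assumptions; `parseWithErrors` is a hypothetical helper name, and the first input is a synthetic imitation of the stray attribute in the log, not a real course page.

```php
<?php
// Hypothetical helper: returns [SimpleXMLElement|null, list of error strings].
function parseWithErrors($xhtml)
{
    // Buffer libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = simplexml_load_string($xhtml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf("line %d: %s", $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // simplexml_load_string returns false when the input cannot be parsed.
    return [$dom === false ? null : $dom, $messages];
}

// Synthetic input imitating the attribute named in the warnings.
list($dom, $messages) = parseWithErrors('<html><p margin-bottom:="">text</p></html>');
foreach ($messages as $message) {
    echo $message, "\n";
}

// Mismatched tags cannot be parsed, so no object comes back.
list($broken, $brokenMessages) = parseWithErrors('<html><p>unclosed</html>');
var_dump($broken === null);
```

Checking for `null` before calling `registerXPathNamespace()` would also avoid the "on a non-object" fatal error reported at the top of this issue.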

RexYuan commented 8 years ago

I've fixed it! Fuck yeah! I'll include details later

RexYuan commented 8 years ago

So basically I didn't fix the mess Word makes. I used a regex to completely strip out the part that may contain the Word mess, the 「二、教學大綱」 ("II. Syllabus") section. The regex I used is:

/(\s*)<table width="900">\n(\s*)<tr>\n(\s*)<td\salign="left">\n(\s*)<b>二、教學大綱<\/b>\n(\s*)<\/td>\n(\s*)<\/tr>\n(\s*)<\/table>\n(.*)\n(\s*)<\/table><br\s\/>\n(\s*)<br\s\/>/s

Testing the 30 erroneous course pages from the last run of the scraper passed without any problems. There seem to be no major or obvious bugs at the moment.
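To see what the regex removes, here is a sketch that applies it to a small synthetic page. The sample markup is invented to mimic tidy's indented output and is far simpler than a real syllabus page, so treat it as an illustration only.

```php
<?php
// The regex quoted above, unchanged.
$pattern = '/(\s*)<table width="900">\n(\s*)<tr>\n(\s*)<td\salign="left">\n(\s*)<b>二、教學大綱<\/b>\n(\s*)<\/td>\n(\s*)<\/tr>\n(\s*)<\/table>\n(.*)\n(\s*)<\/table><br\s\/>\n(\s*)<br\s\/>/s';

// Invented sample: a section heading table, a body table with Word residue,
// then unrelated content that should survive.
$page = implode("\n", [
    '<div>',
    '  <table width="900">',
    '    <tr>',
    '      <td align="left">',
    '        <b>二、教學大綱</b>',
    '      </td>',
    '    </tr>',
    '  </table>',
    '  <table>',
    '    <tr><td><p style="mso-x:1"><o:p>Word residue</o:p></p></td></tr>',
    '  </table><br />',
    '  <br />',
    '  <p>after</p>',
    '</div>',
]);

// Remove the heading table and everything up to the closing <br /> pair.
$stripped = preg_replace($pattern, '', $page);
echo $stripped, "\n";
```

The pattern depends on the exact line breaks tidy produces with `indent` enabled, so a change in the tidy configuration or in the page layout could stop it from matching.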

testing log

Here's the code for testing:

  $error_classes = [1  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=01UG004&courseGroup=B&deptCode=GU&formS=&classes1=&deptGroup",
                    2  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=02UG001&courseGroup=F&deptCode=GU&formS=&classes1=&deptGroup",
                    3  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=03UG017&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    4  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=03UG024&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
                    5  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=04UG007&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    6  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=04UG017&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    7  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG001&courseGroup=C&deptCode=GU&formS=&classes1=&deptGroup",
                    8  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG007&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    9  => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG011&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
                    10 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=05UG012&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
                    11 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0AUG418&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    12 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0AUG426&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    13 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0HUG223&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    14 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0HUG502&courseGroup=A&deptCode=GU&formS=&classes1=&deptGroup",
                    15 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0HUG640&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    16 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0NUG246&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    17 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0SUG514&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    18 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=0SUG523&courseGroup=&deptCode=GU&formS=&classes1=&deptGroup",
                    19 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=BIU0177&courseGroup=A&deptCode=SU43&formS=&classes1=&deptGroup",
                    20 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=BIU0178&courseGroup=A&deptCode=SU43&formS=3&classes1=&deptGroup",
                    21 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0002&courseGroup=&deptCode=EU07&formS=1&classes1=1&deptGroup",
                    22 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0006&courseGroup=&deptCode=EU07&formS=1&classes1=1&deptGroup",
                    23 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0364&courseGroup=&deptCode=EU07&formS=3&classes1=1&deptGroup",
                    24 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CEU0366&courseGroup=&deptCode=EU07&formS=1&classes1=1&deptGroup",
                    25 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CHU0183&courseGroup=&deptCode=LU20&formS=1&classes1=&deptGroup",
                    26 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CHU0250&courseGroup=&deptCode=LU20&formS=1&classes1=&deptGroup",
                    27 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CHU0286&courseGroup=&deptCode=LU20&formS=1&classes1=&deptGroup",
                    28 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=CLU0026&courseGroup=&deptCode=IU84&formS=4&classes1=&deptGroup",
                    29 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=EAU0184&courseGroup=&deptCode=IU83&formS=1&classes1=&deptGroup",
                    30 => "http://courseap.itc.ntnu.edu.tw/acadmOpenCourse/SyllabusCtrl?year=103&term=2&courseCode=EDU0002&courseGroup=&deptCode=EU00&formS=2&classes1=1&deptGroup"];
  foreach ($error_classes as $index => $test_class)
  {
    $source   = $test_class;
    $config   = ["output-xhtml"      => true,
                 "indent"            => true,
                 "indent-attributes" => true,
                 "numeric-entities"  => true,
                 "bare"              => true,
                 "clean"             => true,
                 "word-2000"         => true,
                 "wrap"              => 0];
    $encoding =  "utf8";
    $tidy     =  new tidy;
    $tidy     -> parseFile($source, $config, $encoding);
    $tidy     -> cleanRepair();
    // Strip the 「二、教學大綱」 (syllabus) section, which carries the Word markup, before parsing.
    $xhtml = preg_replace('/(\s*)<table width="900">\n(\s*)<tr>\n(\s*)<td\salign="left">\n(\s*)<b>二、教學大綱<\/b>\n(\s*)<\/td>\n(\s*)<\/tr>\n(\s*)<\/table>\n(.*)\n(\s*)<\/table><br\s\/>\n(\s*)<br\s\/>/s', '', (string)$tidy);
    $dom = new SimpleXMLElement($xhtml);
    $dom->registerXPathNamespace("xhtml", "http://www.w3.org/1999/xhtml");
    // Fourth cell of the first row in the fourth table.
    $r = $dom->xpath("/xhtml:html/xhtml:body/xhtml:div/xhtml:table[4]/xhtml:tr[1]/xhtml:td[4]");
    echo $index.trim((string)$r[0]);
  }

I'm closing this issue because the bug is solved (albeit with duct tape). Further optimization of the communication with the database is still required for this update run.