RexYuan / courseNTNU

(Discontinued) NTNU course rating catalog.
The Unlicense

Scraper Redesign #20

Closed RexYuan closed 8 years ago

RexYuan commented 8 years ago

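A major discovery by @jaidTw makes it possible to collect more data and to scrape far more efficiently, so the original brute-force approach (see p.s. 1 in the URL GET Request Components Guide) is no longer needed.

The university's system appears to emit its data as JSON first and then render it by some other means. The new approach is to fetch that data directly from the course query in either the course selection system or the academic administration system. Example: for the Department of Computer Science and Information Engineering (資工系) in the first semester of academic year 104, the data obtained from the course selection system and the data obtained from the course offering query system. Both return the same data, but the course selection system's URL is noticeably longer, so the academic administration system is the better choice.

An excerpt from the data just obtained: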

{"acadmTerm":"1","acadmYear":"104","authorizeP":20,"chnName":"程式設計(一)","class1":"","courseCode":"CSU0001","courseGroup":"","courseKind":"半","credit":"3.0","deptCode":"SU47","deptGroup":"","engTeach":"否","formS":"1","insDeptCode":"SU47","limitCountH":50,"moocs":"N","optionCode":"必修","serialNo":"3025","sex_restrict":"","teacher":"蔣宗哲","timeInfo":"三 8-9 公館 理圖807,五 7 公館 理圖807,","v_chn_name":"程式設計(一)","v_class1":"","v_comment":"","v_deptChiabbr":"資工系","v_deptGroup":"","v_error":"","v_is_Full":"","v_limitCourse":"","v_phase":"","v_priority":0,"v_release_time":"","v_reserve_count":0,"v_stage":0,"v_stfseld":0,"v_stfseld_auth":0,"v_stfseld_deal":0,"v_stfseld_exchange":0,"v_stfseld_undeal":0,"v_stfseld_unfull":0},
{"acadmTerm":"1","acadmYear":"104","authorizeP":20,"chnName":"計算機概論","class1":"","courseCode":"CSU0006","courseGroup":"","courseKind":"半","credit":"3.0","deptCode":"SU47","deptGroup":"","engTeach":"否","formS":"1","insDeptCode":"SU47","limitCountH":50,"moocs":"N","optionCode":"必修","serialNo":"3026","sex_restrict":"","teacher":"林順喜","timeInfo":"一 2 公館 B102,四 3-4 公館 B102,","v_chn_name":"計算機概論","v_class1":"","v_comment":"","v_deptChiabbr":"資工系","v_deptGroup":"","v_error":"","v_is_Full":"","v_limitCourse":"","v_phase":"","v_priority":0,"v_release_time":"","v_reserve_count":0,"v_stage":0,"v_stfseld":0,"v_stfseld_auth":0,"v_stfseld_deal":0,"v_stfseld_exchange":0,"v_stfseld_undeal":0,"v_stfseld_unfull":0}

It is already clear from this that the change also resolves #15 and opens up many other possibilities, such as simulated course selection.
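
As a rough sketch of how records like the two above could be consumed, the snippet below splits a `timeInfo` string into one entry per meeting slot. The helper name `parse_time_info` and the output keys are hypothetical, and the layout it assumes (day, period or period range, campus, room, separated by whitespace, with slots separated by commas) is inferred only from the two sample records, so other records may not follow it.

```python
import json


def parse_time_info(time_info):
    """Split a timeInfo string into one dict per meeting slot.

    Assumes each slot looks like "<day> <period or range> <campus> <room>",
    which is what the two sample records show.
    """
    slots = []
    for chunk in time_info.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # the sample strings end with a trailing comma
        day, periods, campus, room = chunk.split(maxsplit=3)
        # "8-9" is a range; a single period such as "7" has no dash.
        first, _, last = periods.partition("-")
        slots.append({
            "day": day,
            "first_period": first,
            "last_period": last or first,
            "campus": campus,
            "room": room,
        })
    return slots


# A trimmed copy of the first sample record (most fields omitted).
record = json.loads(
    '{"courseCode": "CSU0001", "timeInfo": "三 8-9 公館 理圖807,五 7 公館 理圖807,"}'
)
for slot in parse_time_info(record["timeInfo"]):
    print(record["courseCode"], slot)
```

Periods are kept as strings rather than integers because nothing in the excerpt guarantees that every period label is numeric.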

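Changing deptCode is all it takes to request a different department, which shows that the approach is feasible. The codes for every department are already stored in the department table.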
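
To illustrate the idea of varying only `deptCode`, here is a minimal sketch that builds one query URL per department. The base URL is a placeholder, since the real endpoint is not quoted in this thread, and using `acadmYear` and `acadmTerm` as request parameter names is an assumption borrowed from the response fields. Only `deptCode` is confirmed above as something the request accepts.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real endpoint is not given in this thread.
BASE_URL = "https://example.invalid/course-query"


def build_course_query(dept_code, year, term, base_url=BASE_URL):
    """Build a course-list URL for one department.

    acadmYear / acadmTerm are assumed request parameter names.
    """
    params = {"acadmYear": year, "acadmTerm": term, "deptCode": dept_code}
    return f"{base_url}?{urlencode(params)}"


def queries_for_departments(dept_codes, year, term):
    """Map each department code (e.g. rows of the department table) to its URL."""
    return {code: build_course_query(code, year, term) for code in dept_codes}


# SU47 is the deptCode that appears in the sample records.
print(queries_for_departments(["SU47"], "104", "1"))
```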

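Because of this discovery, the existing scraper is no longer needed, and the time required will drop dramatically. Today or tomorrow I will write a simple example scraper and a short guide to reading the JSON, to go along with the database redesign in #19.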

RexYuan commented 8 years ago

New Scraping Guide: a guide whose analysis is not yet complete
new_scrap.php: a scraper that does not yet save to the database

RexYuan commented 8 years ago

As of 67bcb3ccf4439160763b58cb638c57af327bd001, the basic version of the scraper is complete. All that remains is fine-tuning it once the database schema design in #19 is settled.

jaidTw commented 8 years ago

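Regarding the department code request described in the Scraping Guide: the type parameter selects the language of the returned data. For example, type=chn requests Chinese names, while type=eng returns the English department names.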

jaidTw commented 8 years ago

For the department course list request, the remaining unknown parameters are analyzed as follows:

RexYuan commented 8 years ago

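Addendum: the _dc parameter is produced by Ext JS's disableCaching feature. Its purpose is to keep GET requests from being served from a cache. Further reading: "Removing _dc parameter in Ext", "GET, POST, and their relationship to caching" (GET、POST與cache的關係), and "Caching in HTTP". Digging deeper, the value turns out to be Unix (POSIX) time in milliseconds, i.e. the same epoch as Python's time.time(), which returns seconds rather than milliseconds. The _dc=1439021294652 in the example corresponds to 8/8/2015, 4:08:14 PM GMT+8:00.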
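
The timestamp reading above can be checked with a short sketch. The helper names `decode_dc` and `make_dc` are made up for illustration, and the fixed +8 offset is taken from the "GMT+8:00" in the comment rather than from a time zone database.

```python
import time
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
GMT_PLUS_8 = timezone(timedelta(hours=8))  # fixed offset, as quoted in the comment


def decode_dc(dc_ms):
    """Convert a millisecond Unix timestamp (a _dc value) to a GMT+8 datetime."""
    # Integer millisecond arithmetic avoids float rounding.
    return (EPOCH + timedelta(milliseconds=dc_ms)).astimezone(GMT_PLUS_8)


def make_dc():
    """Produce a fresh _dc-style value; time.time() is in seconds, hence * 1000."""
    return int(time.time() * 1000)


# Decodes the value from the example request.
print(decode_dc(1439021294652).strftime("%Y-%m-%d %H:%M:%S %z"))
print(make_dc())
```

A scraper that wants to mimic the browser could append `make_dc()` as `_dc`. Whether the server actually requires the parameter is not established in this thread.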

jaidTw commented 8 years ago

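The meaning of the JSON fields in the course query response, as analyzed from the CofopdlCtrl results, is as follows:

The following information is used only by the English version of the page:

The following parameters do not currently appear to be used: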

jaidTw commented 8 years ago

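As things stand, the scraper needs to be split into two parts. The first part fetches course information and only needs to run once after the courses are announced. The second part fetches live enrollment counts and must be refreshed periodically during the course selection period.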
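
One way to picture this split is the sketch below, which runs a one-time catalog fetch followed by a fixed number of enrollment polls. Everything here is hypothetical: `run_scraper`, the two callables, and the polling interval are placeholders, and the thread does not say how the actual scraper schedules its work.

```python
import time


def run_scraper(fetch_courses, fetch_enrollment, polls, interval_seconds,
                sleep=time.sleep):
    """Run the one-time course fetch, then poll enrollment counts.

    fetch_courses and fetch_enrollment are caller-supplied callables
    (hypothetical stand-ins for the two scraper parts).
    """
    courses = fetch_courses()  # part 1: once, after courses are announced
    snapshots = []
    for i in range(polls):  # part 2: repeated during course selection
        snapshots.append(fetch_enrollment())
        if i < polls - 1:
            sleep(interval_seconds)  # wait between polls, not after the last
    return courses, snapshots


# Stub callables stand in for real network requests.
courses, snapshots = run_scraper(
    fetch_courses=lambda: ["CSU0001", "CSU0006"],
    fetch_enrollment=lambda: {"polled_at": time.time()},
    polls=2,
    interval_seconds=0,
)
print(len(courses), len(snapshots))
```

Passing `sleep` in as a parameter keeps the loop easy to test without real waiting.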

RexYuan commented 8 years ago

I'm closing this issue because, as of c04dec49efdfe5dd01c2d65433f2cb3317cc3c6f, we have a working prototype scraper. The remaining enhancements and adjustments for the second part that @jaidTw mentioned will be handled in #25.