Scraper Redesign - Githubissues

RexYuan commented 8 years ago

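Thanks to a major discovery by @jaidTw, we can now scrape more data and do it more efficiently; as a result, the original brute-force approach (see p.s. 1 in the URL GET Request Components Guide) is no longer needed.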

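The school's system appears to produce the data as JSON first and then render it with some front-end technology. The new approach is to fetch the data directly from the course query of either the course selection system or the academic administration system. Example: for the Department of Computer Science and Information Engineering in the first semester of academic year 104, the data obtained from the course selection system and the data obtained from the course offering query system. Both return the same data, but the course selection system's URL is noticeably longer, so the academic administration system is the better choice.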

An excerpt from the data obtained above:

{"acadmTerm":"1","acadmYear":"104","authorizeP":20,"chnName":"程式設計（一）","class1":"","courseCode":"CSU0001","courseGroup":"","courseKind":"半","credit":"3.0","deptCode":"SU47","deptGroup":"","engTeach":"否","formS":"1","insDeptCode":"SU47","limitCountH":50,"moocs":"N","optionCode":"必修","serialNo":"3025","sex_restrict":"","teacher":"蔣宗哲","timeInfo":"三 8-9 公館 理圖807,五 7 公館 理圖807,","v_chn_name":"程式設計（一）","v_class1":"","v_comment":"","v_deptChiabbr":"資工系","v_deptGroup":"","v_error":"","v_is_Full":"","v_limitCourse":"","v_phase":"","v_priority":0,"v_release_time":"","v_reserve_count":0,"v_stage":0,"v_stfseld":0,"v_stfseld_auth":0,"v_stfseld_deal":0,"v_stfseld_exchange":0,"v_stfseld_undeal":0,"v_stfseld_unfull":0},
{"acadmTerm":"1","acadmYear":"104","authorizeP":20,"chnName":"計算機概論","class1":"","courseCode":"CSU0006","courseGroup":"","courseKind":"半","credit":"3.0","deptCode":"SU47","deptGroup":"","engTeach":"否","formS":"1","insDeptCode":"SU47","limitCountH":50,"moocs":"N","optionCode":"必修","serialNo":"3026","sex_restrict":"","teacher":"林順喜","timeInfo":"一 2 公館 Ｂ102,四 3-4 公館 Ｂ102,","v_chn_name":"計算機概論","v_class1":"","v_comment":"","v_deptChiabbr":"資工系","v_deptGroup":"","v_error":"","v_is_Full":"","v_limitCourse":"","v_phase":"","v_priority":0,"v_release_time":"","v_reserve_count":0,"v_stage":0,"v_stfseld":0,"v_stfseld_auth":0,"v_stfseld_deal":0,"v_stfseld_exchange":0,"v_stfseld_undeal":0,"v_stfseld_unfull":0}
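The `timeInfo` field in the excerpt packs several meetings into one comma-separated string with a trailing comma. The following is a minimal sketch of how it could be split into structured meetings. It assumes every entry has the form "weekday periods campus room" separated by single spaces, as in the two records above; `Meeting` and `parse_time_info` are hypothetical names, and the sample is a trimmed copy of the first record wrapped in a JSON array.

```python
import json
from dataclasses import dataclass


@dataclass
class Meeting:
    weekday: str       # as it appears in the data, e.g. "三"
    first_period: str  # kept as text; period labels may not all be numeric (assumption)
    last_period: str
    campus: str
    room: str


def parse_time_info(time_info: str) -> list[Meeting]:
    """Split a timeInfo string such as "三 8-9 公館 理圖807,五 7 公館 理圖807,"."""
    meetings = []
    for chunk in time_info.split(","):
        chunk = chunk.strip()
        if not chunk:  # the field ends with a trailing comma
            continue
        # Assumed layout: weekday, periods, campus, room (room keeps any extra spaces).
        weekday, periods, campus, room = chunk.split(" ", 3)
        first, _, last = periods.partition("-")
        meetings.append(Meeting(weekday, first, last or first, campus, room))
    return meetings


# Trimmed copy of the first record from the excerpt, wrapped in an array.
SAMPLE = """[
  {"serialNo": "3025", "courseCode": "CSU0001", "chnName": "程式設計（一）",
   "credit": "3.0", "teacher": "蔣宗哲", "limitCountH": 50,
   "timeInfo": "三 8-9 公館 理圖807,五 7 公館 理圖807,"}
]"""

courses = json.loads(SAMPLE)
for course in courses:
    print(course["serialNo"], course["chnName"], parse_time_info(course["timeInfo"]))
```

An entry that does not have all four parts raises `ValueError` here, which seems preferable to silently storing a malformed schedule.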

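It is already clear from this that the discovery also resolves #15, and it opens up many other possibilities, such as simulated course selection.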

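Simply changing deptCode changes which department is requested, which shows that the approach works; the values for all of the departments already exist in the department table.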

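Because of this discovery, the existing scraper is no longer needed, and the time required will drop dramatically. Today or tomorrow I will write a simple example scraper and a short guide to interpreting the JSON, to go along with the database redesign in #19.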

RexYuan commented 8 years ago

New Scraping Guide: a guide whose analysis is not yet complete.
new_scrap.php: a scraper that does not yet store its results in the database.

RexYuan commented 8 years ago

As of 67bcb3ccf4439160763b58cb638c57af327bd001, the basic version of the scraper is complete. All that remains is minor tuning once the database schema design in #19 is settled.

jaidTw commented 8 years ago

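Regarding the department code request mentioned in the Scraping Guide: the type parameter sets the language of the returned data. For example, type=chn requests Chinese names, and changing it to type=eng returns the departments' English names.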

jaidTw commented 8 years ago

For the department course list request, the remaining unknown parameters are analyzed below:

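chn : course name filter.
engTeach : English-taught course filter; its value is Y/N.
moocs : MOOCS filter; its value is Y/N.
remoteCourse : distance learning filter; its value is Y/N.
classCode : class section filter; its values are as follows
- 1 Class A
- 2 Class B
- 3 Class C
- 4 Class D
- 7 Combined undergraduate, master's, and doctoral
- 8 Combined master's and doctoral
- 9 Combined undergraduate and master's
generalCore : general education core area filter; its values are as follows
- 1 Arts and Aesthetics
- 2 Philosophical Thinking and Moral Reasoning
- 3 Civic Literacy and Social Inquiry
- 4 History and Culture
- 5 Mathematical and Scientific Thinking
- 6 Science and Life
- 7 General education (non-core)
- 8 All general education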
The _dc parameter comes from Ext.js's disableCaching feature; when that option is set to true, the parameter is appended automatically to AJAX requests.
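As a rough illustration of how the filters above could be combined into one request, here is a minimal sketch that builds a query string. The parameter names and allowed values come from this thread; `BASE_URL` is a hypothetical placeholder (the real endpoint is in the Scraping Guide), `build_query` is an illustrative name, and leaving a filter out to mean "no filter" is an assumption that has not been checked against the server.

```python
import time
from urllib.parse import urlencode

# Hypothetical placeholder, NOT the real endpoint.
BASE_URL = "https://example.invalid/course-query"

# Allowed values as listed in the comment above.
CLASS_CODES = {1, 2, 3, 4, 7, 8, 9}
GENERAL_CORE_CODES = {1, 2, 3, 4, 5, 6, 7, 8}


def build_query(dept_code, *, chn="", eng_teach=None, moocs=None,
                remote_course=None, class_code=None, general_core=None,
                disable_caching=True):
    """Build a course-list URL; filters left as None are omitted (assumption)."""
    params = {"deptCode": dept_code}
    if chn:
        params["chn"] = chn
    for name, flag in (("engTeach", eng_teach), ("moocs", moocs),
                       ("remoteCourse", remote_course)):
        if flag is not None:
            params[name] = "Y" if flag else "N"
    if class_code is not None:
        if class_code not in CLASS_CODES:
            raise ValueError(f"unknown classCode: {class_code}")
        params["classCode"] = class_code
    if general_core is not None:
        if general_core not in GENERAL_CORE_CODES:
            raise ValueError(f"unknown generalCore: {general_core}")
        params["generalCore"] = general_core
    if disable_caching:
        # Mimics Ext.js disableCaching: current time in milliseconds.
        params["_dc"] = int(time.time() * 1000)
    return BASE_URL + "?" + urlencode(params)


print(build_query("SU47", eng_teach=True, class_code=1, disable_caching=False))
```

Rejecting unknown codes up front keeps typos from turning into requests that quietly return an empty list.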

RexYuan commented 8 years ago

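Addendum: the _dc parameter is produced by Ext.js's disableCaching feature. Its purpose is to keep GET requests from being answered from a cache. For details, see Removing _dc parameter in Ext, The relationship between GET, POST, and cache, and Caching in HTTP. Looking further, the value turns out to be UNIX (POSIX) time in milliseconds. Python's time.time() returns the same clock in seconds, so the equivalent value is int(time.time() * 1000). The _dc=1439021294652 in the example stands for 8/8/2015, 4:08:14 PM GMT+8:00.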
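The conversion can be checked with a few lines of Python. This is a minimal sketch using only the standard library; `make_dc` and `dc_to_datetime` are illustrative names, and the fixed GMT+8 offset is taken from the timestamp quoted above. Note that `time.time()` returns seconds as a float, so it has to be scaled to milliseconds to produce a `_dc`-style value.

```python
import time
from datetime import datetime, timedelta, timezone

TAIPEI = timezone(timedelta(hours=8))  # GMT+8, as in the example above


def make_dc() -> int:
    """Current time as a _dc-style value: milliseconds since the UNIX epoch."""
    return int(time.time() * 1000)


def dc_to_datetime(dc: int, tz: timezone = TAIPEI) -> datetime:
    """Convert a _dc value back to an aware datetime without float rounding."""
    epoch = datetime.fromtimestamp(0, tz=tz)
    return epoch + timedelta(milliseconds=dc)


print(dc_to_datetime(1439021294652).isoformat())
# 2015-08-08T16:08:14.652000+08:00
```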

jaidTw commented 8 years ago

Regarding the meaning of the JSON parameters in the course query response, an analysis based on CofopdlCtrl gives the following:

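acadm_term semester
acadm_year academic year
authorize_p number of authorization-code seats
chn_name course name in Chinese
class_name class section name
classes class section code
comment remarks
counter_exceptAuth total number of students enrolled
course_code course code
course_group group
course_kind full-year / half-year
credit number of credits
dept_chiabbr department name in Chinese
dept_code department code
dept_group department group
eng_name course name in English
eng_teach taught entirely in English
form_s year (grade level)
gender_restrict gender restriction
limit seats open to alliance (partner) schools
limit_count_h enrollment cap
moocs_teach MOOCS
option_code required / elective
restrict prerequisite restriction
rt distance learning
selfTeachName lecture / lab taught by the instructor in person
serial_no course serial number
status course cancelled
teacher teacher name
time_inf time and location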

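The following is only used on the English version of the page:

tname teacher's name in English

The following parameters do not currently appear to be used:

authorize_r [unconfirmed] allowed ratio of authorization codes
authorize_using [deprecated] number of students enrolled with an authorization code (may include pre-selection)
counter [deprecated] number of students enrolled (including authorization codes)
counter_exceptAuth [deprecated] total number of students enrolled (excluding authorization codes). Note: this field now holds the total number of students enrolled
full_flag [unconfirmed] whether the course is full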
deleteQ
dept_group_name
form_s_name
hours
iCounter
precentage
selfTeach
not_choose
cancel
exp_hours
fillcounter
for_query
course_avg
school_avg
send_time
tcode
week_section1
week_section2
week_section3
week
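The names in this comment are snake_case, while the JSON excerpt earlier in the thread uses camelCase keys such as acadmTerm and limitCountH. Here is a minimal sketch, assuming a plain snake_case to camelCase conversion, for checking which listed names line up with keys that actually appear in a response. The key set is copied from that excerpt and the name list is a small subset of the table; `snake_to_camel` is an illustrative helper, and names that do not match (for example time_inf against timeInfo) would need a hand-written mapping.

```python
def snake_to_camel(name: str) -> str:
    """acadm_term -> acadmTerm; only the first letter of each later part changes."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


# Keys copied from the JSON excerpt earlier in the thread (non-"v_" keys only).
RESPONSE_KEYS = {
    "acadmTerm", "acadmYear", "authorizeP", "chnName", "courseCode",
    "courseGroup", "courseKind", "credit", "deptCode", "deptGroup",
    "engTeach", "formS", "limitCountH", "moocs", "optionCode",
    "serialNo", "teacher", "timeInfo",
}

# A few names from the list above, chosen for illustration.
LISTED = ["acadm_term", "acadm_year", "authorize_p", "chn_name", "course_code",
          "limit_count_h", "moocs_teach", "serial_no", "time_inf"]

matched = {n: snake_to_camel(n) for n in LISTED if snake_to_camel(n) in RESPONSE_KEYS}
unmatched = [n for n in LISTED if n not in matched]

print("direct matches:", matched)
print("need manual mapping:", unmatched)
```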

jaidTw commented 8 years ago

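As things stand, the scraper needs to be split into two parts. The first part fetches course information and only needs to run once after the courses are announced. The second part fetches live enrollment counts and has to be refreshed periodically during the course selection period.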
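The two-part split could be organized roughly as follows. This is only a sketch of the control flow: `fetch_courses` and `fetch_enrollment` are hypothetical callables standing in for the real scraping code, and the selection period and polling interval are inputs the caller would have to supply.

```python
import time
from datetime import datetime, timedelta
from typing import Callable, Iterator


def poll_times(start: datetime, end: datetime, interval: timedelta) -> Iterator[datetime]:
    """Yield every polling moment from start to end, inclusive."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = start
    while current <= end:
        yield current
        current += interval


def wait_until(moment: datetime) -> None:
    """Sleep until the given local time; return at once if it has passed."""
    delay = (moment - datetime.now()).total_seconds()
    if delay > 0:
        time.sleep(delay)


def run_scraper(fetch_courses: Callable[[], None],
                fetch_enrollment: Callable[[], None],
                start: datetime, end: datetime, interval: timedelta,
                sleep_until: Callable[[datetime], None] = wait_until) -> None:
    # Part 1: course information, fetched once after courses are announced.
    fetch_courses()
    # Part 2: enrollment counts, refreshed throughout the selection period.
    for moment in poll_times(start, end, interval):
        sleep_until(moment)
        fetch_enrollment()
```

Passing `sleep_until` as a parameter makes it possible to run the loop in a test without actually waiting.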

RexYuan commented 8 years ago

I'm closing this issue because, as of c04dec49efdfe5dd01c2d65433f2cb3317cc3c6f, we have a working prototype scraper. The remaining enhancements and adjustments for the second part of what @jaidTw mentioned will be handled in #25.

RexYuan / courseNTNU

Scraper Redesign #20