comsaint / legco-watch

Parliamentary monitoring for Hong Kong
MIT License

Fail to parse agenda questions #2

Open comsaint opened 9 years ago

comsaint commented 9 years ago

Multiple issues with agenda questions:

comsaint commented 9 years ago

A side note: the RawCouncilQuestion instances saved in the database come from scraping web pages such as http://www.legco.gov.hk/yr13-14/english/counmtg/question/ques1314.htm, which link each question to its respective agenda (and reply), while the issue mentioned above comes from parsing the agendas directly. It seems that these hyperlinks all contain an anchor for a question; maybe this is a good starting point for tackling the issue.
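To illustrate the anchor idea, here is a minimal sketch of splitting such a hyperlink into the agenda URL and the question anchor. The function name `split_question_anchor` and the example link are hypothetical; I have not checked what the real anchor names look like.

```python
from urllib.parse import urldefrag


def split_question_anchor(href):
    """Split a link into (agenda_url, anchor); anchor is None if absent."""
    url, fragment = urldefrag(href)
    return url, (fragment or None)


# Hypothetical link, for illustration only; real agenda URLs and
# anchor names may differ.
example = "http://www.legco.gov.hk/example/agenda.htm#q_1"
agenda_url, anchor = split_question_anchor(example)
print(agenda_url, anchor)
```

If the anchors turn out to be consistent, the anchor could be used to locate the matching question inside the parsed agenda instead of relying on the agenda's own structure.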

comsaint commented 9 years ago

There is a reply link alongside each question on e.g. Legco 13-14. It leads to a well-structured HTML page that contains both the question and the reply. Since we will need to scrape the replies eventually, we may consider moving the creation of question instances there, i.e. scraping both questions and replies from that same page. The drawback is that:

Older questions (from the 2005-2006 session and earlier) do not have such a reply link, so we need to parse the Hansard for them instead. However, since we need to parse the Hansard anyway, this is no extra work.
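A rough sketch of how the source could be chosen per session, assuming the cutoff really is the 2005-2006 session as observed above. The names `question_source` and `LAST_START_YEAR_WITHOUT_REPLY_LINK` are made up for illustration and are not part of the existing code.

```python
# Assumption: sessions starting in 2005 or earlier (i.e. 2005-2006 and
# before) have no reply link, so their questions must come from the Hansard.
LAST_START_YEAR_WITHOUT_REPLY_LINK = 2005


def question_source(session_start_year):
    """Return which document to parse for a session's questions."""
    if session_start_year <= LAST_START_YEAR_WITHOUT_REPLY_LINK:
        return "hansard"
    return "reply_page"


print(question_source(2005))  # session 2005-2006
print(question_source(2013))  # session 2013-2014
```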

comsaint commented 9 years ago

Found a note in raw.models.parsed.QuestionManager.populate() that suggests the same idea as above.