Open comsaint opened 9 years ago
A side note: the `RawCouncilQuestion` instances saved in the database come from scraping webpages such as http://www.legco.gov.hk/yr13-14/english/counmtg/question/ques1314.htm, which contain a link to each question's respective agenda (and reply), while the issue mentioned above comes from parsing the agendas directly. It seems that these hyperlinks all contain an anchor for a question; maybe this is a good starting point for breaking down the issue.
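As a rough illustration of the anchor idea, the sketch below splits a question hyperlink into the agenda URL and the anchor that points at the question. The example URL and the anchor name `q_1` are hypothetical; the real anchor naming on the LegCo pages has not been checked here, and `split_question_link` is not an existing function in the project.

```python
from urllib.parse import urldefrag


def split_question_link(href):
    """Split a question hyperlink into (agenda_url, anchor).

    The anchor is None when the link carries no fragment.
    """
    url, fragment = urldefrag(href)
    return url, (fragment or None)


# Hypothetical link, used only to show the shape of the result.
example = "http://example.org/agenda/cm20131009.htm#q_1"
agenda_url, anchor = split_question_link(example)
```

With the anchor in hand, the agenda parser could look up the matching element in the agenda page instead of guessing question boundaries from the text.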
There is a reply link alongside each question on, e.g., the Legco 13-14 page. It returns a well-structured HTML page that contains both the question and the reply. Since we will need to scrape the replies eventually, we may consider moving the creation of question instances here, i.e. scraping both questions and replies from that same page. The drawback is that older questions (from year 2005-2006 back) do not have such a reply link, so the Hansard has to be parsed for them instead. However, since we need to parse the Hansard anyway, this is no extra work.
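The split between the two sources could be expressed as a small dispatch on the session year. This is only a sketch under the assumption stated above, that sessions up to and including 2005-2006 lack reply links; the function name and the returned labels are made up for illustration.

```python
# Assumption from this thread: session 2005-2006 and earlier have no reply link.
LAST_SESSION_WITHOUT_REPLY_LINK = 2005  # start year of session 2005-2006


def question_source(session_start_year: int) -> str:
    """Return which source to scrape for a session, keyed by its start year."""
    if session_start_year <= LAST_SESSION_WITHOUT_REPLY_LINK:
        return "hansard"
    return "reply_page"
```

For example, `question_source(2013)` selects the reply page for session 2013-2014, while `question_source(2005)` falls back to the Hansard.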
Found a note in `raw.models.parsed.QuestionManager.populate()`, which shares my idea above.
Multiple issues with agenda questions:
`responders` in the model `RawCouncilQuestion`.