IsaakDeMaio / regents_study_tool


Reader doesn't parse the last question in a doc #2

Open cproctor opened 2 years ago

cproctor commented 2 years ago

From discord:

re: 73: `document` is a Beautiful Soup object representing the `<html>` tag, which contains the entire HTML document. `document.body` is the `<body>` tag, which contains all the page content. `contents` is a list of a tag's direct children (see https://beautiful-soup-4.readthedocs.io/en/latest/#contents-and-children). So the for-loop on #73 gives us all the top-level tags under `<body>`, one by one. This works because I looked at the structure of the HTML doc produced from the DOCX, and each question, collection of choices, and answer is contained in a tag directly under `<body>`. You can see this using the browser developer tools.

So as I go through the for-loop on #73, I know each time that I'm dealing with either a question, choices, or an answer, and I need to do the right thing depending on which it is.

One complication: very occasionally, a question spans two or more tags. I believe this happens when the question starts, then presents a block of math, and then has more text. So there are situations where I've started reading a question and need to keep reading it on the next tag. Therefore, I can only know a question is finished when another question starts, or (just thinking about this now) when the answers start. The first question is the one case where a new question starting doesn't mean an old question has finished, so on line 75 I check whether `question` is defined before adding it to `questions`.

Just realizing why the last question doesn't get added: when the answers start (line 80), we cheerfully start taking care of answers but forget to add the final question to `questions`. So line 81 should add the final question to `questions`.
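To make the bug concrete, here is a minimal sketch of the loop logic described above. It is not the repo's actual reader: it skips the HTML entirely and assumes each top-level tag has already been classified with a hypothetical label (`"question"`, `"more"`, `"choices"`, or `"answer"`). The function name `parse_blocks` and the dict layout are also made up for illustration. The important part is the flush of the pending question when the first answer shows up.

```python
def parse_blocks(blocks):
    """Group pre-classified (kind, text) blocks into questions and answers.

    The kind labels are hypothetical stand-ins for whatever checks the
    real reader applies to each top-level tag:
      "question" - a new question starts
      "more"     - more text belonging to the current question
      "choices"  - the choices for the current question
      "answer"   - an entry in the answers section
    """
    questions = []
    answers = []
    question = None      # the question currently being read, if any
    in_answers = False   # becomes True once the answers section starts

    for kind, text in blocks:
        if kind == "question":
            # A new question means the previous one (if any) is finished.
            if question is not None:
                questions.append(question)
            question = {"text": text, "choices": None}
        elif kind == "more" and question is not None:
            # The question continues on a later tag (e.g. after math).
            question["text"] += " " + text
        elif kind == "choices" and question is not None:
            question["choices"] = text
        elif kind == "answer":
            if not in_answers:
                in_answers = True
                # The fix: the start of the answers also finishes the
                # last question, so add it before handling answers.
                if question is not None:
                    questions.append(question)
                    question = None
            answers.append(text)

    # Extra safety net (not part of the fix described in this issue):
    # keep a trailing question if the doc has no answers section.
    if question is not None:
        questions.append(question)
    return questions, answers


blocks = [
    ("question", "1. What is 2 + 2?"),
    ("choices", "(1) 3  (2) 4"),
    ("question", "2. Solve"),
    ("more", "for x."),
    ("choices", "(1) 1  (2) 2"),
    ("answer", "1. (2)"),
    ("answer", "2. (1)"),
]
questions, answers = parse_blocks(blocks)
```

Without the flush inside the `"answer"` branch (and without the safety net at the end), `questions` would hold only the first question for this input, which matches the behavior reported in this issue.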

cproctor commented 2 years ago

I just added a test to check that the correct number of questions is parsed (bd44047)