earlng / academic-pdf-scrap

Code that scrapes the contents of the PDF papers submitted for NeurIPS 2020
MIT License

Capturing page number #16

Closed earlng closed 3 years ago

earlng commented 3 years ago

Describe the bug: These are all examples of the new code pulling in a "9" at the end, which should be the page number. If there's a way to get the code not to pull in the page number, that'd be great.

To Reproduce: Refer to these files:

  1. 0332d694daab22e0e0eaf7a5e88433f9
  2. 0415740eaa4d9decbc8da001d3fd805f
  3. 066f182b787111ed4cb65ed437f0855b

Expected behavior: Ignore the "9" if possible.

Related to #10

earlng commented 3 years ago

Looking at 0415740eaa4d9decbc8da001d3fd805f, this is the relevant XML:

      <section class="DoCO:Section">
        <h1 class="DoCO:SectionTitle" id="114" page="9" column="1">13 Broader impacts</h1>
        <region class="DoCO:TextChunk" id="115" page="9" column="1">Our work accelerates the simulation of mechanical meta-materials, and could lead to methods for accelerated simulation of other PDEs. More efficient materials design could have impact on a wide variety of downstream applications, such as soft robotics, structural engineering, biomedical engineering, and many more. Due to the incredibly wide variety of applications which might make use of advances in material design–every physical man-made object makes use of this science–it is difficult to precisely assess impact. However, we believe that meta-material driven advances in soft robotics and structural/biomedical engineering are likely to have a range of positive effects.</region>
        <outsider class="DoCO:TextBox" type="page_nr" id="116" page="9" column="1">9</outsider>
      </section>

So the cause is this line:

<outsider class="DoCO:TextBox" type="page_nr" id="116" page="9" column="1">9</outsider>

It is technically text, so it satisfies the logic introduced by the resolution of #10.
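
To make the cause concrete, here is a minimal sketch that reproduces the behaviour on an abbreviated copy of the XML above. It assumes the files are parsed with Python's standard `xml.etree.ElementTree`; `collect_text` is a hypothetical stand-in for the repo's filter, not the actual code. It joins `itertext()` before testing for emptiness, since `itertext()` returns an iterator rather than a string.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the section quoted above (region text shortened).
SECTION_XML = """
<section class="DoCO:Section">
  <h1 class="DoCO:SectionTitle" id="114" page="9" column="1">13 Broader impacts</h1>
  <region class="DoCO:TextChunk" id="115" page="9" column="1">Our work accelerates the simulation of mechanical meta-materials.</region>
  <outsider class="DoCO:TextBox" type="page_nr" id="116" page="9" column="1">9</outsider>
</section>
"""


def collect_text(section):
    """Hypothetical stand-in for the filter: keep TextChunk and TextBox children."""
    pieces = []
    for smaller in section:
        text = "".join(smaller.itertext()).strip()
        if text and smaller.attrib.get("class") in ("DoCO:TextChunk", "DoCO:TextBox"):
            pieces.append(text)
    return pieces


section = ET.fromstring(SECTION_XML.strip())
pieces = collect_text(section)
print(pieces[-1])  # prints "9": the page number passes the class check
```

The `outsider` element has class `DoCO:TextBox` and non-empty text, so a filter keyed only on class and text lets it through.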

One solution is to change the filters:

if smaller.itertext() != "" and (smaller.attrib["class"] == "DoCO:TextChunk" or smaller.attrib["class"] == "DoCO:TextBox"):

and add:

and smaller.type != "page_nr"

earlng commented 3 years ago

> and smaller.type != "page_nr"

This proposed method won't work because smaller.type (or an equivalent) isn't going to be found on every entry. However, I have found that smaller.attrib is a dictionary, so we can use a key lookup to check whether smaller.attrib has the type attribute. If it exists, we can assume the element is NOT something we want.
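
A small sketch of that dictionary lookup, assuming the elements come from Python's standard `xml.etree.ElementTree` (the thread does not say which parser is used). The `region` element here is a made-up example with shortened text.

```python
import xml.etree.ElementTree as ET

page_nr = ET.fromstring(
    '<outsider class="DoCO:TextBox" type="page_nr" id="116" page="9" column="1">9</outsider>'
)
chunk = ET.fromstring(
    '<region class="DoCO:TextChunk" id="115" page="9" column="1">Some text</region>'
)

# XML attributes live in the .attrib dictionary, not as Python attributes.
page_nr_has_type = "type" in page_nr.attrib   # True
chunk_has_type = "type" in chunk.attrib       # False

# .get() returns None instead of raising KeyError when the key is absent.
chunk_type = chunk.attrib.get("type")         # None

# With ElementTree, an element has no Python attribute named "type" at all.
chunk_has_python_attr = hasattr(chunk, "type")  # False

print(page_nr_has_type, chunk_has_type, chunk_type, chunk_has_python_attr)
```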

earlng commented 3 years ago

so in the actual filters, I can put in:

...and "type" not in child.attrib

But this requires me to assume that the elements holding the broader impact statements never carry a type attribute.
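
One way to avoid that assumption, sketched here rather than taken from the repo, is to reject only elements whose type is exactly "page_nr" and keep any other typed element. `is_statement_text` is a hypothetical helper, the parser is assumed to be `xml.etree.ElementTree`, and the `type="footnote"` value is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

WANTED_CLASSES = {"DoCO:TextChunk", "DoCO:TextBox"}


def is_statement_text(child):
    """Hypothetical filter: keep text-bearing chunks, drop only page numbers."""
    has_text = "".join(child.itertext()).strip() != ""
    wanted_class = child.attrib.get("class") in WANTED_CLASSES
    # .get() returns None when there is no type attribute, so this is safe
    # for elements without one.
    is_page_number = child.attrib.get("type") == "page_nr"
    return has_text and wanted_class and not is_page_number


page_nr = ET.fromstring('<outsider class="DoCO:TextBox" type="page_nr">9</outsider>')
chunk = ET.fromstring('<region class="DoCO:TextChunk">Statement text</region>')
# "footnote" is an invented type value, used only to show a typed element surviving.
typed_box = ET.fromstring('<outsider class="DoCO:TextBox" type="footnote">A note</outsider>')

results = [is_statement_text(e) for e in (page_nr, chunk, typed_box)]
print(results)  # [False, True, True]
```

The trade-off is that any other unwanted type value would need to be added to the check explicitly.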

earlng commented 3 years ago

If the BIS (broader impact statement) is spread across more than one page, the new code doesn't pull what follows on the next page (because of the tag, I suppose). Is there any way not to lose what comes after a page number?

  1. 48f7d3043bc03e6c48a6f0ebc0f258a8
  2. 05f971b5ec196b8c65b75d2ef8267331
  3. 1bda4c789c38754f639a376716c5859f

earlng commented 3 years ago

If we are willing to include the page numbers in the extracted BIS I can drop the conditionals that force the script to stop once it encounters (specifically) a page number.

earlng commented 3 years ago

> If we are willing to include the page numbers in the extracted BIS I can drop the conditionals that force the script to stop once it encounters (specifically) a page number.

@paulsedille is willing to accept page numbers included in the BIS so I will remove the logic that precludes them.
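
For reference, a sketch of how the three behaviours discussed in this thread differ: stopping at a page number, skipping it, and keeping it. The XML is a made-up two-page example, the `extract` function and its mode names are hypothetical, and the parser is assumed to be `xml.etree.ElementTree`. Real files may split a statement across pages differently.

```python
import xml.etree.ElementTree as ET

# Invented example: a statement that continues after a page-number element.
TWO_PAGE_XML = """
<section class="DoCO:Section">
  <region class="DoCO:TextChunk" page="9">First half of the statement</region>
  <outsider class="DoCO:TextBox" type="page_nr" page="9">9</outsider>
  <region class="DoCO:TextChunk" page="10">second half on the next page.</region>
</section>
"""


def extract(section, on_page_number):
    """Join child text; on_page_number is 'stop', 'skip', or 'keep' (hypothetical modes)."""
    pieces = []
    for child in section:
        if child.attrib.get("type") == "page_nr":
            if on_page_number == "stop":
                break      # everything after the page number is lost
            if on_page_number == "skip":
                continue   # drop the number, keep reading
        pieces.append("".join(child.itertext()).strip())
    return " ".join(pieces)


section = ET.fromstring(TWO_PAGE_XML.strip())
stopped = extract(section, "stop")
skipped = extract(section, "skip")
kept = extract(section, "keep")
print(stopped)
print(skipped)
print(kept)
```

Under these assumptions, "stop" returns only the first half, "skip" returns both halves without the "9", and "keep" corresponds to the outcome accepted above, with the page number left in the text.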