earlng / academic-pdf-scrap

Code that scrapes the contents of the PDF papers submitted for NeurIPS 2020
MIT License

Impact statement is split between multiple tags #10

Closed: earlng closed this issue 3 years ago

earlng commented 3 years ago

Describe the bug After the Broader Impact Statement (BIS) title, the BIS content is split across multiple tags (e.g. several <region> elements), but the code only scrapes the first tag, not the entire BIS content. This happens often when the BIS content is split into separate paragraphs.

To Reproduce Papers to look into:

  1. b704ea2c39778f07c617f6b7ce480e9e
  2. 33a854e247155d590883b93bca53848a
  3. b460cf6b09878b00a3e1ad4c72344ccd
  4. 460191c72f67e90150a093b4585e7eb4
  5. 2290a7385ed77cc5592dc2153229f082 (this is an XML tagging error, but scraping the entire <section> would fix it)

Expected behavior Grab the entirety of the impact statement, not just the first portion that happens to immediately follow the h1 tag.

Suggested Fix Review the use of itertext()? Or, instead of scraping only the content after the h1, we could pull the entire section that contains the h1 (though this would also scrape the title into the statement itself).
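As a rough sketch of the second suggestion (the function name here is hypothetical, and this assumes the XML loads with the standard library's ElementTree), joining itertext() over the whole <section> picks up every paragraph, at the cost of also including the section title:

```python
import xml.etree.ElementTree as ET

def extract_impact_statement(xml_string):
    """Return the full text of the section titled 'Broader Impact'.

    Sketch only: joins itertext() over the entire <section>, so
    multi-paragraph statements are not truncated. Note that the
    section title itself is included in the output.
    """
    root = ET.fromstring(xml_string)
    for section in root.iter("section"):
        h1 = section.find("h1")
        if h1 is not None and (h1.text or "").strip() == "Broader Impact":
            # itertext() yields the text of every descendant in
            # document order, including text inside <xref> elements.
            return " ".join(t.strip() for t in section.itertext() if t.strip())
    return ""
```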

earlng commented 3 years ago

Taking these as examples, here are the relevant XML snippets:

b704ea2c39778f07c617f6b7ce480e9e:

      <section class="DoCO:Section">
        <h1 class="DoCO:SectionTitle" id="215" page="9" column="1">Broader Impact</h1>
        <region class="DoCO:TextChunk" id="216" page="9" column="1">We hope that this work will prove useful to the Continual Learning (CL) scientific community as it is fully reproducible and includes:</region>
        <region class="DoCO:TextChunk" id="217" confidence="possible" page="9" column="1">• a clear and extensive comparison of the state of the art on multiple datasets; • Dark Experience Replay (DER), a simple baseline that outperforms all other methods while maintaining a limited memory footprint.</region>
        <region class="DoCO:TextChunk" id="218" page="9" column="1">As revealed by the analysis in Section 5, DER also proves to be better calibrated than a simple Experience Replay baseline, which means that it could represent a useful starting point for the study of CL decision-making applications where an overconfident model would be detrimental. We especially hope that the community will benefit from the introduction of MNIST-360, the first evaluation protocol adhering to the General Continual Learning scenario. The latter has been recently proposed to describe the requirement of a CL system that can be applied to real-world problems. Widespread adoption of our protocol (or new ones of similar design) can close the gap between the current CL studies and practical AI systems. Due to the abstract nature of MNIST-360 (it only contains digits), we believe that ethical and bias concerns are not applicable.</region>
      </section>

33a854e247155d590883b93bca53848a:

      <section class="DoCO:Section">
        <h1 class="DoCO:SectionTitle" id="185" page="10" column="1">Broader Impact</h1>
        <region class="unknown" id="186" page="10" column="1">Who may benefit from this research</region>
        <region class="DoCO:TextChunk" id="192" page="10" column="1">Our research presumably has quite broad impact, since discovery of mathematical patterns in data is a central problem across the natural and social sciences. Given the ubiquity of linear regression in research, one might expect that there will significant benefits to a broad range of researchers also from more general symbolic regression once freely available algorithms get sufficiently good. <marker type="block"/> Although it is possible that some numerical modelers could get their jobs automated away by symbolic regression, we suspect that the main effect of our method, and future tools building on it, will instead be that these people will simply discover better models than today.<marker type="block"/> Pareto-optimal symbolic regression can be viewed as an extreme form of lossy data compression that uncovers the simplest possible model for any given accuracy. To the extent that overfitting can exacerbate bias, such model compression is expected to help. Moreover, since our method produces closed-form mathematical formulas that have excellent interpretability compared to black-box neural networks, they make it easier for humans to interpret the computation and pass judgement on whether it embodies unacceptable bias. This interpretability also reduces failure risk. Another risk is automation bias, whereby people overly trust a formula from symbolic regression when they extrapolate it into an untested domain. This could be exacerbated if symbolic regression promotes scientific laziness and enfeeblement, where researchers fit phenomenological models instead of doing the work of building models based on first principles. Symbolic regression should inform but not replace traditional scientific discovery. Although the choice of basis functions biases the discoverable function class, our method is agnostic to basis functions as long as they are mostly differentiable. 
The greatest potential risk associated with this work does not stem from it failing but from it suc- ceeding: accelerated progress in symbolic regression, modularity discovery and its parent discipline, program synthesis, could hasten the arrival of artificial general intelligence, which some authors have argued humanity still lacks the tools to manage safely [<xref ref-type="bibr" rid="R5" id="191" class="deo:Reference">5</xref>]. On the other hand, our work may help accelerate research on intelligible intelligence more broadly, and powerful future artificial intelligence is probably safer if we understand aspects of how it works than if it is an inscrutable black box.</region>
        <region class="unknown" id="188" page="10" column="1">Who may be put at disadvantage from this research</region>
        <region class="unknown" id="190" page="10" column="1">Risk of bias, failure and other negative outcomes</region>
        <outsider class="DoCO:TextBox" type="page_nr" id="193" page="10" column="1">10</outsider>
      </section>

So it seems that in these cases, if I can find the <section> as a whole, I can create another loop that goes through each of the text entries until it reaches the end of the section (i.e. until it hits </section>), appending the text to the variable impact_statement_text each time.

I would need to experiment with this to see whether </section> is something that is actually observable in the code.
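For what it's worth, if the script parses the file with ElementTree (an assumption; the function name below is made up), there is no literal </section> token to watch for: iterating over the section element's children simply stops when the section ends. The appending loop described above could look like:

```python
import xml.etree.ElementTree as ET

def collect_section_text(section):
    """Concatenate the text of every child region of a <section>.

    Iterating over `section` visits its direct children in document
    order and stops automatically at </section>; no sentinel is needed.
    """
    impact_statement_text = ""
    for child in section:
        if child.tag == "h1":
            continue  # skip the section title itself
        text = "".join(child.itertext()).strip()
        if text:
            impact_statement_text += text + " "
    return impact_statement_text.strip()
```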

earlng commented 3 years ago
  1. b704ea2c39778f07c617f6b7ce480e9e
  2. 33a854e247155d590883b93bca53848a
  3. b460cf6b09878b00a3e1ad4c72344ccd
  4. 460191c72f67e90150a093b4585e7eb4
  5. 2290a7385ed77cc5592dc2153229f082 (this is an XML tagging error, but scraping the entire <section> would fix it)

Out of these entries, 1 & 4 benefitted from the be5e2a9 commit. The main reason is that the other three examples are not properly parsed as XML, so they do not satisfy the condition:

if child.itertext() != "" and (child.attrib["class"] == "DoCO:TextChunk" or child.attrib["class"] == "DoCO:TextBox")
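As an aside, child.itertext() returns a generator, so comparing it to "" is always true; the emptiness check presumably intended needs the joined text. A defensive rewrite of the condition (a sketch; the helper name is made up) also uses attrib.get() so regions without a class attribute don't raise a KeyError:

```python
import xml.etree.ElementTree as ET

def is_text_region(child):
    """True if `child` is a non-empty TextChunk or TextBox region.

    Joins itertext() so the emptiness test inspects actual text, and
    uses attrib.get() so elements lacking a class attribute (or with
    class="unknown", as in the examples above) are skipped cleanly.
    """
    text = "".join(child.itertext()).strip()
    return text != "" and child.attrib.get("class") in ("DoCO:TextChunk", "DoCO:TextBox")

# example: a class="unknown" region is rejected rather than crashing
assert not is_text_region(ET.fromstring('<region class="unknown">Who may benefit</region>'))
```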

@paulsedille have a look.

earlng commented 3 years ago

0169cf885f882efd795951253db5cdfb

The proposed solution doesn't seem to have fixed this particular XML.

earlng commented 3 years ago


The script is not iterating through this one: it makes an initial capture at child, but then when it goes down to smaller inside the while signal == 1 loop, the if statement there sets signal = 0 and the loop exits.

earlng commented 3 years ago

This is the relevant XML:

      <section class="DoCO:Section">
        <h1 class="DoCO:SectionTitle" id="164" page="10" column="1">Broader Impact</h1>
        <region class="DoCO:TextChunk" id="167" page="10" column="1">The introduction of benchmark new datasets has historically fueled progress in machine learning. However, recent large-scale datasets are immense, which makes ML research over-reliant on massive computation cycles. This biases research advances towards fast and computationally-intensive methods, leading to economic and environmental impacts. Economically, reliance on massive computation cycles creates disparities between researchers and organizations with limited computation and hardware budgets, versus those with more resources. With regard to the environment and climate change, recent analyses [<xref ref-type="bibr" rid="R48" id="165" class="deo:Reference">48</xref>, <xref ref-type="bibr" rid="R28" id="166" class="deo:Reference">28</xref>] conclude that the greenhouse gases emitted from training very large-scale models, such as transformers, can be equivalent to 10 years’ worth of individual emissions. While these impacts are not unique to ML research, they are reflective of systemic challenges that ML research could address by developing widely available, high-quality, and diverse datasets that mimic real-world concepts and are conscious of computational hurdles. The Synbols synthetic dataset generator is designed to explore the behavior of learning algorithms and discover brittleness to certain configurations. It is expected to stimulate the improvement of core properties in vision algorithms:</region>
        <region class="DoCO:TextChunk" id="168" confidence="possible" page="10" column="1">• Identifiability of latent properties • Reusability of machine learning models in new environments • Robustness to changes in the data distribution • Better performance on small datasets</region>
        <region class="DoCO:TextChunk" id="171" page="10" column="1">We designed Synbols with diverse and flexible features (i.e., font diversity, different languages, lower resolution for POC, flexible texture of the background and foreground) and we demonstrated its versatility by reporting numerous findings across 5 different machine learning paradigms. Its characteristics address the economic and environmental challenges and we expect this tool to have a transversal impact on the field of machine vision with potential impact on the field of machine learning. Its broader impacts, both positive and negative, will be guided by the progress that it stimulates in the machine vision community, where potential applications range from autonomous weapons to climate change mitigation. Nevertheless, we hope that our work will help develop more robust and reliable machine learning algorithms while reducing the amount of greenhouse gas emissions from training by way of its smaller scale. Economic impact: reliance on computing-intensive environments creates disparities, especially for researchers and organizations with limited computation and hardware budgets. Environmental impact: Recent analyses [<xref ref-type="bibr" rid="R45" id="169" class="deo:Reference">45</xref>, <xref ref-type="bibr" rid="R28" id="170" class="deo:Reference">28</xref>] conclude that the greenhouse gases emitted from training very large-scale models, such as transformers, can be equivalent to 10 years’ worth of individual emissions</region>
      </section>

The reason is that there is a smaller entry here, <xref ref-type="bibr" rid="R48" id="165" class="deo:Reference">48</xref>, that gets captured when we loop down to smaller.

Since it exists but doesn't satisfy:

if smaller.itertext() != "" and (smaller.attrib["class"] == "DoCO:TextChunk" or smaller.attrib["class"] == "DoCO:TextBox") and not("type" in smaller.attrib.keys()):

The loop exits.

The solution could be to add a guard that skips inline bibliography references. Using attrib.get() avoids a KeyError on elements that have no ref-type attribute:

elif smaller.attrib.get("ref-type") == "bibr":
     continue
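A minimal sketch of the loop with that skip in place (the function name is hypothetical, and this assumes ElementTree-style iteration over the section's descendants):

```python
import xml.etree.ElementTree as ET

def iter_text_regions(section):
    """Yield text regions of a <section>, skipping inline <xref> markers.

    attrib.get() returns None instead of raising KeyError when the
    attribute is absent, so ordinary regions pass through unharmed.
    """
    for smaller in section.iter():
        if smaller.attrib.get("ref-type") == "bibr":
            continue  # inline bibliography reference, not a region
        if smaller.attrib.get("class") in ("DoCO:TextChunk", "DoCO:TextBox"):
            yield smaller
```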