earlng opened this issue 3 years ago
Sample from 9f1d5659d5880fb427f6e04ae500fc25:

```xml
</section>
<section class="deo:Conclusion">
  <h1 class="DoCO:SectionTitle" id="187" page="9" column="1">5 CONCLUSION</h1>
  <region class="DoCO:TextChunk" id="188" page="9" column="1">We incorporated discrete variables into neural variational without analytically integrating them out or reparametrizing and running stochastic backpropagation on them. Applied to a recurrent, neural topic model, our approach maintains the discrete topic assignments, yielding a simple yet effective way to learn thematic vs. non-thematic (e.g., syntactic) word dynamics. Our approach outperforms previous approaches on language understanding and other topic modeling measures.</region>
  <outsider class="DoCO:TextBox" type="page_nr" id="189" page="9" column="1">9</outsider>
  <section class="DoCO:Section">
    <h2 class="DoCO:SectionTitle" id="190" confidence="possible" page="10" column="1">Broader Impact</h2>
    <region class="DoCO:TextChunk" id="191" page="10" column="1">The model used in this paper is fundamentally an associative-based language model. While NVI does provide some degree of regularization, a significant component of the training criteria is still a cross-entropy loss. Further, this paper’s model does not examine adjusting this cross-entropy component. As such, the text the model is trained on can influence the types of implicit biases that are transmitted to the learned syntactic component (the RNN/representations h t ), the learned thematic component (the topic matrix β and topic modeling variables θ and z t ), and the tradeoff(s) between these two dynamics ( l t and ρ ). For comparability, this work used available datasets that have been previously published on. Based upon this work’s goals, there was not an in-depth exploration into any biases within those datasets. Note however that the thematic vs. non-thematic aspect of this work provides a potential avenue for examining this. While we treated l t as a binary indicator, future work could involve a more nuanced, gradient view. Direct interpretability of the individual components of the model is mixed. While the topic weights can clearly be inspected and analyzed directly, the same is not as easy for the RNN component. While lacking a direct way to inspect the overall decoding model, our approach does provide insight into the thematic component. We view the model as capturing thematic vs. non-thematic dynamics, though in keeping with previous work, for evaluation we approximated this with non-stopword vs. stopword dynamics. Within topic modeling stop-word handling is generally considered simply a preprocessing problem (or obviated by neural networks), we believe that preprocessing is an important element of a downstream user’s workflow that is not captured when preprocessing is treated as a stand-alone, perhaps boring step. We argue that future work can examine how different elements of a user’s workflow, such as preprocessing, can be handled with our approach.</region>
  </section>
  <section class="DoCO:Section">
    <h2 class="DoCO:SectionTitle" id="192" confidence="possible" page="10" column="1">Acknowledgements and Funding Disclosure</h2>
    <region class="DoCO:TextChunk" id="193" confidence="possible" page="10" column="1">We would like to thank members and affiliates of the UMBC CSEE Department, including Edward Raff, Cynthia Matuszek, Erfan Noury and Ahmad Mousavi. We would also like to thank the anonymous reviewers for their comments, questions, and suggestions. Some experiments were conducted on the UMBC HPCF. We’d also like to thank the reviewers for their comments and suggestions. This material is based in part upon work supported by the National Science Foundation under Grant No. IIS-1940931. This material is also based on research that is in part supported by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003. The U.S.Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of the Air Force Research Laboratory (AFRL), DARPA, or the U.S. Government.</region>
  </section>
</section>
```
And 7a43ed4e82d06a1e6b2e88518fb8c2b0:

```xml
<section class="DoCO:Section">
  <h2 class="DoCO:SectionTitle" id="161" confidence="possible" page="9" column="1">7 Broader Impact</h2>
  <region class="DoCO:TextChunk" id="164" page="9" column="1">Learning a best arm is a fundamental, well-studied, problem largely because it captures the most basic experimental question: given n treatments, each with a stochastic outcome, which one is best? Cancer treatment, drug discovery, gene detection, manufacturing quality assurance, financial fraud detection, spam detection, software testing, are all examples of direct applications of learning a best arm. Providing dramatically faster algorithms for these applications without compromising on guarantees will impact areas well outside machine learning. Specifically, this work provides an algorithm that is 6000 times faster than the state-of-the-art. In addition to asymptotic bounds that converge as the number of arms grows to what we conjecture is the optimal sample complexity, we provide dramatic speedups for any number of arms. The result is an extremely efficient simple algorithm for learning <marker type="page" number="10"/><marker type="block"/> a best arm with strong theoretical guarantees that can be used across all applications of learning a best arm. The simplicity and speed of the algorithms presented here are such that any practitioner can implement them and accelerate their experimental setup immediately. We trust that we will see immediate action across a broad set of application domains.</region>
  <outsider class="DoCO:TextBox" type="page_nr" id="163" page="9" column="1">9</outsider>
</section>
```
So in these two cases it doesn't look like there is an `h1` tag at all. One easy fix would be to simply expand the code filter to not only be limited to `h1` but `h2` as well.

Changing `h1` to `h2` may not be so simple, though. Testing it out, it seems that by looping through `root[1][1]` alone we only get `h1` content, not `h2`. We need to go deeper.
The below is a sample from 9f1d5659d5880fb427f6e04ae500fc25:
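For anyone reproducing this, the snippets in this thread assume `root` was obtained roughly as follows (a sketch: the filename is a guess based on the paper hash, and the standard-library `ElementTree` is assumed, though the scraper may use `lxml`):

```python
import xml.etree.ElementTree as ET

# parse the scraped paper XML; the path here is hypothetical
root = ET.parse("9f1d5659d5880fb427f6e04ae500fc25.xml").getroot()
```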
```
>>> for section in root[1][1]:
...     for child in section:
...         print(child.text, child.tag)
...
1 Introduction h1
section
section
section
section
section
section
section
section
section
27 xref
19 xref
40 xref
37 xref
44 xref
29 xref
5 xref
20 xref
8 xref
42 xref
53 xref
45 xref
41 xref
27 xref
15 xref
8 xref
48 xref
51 xref
22 xref
25 xref
24 xref
None marker
None marker
8 xref
None marker
TCNLM h1
TGVAE h1
• : A single-layer LSTM with the same number of units like our method implementation but without any topic modeling, i.e. l t = 0 for all tokens. • [ region
section
5 CONCLUSION h1
We incorporated discrete variables into neural variational without analytically integrating them out or reparametrizing and running stochastic backpropagation on them. Applied to a recurrent, neural topic model, our approach maintains the discrete topic assignments, yielding a simple yet effective way to learn thematic vs. non-thematic (e.g., syntactic) word dynamics. Our approach outperforms previous approaches on language understanding and other topic modeling measures. region
9 outsider
section
section
References h1
ref-list
10 outsider
11 outsider
12 outsider
13 outsider
```
Yes, we need to go one layer deeper than `child`. Sample:
```python
for section in root[1][1]:
    for child in section:
        for smaller in child:  # one level deeper than before
            print(smaller.text)
```
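Rather than hardcoding another level of nesting, a depth-agnostic alternative would be to let `iter()` walk the whole subtree (a minimal sketch, assuming the same `root` as above):

```python
# iter() recursively visits every descendant, so the h2 titles inside
# nested <section> elements are found regardless of depth
for el in root[1][1].iter():
    if el.tag in ("h1", "h2"):
        print(el.tag, el.text)
```

This would also keep working if some paper buries its section titles yet another level down.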
This seems to be partially resolved in 715ab09. Out of the 4 cited examples:

- 9f1d5659d5880fb427f6e04ae500fc25 (contained in `<h2>`)
- 7a43ed4e82d06a1e6b2e88518fb8c2b0 (contained in `<h2>`)
- 4b29fa4efe4fb7bc667c7b301b74d52d (contained in `<h2>`)
- c589c3a8f99401b24b9380e86d939842 (contained in a `<region>` tag, no title)

All except c589c3a8f99401b24b9380e86d939842 now have impact statements. And in this specific case, it seems like there is no `h2` header for the impact statement at all; the statement itself is part of the Conclusion.
@paulsedille have a look 😄
**Describe the bug**
The BIS title is not contained in an `<h1>` tag (e.g. there is a more general “Conclusion” `<h1>` tag, or just bad XML tagging) and therefore it is not scraped.

**To Reproduce**
Papers that this problem occurs in:

- 9f1d5659d5880fb427f6e04ae500fc25 (contained in `<h2>`)
- 7a43ed4e82d06a1e6b2e88518fb8c2b0 (contained in `<h2>`)
- 4b29fa4efe4fb7bc667c7b301b74d52d (contained in `<h2>`)
- c589c3a8f99401b24b9380e86d939842 (contained in a `<region>` tag, no title)

**Expected behavior**
The code should be able to find the impact statement even if it is not in the `h1` tag.

**Proposed fix**
Look for a BIS through `<h2>` tags as well, or simply pull any text content that contains “broader impact”?
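Something along these lines could cover both cases (a rough sketch of the proposed fix, not the repo's actual code; the function name is made up, and the tag names come from the XML samples above):

```python
import xml.etree.ElementTree as ET

BIS_MARKER = "broader impact"

def find_impact_statement(root):
    """Return the element holding the Broader Impact statement, or None.

    First match <h1>/<h2> section titles; then fall back to any <region>
    whose text mentions "broader impact", for papers like
    c589c3a8f99401b24b9380e86d939842 where the statement has no title.
    """
    for el in root.iter():
        if el.tag in ("h1", "h2") and el.text and BIS_MARKER in el.text.lower():
            return el  # the caller can then pull the <region> that follows
    for el in root.iter("region"):
        # itertext() collects text across nested <marker/> elements too
        if BIS_MARKER in "".join(el.itertext()).lower():
            return el
    return None

# usage (path hypothetical):
# root = ET.parse("c589c3a8f99401b24b9380e86d939842.xml").getroot()
# hit = find_impact_statement(root)
```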