allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0

Bnewm0609/layer slice #44

Open bnewm0609 opened 1 year ago

bnewm0609 commented 1 year ago

Begins an implementation of Layer to wrap the List[Entity] and allow more intuitive slicing (e.g. doc.sentences[:3].text instead of [sent.text for sent in doc.sentences[:3]]). This addresses #24.

The new data structure is implemented in papermage/magelib/layer.py and inherits from Python's UserList. The changes needed to integrate Layer were made mainly in the Document data structure.
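For readers skimming the PR, here is a minimal sketch of the idea, not the code in papermage/magelib/layer.py. The Entity dataclass and the attribute fan-out in __getattr__ are assumptions made for illustration; only the UserList base class comes from the description above.

```python
from collections import UserList
from dataclasses import dataclass


@dataclass
class Entity:
    # Hypothetical stand-in for papermage's Entity; only `text` is modeled.
    text: str


class Layer(UserList):
    """Illustrative list wrapper (an assumption, not the PR's implementation)."""

    def __getitem__(self, index):
        # Keep slices wrapped in a Layer so attribute access still works on them.
        result = self.data[index]
        return Layer(result) if isinstance(index, slice) else result

    def __getattr__(self, name):
        # Only reached when normal lookup fails; guard `data` and private names
        # so copying or unpickling cannot recurse before `data` is set.
        if name == "data" or name.startswith("_"):
            raise AttributeError(name)
        # Fan the attribute access out to every wrapped entity.
        return [getattr(entity, name) for entity in self.data]


sentences = Layer(
    [Entity("I am."), Entity("I was."), Entity("You are."), Entity("You were.")]
)
first_three = sentences[:3].text  # replaces the explicit list comprehension
```

In this sketch an integer index still returns a single Entity, so sentences[0].text is a plain string while sentences[:3].text is a list.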

One design decision to consider is what to do with chained access: e.g. doc.pages.paragraphs.sentences.tokens. Currently, each access creates a new layer, so doing the above would create a four-dimensional list. Two consequences of this decision:

  1. To get the first token, you would have to write doc.pages.paragraphs.sentences.tokens[0][0][0][0], which is a bit ugly.
  2. doc.pages.paragraphs.pages does not return the original Layer of pages, which is a bit unintuitive.
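The first consequence can be reproduced with plain lists and dicts. This is a toy model under stated assumptions: the doc structure and the access helper are hypothetical and only mimic "each access adds one level of nesting".

```python
# Toy document: 1 page -> 1 paragraph -> 2 sentences -> tokens (hypothetical data).
doc = {
    "pages": [
        {"paragraphs": [
            {"sentences": [
                {"tokens": ["I", "am", "."]},
                {"tokens": ["I", "was", "."]},
            ]},
        ]},
    ],
}


def access(nested, field, depth):
    """Look up `field` on every element `depth` levels down, preserving shape."""
    if depth == 0:
        return nested[field]
    return [access(child, field, depth - 1) for child in nested]


pages = doc["pages"]                            # 1 level of nesting
paragraphs = access(pages, "paragraphs", 1)     # 2 levels
sentences = access(paragraphs, "sentences", 2)  # 3 levels
tokens = access(sentences, "tokens", 3)         # 4 levels

# One index per chained access is needed to reach a single token.
first_token = tokens[0][0][0][0]
```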

The main question is: "Should chained accessing return the union of all of the entities in a single layer or should it return the entities in the shape of the chained accessing?"

As another example, if the doc is

Paragraph 1: "I am. I was."
Paragraph 2: "You are. You were."

Sentence 1: "I am."
Sentence 2: "I was."
Sentence 3: "You are."
Sentence 4: "You were."

Which should doc.paragraphs.sentences.text return?

# Option 1 - currently implemented
[[["I", "am", "."], ["I", "was", "."]], [["You", "are", "."], ["You", "were", "."]]]

# Option 2
["I", "am", ".", "I", "was", ".", "You", "are", ".", "You", "were", "."]
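To make the trade-off concrete, the sketch below builds both results from the example document using plain Python lists. The flatten helper is a hypothetical illustration, not part of papermage.

```python
# Example document from above: paragraph -> sentence -> token strings.
paragraphs = [
    [["I", "am", "."], ["I", "was", "."]],
    [["You", "are", "."], ["You", "were", "."]],
]


def flatten(nested):
    """Recursively flatten nested lists into one flat list of leaves."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


option_1 = paragraphs           # shape follows the chained access
option_2 = flatten(paragraphs)  # union of all entities in a single list
```

Option 2 can always be derived from Option 1 by flattening, but the paragraph and sentence grouping cannot be recovered from Option 2 alone.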