aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

[Doc] Documentation of Linearizable and its methods, e.g., get_text(config) #322

Open oonisim opened 6 months ago

oonisim commented 6 months ago

The Document class has a get_text(config: TextLinearizationConfig) method, as used in cell 19 of the example "Using Layout Analysis for Text Linearization".

from textractor.data.text_linearization_config import TextLinearizationConfig

# `document` is assumed to be a textractor Document produced earlier in the notebook.
config = TextLinearizationConfig(
    hide_figure_layout=True,
    title_prefix="# ",
    section_header_prefix="## "
)
print(document.get_text(config=config))    # <--- get_text() method

However, it looks like the documentation only covers the get_text_and_words method and not get_text, which is a method of the parent class Linearizable(ABC).
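To illustrate the relationship described above, here is a minimal sketch of how a concrete class can inherit get_text from an abstract parent and how to find the defining class by introspection instead of reading the source. The classes below are hypothetical stand-ins written for this example, not the textractor implementation; the assumption that get_text simply returns the text part of get_text_and_words is mine and should be checked against the library.

```python
from abc import ABC, abstractmethod
import inspect


class Linearizable(ABC):
    """Hypothetical stand-in for the abstract parent class (not textractor's code)."""

    def get_text(self, config=None) -> str:
        """Return only the text part of get_text_and_words() (assumed behavior)."""
        text, _ = self.get_text_and_words(config)
        return text

    @abstractmethod
    def get_text_and_words(self, config=None):
        """Return a (text, words) tuple; every subclass must implement this."""


class Document(Linearizable):
    """Hypothetical stand-in for a concrete entity that only implements the abstract method."""

    def __init__(self, lines):
        self.lines = lines

    def get_text_and_words(self, config=None):
        # `config` is ignored in this sketch; a real implementation would use it.
        text = "\n".join(self.lines)
        return text, text.split()


doc = Document(["Hello world", "second line"])

# get_text is available on Document even though Document never defines it.
print(doc.get_text())

# Walk the method resolution order to find the class that defines get_text.
owner = next(cls for cls in type(doc).__mro__ if "get_text" in vars(cls))
print(owner.__name__)

# The inherited docstring can be read without opening the source file.
print(inspect.getdoc(Document.get_text))
```

The same introspection (walking `__mro__`, or calling `help()` on the method) should work on the real Document class, though it is a workaround and not a substitute for the documentation requested here.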

It would be helpful to have a clear definition and explanation of what Linearizable is and what methods it has, since it is used in the sample code, rather than having to go through the GitHub code to verify what it is.

Belval commented 6 months ago

Linearizable is new as of last week; it is an attempt to unify the entity-level linearization approaches. I agree that we have a documentation blind spot, and I will look into improving it next week.