axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data
Apache License 2.0

Error: Module lines-to-paragraph has unresolved dependencies (words-to-line-new) #493

Closed: philipskokoh closed this issue 4 years ago

philipskokoh commented 4 years ago

Hi,

I followed the guide to run Parsr in a Docker container:

docker pull axarev/parsr
docker run -p 3001:3001 axarev/parsr

However, when I send a request to parse a PDF file, I get this error: "Error: Module lines-to-paragraph has unresolved dependencies (words-to-line-new)" (see screenshot below).

[Screenshot 2020-09-23 at 6:36:15 PM]

This is the config.json I use. I think I may have missed some configuration, or the words-to-line-new module could be missing from the Docker image.

Please advise. Thank you very much.

{
  "version": 0.5,
  "extractor": {
    "pdf": "pdfminer",
    "img": "tesseract",
    "language": ["eng", "fra"]
  },
  "cleaner": [
    "out-of-page-removal",
    [
      "whitespace-removal",
      {
        "minWidth": 0
      }
    ],
    [
      "redundancy-detection",
      {
        "minOverlap": 0.5
      }
    ],
    [
      "table-detection",
      {
        "runConfig": [
          {
            "pages": [],
            "flavor": "lattice"
          }
        ]
      }
    ],
    [
      "header-footer-detection",
      {
        "ignorePages": [],
        "maxMarginPercentage": 15
      }
    ],
    [
      "reading-order-detection",
      {
        "minVerticalGapWidth": 5,
        "minColumnWidthInPagePercent": 15
      }
    ],
    "link-detection",
    [
      "words-to-line",
      {
        "lineHeightUncertainty": 0.2,
        "topUncertainty": 0.4,
        "maximumSpaceBetweenWords": 100,
        "mergeTableElements": false
      }
    ],
    [
      "lines-to-paragraph",
      {
        "tolerance": 0.25
      }
    ],
    "heading-detection",
    "list-detection",
    "page-number-detection",
    "hierarchy-detection",
    [
      "regex-matcher",
      {
        "isCaseSensitive": true,
        "isGlobal": true,
        "queries": [
          {
            "label": "Car",
            "regex": "([A-Z]{2}\\-[\\d]{3}\\-[A-Z]{2})"
          },
          {
            "label": "Age",
            "regex": "(\\d+)[ -]*(ans|jarige)"
          },
          {
            "label": "Percent",
            "regex": "([\\-]?(\\d)+[\\.\\,]*(\\d)*)[ ]*(%|per|percent|pourcent|procent)"
          }
        ]
      }
    ]
  ],
  "output": {
    "granularity": "word",
    "includeMarginals": false,
    "formats": {
      "json": true,
      "text": true,
      "csv": true,
      "markdown": true,
      "pdf": false
    }
  }
}
dafelix42 commented 4 years ago

Hello @philipskokoh!

You need to update your config file because some modules have changed. You have to replace:

[
      "words-to-line",
      {
        "lineHeightUncertainty": 0.2,
        "topUncertainty": 0.4,
        "maximumSpaceBetweenWords": 100,
        "mergeTableElements": false
      }
],

with the module name words-to-line-new, and also replace the module name heading-detection with ml-heading-detection.
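For anyone who wants to apply these two renames to an existing config programmatically, here is a minimal Python sketch. It is not part of Parsr: the helper names `migrate_cleaner` and `migrate_config` are hypothetical, and it assumes the old words-to-line options should simply be dropped (as the replacement described above suggests) rather than carried over to the new module.

```python
import json

# Renames taken from the reply above. Nothing else about the new modules
# (for example, which options they accept) is assumed here.
RENAMES = {
    "words-to-line": "words-to-line-new",
    "heading-detection": "ml-heading-detection",
}


def migrate_cleaner(cleaner):
    """Return a copy of the cleaner list with the old module names replaced."""
    migrated = []
    for entry in cleaner:
        # A cleaner entry is either a bare module name or a [name, options] pair.
        name = entry if isinstance(entry, str) else entry[0]
        if name in RENAMES:
            # Assumption: drop the old options and keep only the new name,
            # instead of guessing which options the new module understands.
            migrated.append(RENAMES[name])
        else:
            migrated.append(entry)
    return migrated


def migrate_config(config):
    """Return a shallow copy of the config with a migrated cleaner list."""
    updated = dict(config)
    updated["cleaner"] = migrate_cleaner(config.get("cleaner", []))
    return updated


# Small in-memory example using module names from the config in this issue.
example = {
    "cleaner": [
        "out-of-page-removal",
        ["words-to-line", {"topUncertainty": 0.4}],
        ["lines-to-paragraph", {"tolerance": 0.25}],
        "heading-detection",
    ]
}
print(json.dumps(migrate_config(example), indent=2))
```

To use it on a real file, load config.json with `json.load`, pass the result to `migrate_config`, and write it back with `json.dump`.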

Regards

philipskokoh commented 4 years ago

It works! Thanks @dafelix42! I think it would be good if the repo documented the list of standard modules.