CrucibleSDS / tungsten

A material safety data sheet (MSDS) parser.
https://pypi.org/project/tungsten-sds/
MIT License

Sigma-Aldrich: Group paragraph elements or subsections together into a single `HierarchyNode` #5

Open GreenCappuccino opened 2 years ago

GreenCappuccino commented 2 years ago

Currently, because of the minimum line spacing setting, some real paragraphs are split into one subsection per line. For example:

" ... "
        {
          "title": "OTHER_OTHER",
          "items": [],
          "raw_title": "The branding on the header and/or footer of this document may temporarily not visually \n"
        },
        {
          "title": "OTHER_OTHER",
          "items": [],
          "raw_title": "match the product purchased as we transition our branding. However, all of the \n"
        },
        {
          "title": "OTHER_OTHER",
          "items": [],
          "raw_title": "information in the document regarding the product remains unchanged and matches the \n"
        },
        {
          "title": "OTHER_OTHER",
          "items": [],
          "raw_title": "product ordered. For further information please contact mlsbranding@sial.com. \n"
        },
" ... "

Instead, these lines should be processed into one subsection with a TEXT item inside:

" ... "
        {
          "title": "OTHER_OTHER",
          "items": [
            {
              "type": "TEXT",
              "data": "The branding on the header and/or footer of this document may temporarily not visually match the product purchased as we transition our branding. However, all of the information in the document regarding the product remains unchanged and matches the product ordered. For further information please contact mlsbranding@sial.com."
            }
          ]
        },
" ... "
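
As a rough illustration of the transformation described above, here is a minimal post-processing sketch. It works on plain dicts shaped like the JSON output shown, not on tungsten's actual `HierarchyNode` class, and the function name `merge_split_paragraphs` is hypothetical. It also assumes that every run of consecutive `OTHER_OTHER` nodes with empty `items` belongs to the same paragraph, which may not hold for every document.

```python
def merge_split_paragraphs(nodes):
    """Merge runs of title-only OTHER_OTHER nodes into one subsection.

    `nodes` is a list of dicts shaped like the JSON output above.
    This is a hypothetical helper, not part of the tungsten API.
    """
    merged = []
    run = []  # raw_title fragments collected for the current run

    def flush():
        # Turn the collected fragments into a single TEXT item.
        if run:
            text = " ".join(fragment.strip() for fragment in run)
            merged.append({
                "title": "OTHER_OTHER",
                "items": [{"type": "TEXT", "data": text}],
            })
            run.clear()

    for node in nodes:
        # A fragment is an OTHER_OTHER node that carries only a raw_title.
        is_fragment = node.get("title") == "OTHER_OTHER" and not node.get("items")
        if is_fragment:
            run.append(node.get("raw_title", ""))
        else:
            flush()
            merged.append(node)
    flush()
    return merged
```

Stripping each fragment before joining removes the trailing `" \n"` seen in the `raw_title` values, so the joined text has single spaces between the original lines.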

GreenCappuccino commented 2 years ago

I'm thinking of solving this problem by avoiding it altogether. There's somewhat of a pattern to which subsections are marked with boldface and which are not. If we can detect the starts and ends of subsections that way, we could just treat all lines under that group as a single paragraph. This would require some modification of the initial hierarchy generator though, so I'm likely going to look at solving other problems for now.
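
To make the boldface idea concrete, here is a minimal sketch under some loud assumptions: the input is a list of `(text, is_bold)` pairs in reading order, which is a hypothetical representation and not how the hierarchy generator currently exposes lines, and every bold line is taken to start a new subsection. The function name `group_by_bold_headings` is also made up for illustration.

```python
def group_by_bold_headings(lines):
    """Group lines into subsections, starting a new one at each bold line.

    `lines` is a list of (text, is_bold) pairs in reading order.
    Both the input shape and this helper are hypothetical.
    """
    groups = []
    current = None
    for text, is_bold in lines:
        text = text.strip()
        if not text:
            continue  # skip blank lines
        if is_bold:
            # A bold line opens a new subsection.
            current = {"raw_title": text, "body": []}
            groups.append(current)
        elif current is None:
            # Body text before any bold line gets an untitled subsection.
            current = {"raw_title": "", "body": [text]}
            groups.append(current)
        else:
            current["body"].append(text)

    # Join each group's body lines into a single TEXT paragraph.
    return [
        {
            "raw_title": group["raw_title"],
            "items": (
                [{"type": "TEXT", "data": " ".join(group["body"])}]
                if group["body"]
                else []
            ),
        }
        for group in groups
    ]
```

One limitation of this sketch is that a bold heading wrapped over two lines would be split into two subsections, so a real implementation would likely need to merge adjacent bold lines first.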