clowder-framework / extractors-s2orc-pdf2text

Extractor to convert pdf to text
Apache License 2.0

16 grobid sentence coordinates #17

Closed: minump closed this 11 months ago

minump commented 1 year ago

Use GROBID to get sentence coordinates. Call /api/processFulltextDocument with the parameters "teiCoordinates" and "segmentSentences". This returns a TEI XML file with the coordinates and page number of each sentence. E.g.:

<facsimile>
    <surface n="1" ulx="0.0" uly="0.0" lrx="595.0" lry="842.0"/>
    <surface n="2" ulx="0.0" uly="0.0" lrx="595.0" lry="842.0"/>
    <surface n="3" ulx="0.0" uly="0.0" lrx="595.0" lry="842.0"/>
    <surface n="4" ulx="0.0" uly="0.0" lrx="595.0" lry="842.0"/>
</facsimile>
<s coords="2,58.42,766.40,483.20,10.83;2,37.66,777.93,210.02,10.83;2,16.09,791.28,105.28,10.73">each group, number of participants (denominator) included in each analysis and whether the analysis was by original assigned groups No corresponding text</s>
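The request described above could be sketched roughly as follows. This is only an illustration, not the extractor's actual code: the server address `http://localhost:8070`, the helper names `build_form_fields` and `fetch_tei`, and the choice of the third-party `requests` library are all assumptions. The form sends `segmentSentences=1` and asks for coordinates on the `s` (sentence) elements.

```python
# Assumption: a GROBID server reachable at this address (hypothetical).
GROBID_URL = "http://localhost:8070"


def build_form_fields(coordinate_elements=("s",)):
    """Build the form fields that request sentence segmentation and coordinates.

    One "teiCoordinates" field is emitted per TEI element name.
    """
    fields = [("segmentSentences", "1")]
    fields += [("teiCoordinates", element) for element in coordinate_elements]
    return fields


def fetch_tei(pdf_path, base_url=GROBID_URL, timeout=120):
    """POST a PDF to /api/processFulltextDocument and return the TEI XML text."""
    import requests  # third-party; imported here so the helper above works without it

    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            f"{base_url}/api/processFulltextDocument",
            files={"input": pdf},          # the PDF is sent as the "input" form part
            data=build_form_fields(),
            timeout=timeout,
        )
    response.raise_for_status()
    return response.text
```

Calling `fetch_tei("paper.pdf")` (with a hypothetical file name) would return the TEI XML as a string, provided a server is running at the assumed address.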

<s coords="1,53.80,194.57,58.71,9.29"> indicates one bounding box with attributes page=1, x=53.80, y=194.57, w=58.71, h=9.29.

<s coords="1,43.95,722.74,206.93,8.92;1,36.86,733.74,29.95,8.92;1,66.80,733.96,4.08,4.46;1,70.88,733.75,180.00,8.92"> The above @coords XML attribute introduces 4 bounding boxes that together define the area of the sentence (typically because it spans several lines).
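A minimal sketch of how such a `coords` value could be split into bounding boxes, following the page, x, y, w, h layout described above. The names `BoundingBox` and `parse_coords` are hypothetical and not part of the extractor.

```python
from typing import List, NamedTuple


class BoundingBox(NamedTuple):
    page: int
    x: float
    y: float
    w: float
    h: float


def parse_coords(coords: str) -> List[BoundingBox]:
    """Split a coords attribute into boxes: ';' separates boxes, ',' separates fields."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ';'
        page, x, y, w, h = chunk.split(",")
        boxes.append(BoundingBox(int(page), float(x), float(y), float(w), float(h)))
    return boxes


single = parse_coords("1,53.80,194.57,58.71,9.29")
multi = parse_coords(
    "1,43.95,722.74,206.93,8.92;1,36.86,733.74,29.95,8.92;"
    "1,66.80,733.96,4.08,4.46;1,70.88,733.75,180.00,8.92"
)
print(single[0].page, len(multi))
```

A sentence that wraps over several lines produces one box per line fragment, so the caller would typically draw or merge all boxes returned for that sentence.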

minump commented 1 year ago

This will be version 0.7

  "name": "pdf2text-extractor",
  "version": "0.7",

minump commented 11 months ago

Pushed to hub.ncsa.illinois.edu/clowder/extractors-pdf2text:0.7.0. Deployed to consort instance. Working fine. The tei.xml file contains sentence coordinates.
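One way to check that a tei.xml file contains sentence coordinates is sketched below, using only the Python standard library. The function name `sentences_with_coords` is hypothetical, and the sample document is a hand-made stand-in that assumes the usual TEI namespace; it is not real extractor output.

```python
import xml.etree.ElementTree as ET


def sentences_with_coords(tei_xml: str):
    """Return (text, coords) pairs for every <s> element that has a coords attribute."""
    root = ET.fromstring(tei_xml)
    found = []
    for element in root.iter():
        # Drop any "{namespace}" prefix so the check works with or without a namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        coords = element.get("coords")
        if local_name == "s" and coords:
            text = "".join(element.itertext()).strip()
            found.append((text, coords))
    return found


# Hand-made sample, only for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s coords="1,53.80,194.57,58.71,9.29">First sentence.</s>
    <s>Sentence without coordinates.</s>
  </p></body></text>
</TEI>"""

pairs = sentences_with_coords(SAMPLE_TEI)
print(len(pairs), "sentence(s) with coordinates")
```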

minump commented 11 months ago

Merging to main.