Closed by minump 11 months ago
This will be version 0.7
"name": "pdf2text-extractor",
"version": "0.7",
Pushed to hub.ncsa.illinois.edu/clowder/extractors-pdf2text:0.7.0. Deployed to consort instance. Working fine. The tei.xml file contains sentence coordinates.
Merging to main.
Grobid process to get sentence coordinates: use /api/processFulltextDocument with the parameters "teiCoordinates" and "segmentSentences". This returns a TEI XML file with each sentence's coordinates and page number. E.g.:
<s coords="1,53.80,194.57,58.71,9.29">
indicates one bounding box with attributes page=1, x=53.80, y=194.57, w=58.71, h=9.29.
<s coords="1,43.95,722.74,206.93,8.92;1,36.86,733.74,29.95,8.92;1,66.80,733.96,4.08,4.46;1,70.88,733.75,180.00,8.92">
The @coords XML attribute above introduces 4 bounding boxes that together define the area of the sentence (typically because it spans several lines).
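For anyone consuming the tei.xml downstream, here is a minimal sketch of reading the coords format described above. It is not the extractor's actual code: the names `Box`, `parse_coords` and `sentence_boxes` are hypothetical, and it assumes the sentences are `<s>` elements in the standard TEI namespace (`http://www.tei-c.org/ns/1.0`).

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Tuple

# Assumption: Grobid TEI output uses the standard TEI namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


class Box(NamedTuple):
    """One bounding box: page number, top-left x/y, width, height."""
    page: int
    x: float
    y: float
    w: float
    h: float


def parse_coords(coords: str) -> List[Box]:
    """Split a coords attribute into boxes.

    Boxes are separated by ';' and each box is 'page,x,y,w,h'.
    """
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ';'
        page, x, y, w, h = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(w), float(h)))
    return boxes


def sentence_boxes(tei_xml: str) -> List[Tuple[str, List[Box]]]:
    """Return (sentence text, boxes) for every <s> that carries coords."""
    root = ET.fromstring(tei_xml)
    results = []
    for s in root.iter(TEI_NS + "s"):
        coords = s.get("coords")
        if coords:
            text = "".join(s.itertext()).strip()
            results.append((text, parse_coords(coords)))
    return results
```

Calling `parse_coords` on the second example above yields four `Box` entries, all on page 1, one per line fragment of the sentence.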