allenai / mmda

multimodal document analysis
Apache License 2.0

Grobid Full Text Parser / Existing Document Augmenter #204

Closed geli-gel closed 1 year ago

geli-gel commented 1 year ago

Adds a full-text Grobid parser. Usage is shown in examples/grobid_full_text_parser/grobid_bibs.ipynb.

parser.parse takes an optional output directory in which to save the raw Grobid XML. The parser takes a config made up of the basic grobid_client setup plus one additional requested coordinates element, "head", which contains the authors section. (I had to look in Chrome dev tools to find that out, because I could only get the authors in the GUI and not from the client, couldn't find the info documented anywhere, and couldn't read the Grobid JS code!)
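For reference, here is a minimal sketch of what such a config could look like. The key names and the default coordinates list are assumptions based on the sample config that ships with the Grobid Python client, and build_grobid_config is a hypothetical helper, not something in this PR. The only part taken from the description above is the extra "head" entry.

```python
import json

# Assumed default coordinates list, as in the Grobid Python client's sample config.
BASE_COORDINATES = ["persName", "figure", "ref", "biblStruct", "formula", "s"]


def build_grobid_config(server="http://localhost:8070"):
    """Hypothetical helper: basic client settings plus the extra "head" element."""
    return {
        "grobid_server": server,  # assumed key name and local default
        "batch_size": 1000,
        "sleep_time": 5,
        "timeout": 60,
        # "head" is the additional element that carries the authors section.
        "coordinates": BASE_COORDINATES + ["head"],
    }


config = build_grobid_config()
# Serialized form, e.g. for writing to a config.json file.
config_json = json.dumps(config, indent=4)
```

Building the list from the defaults plus "head" keeps the extra element easy to spot if the defaults change.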

This is the first part of https://github.com/allenai/scholar/issues/35751. The second part will take these raw Grobid bibs and update the SpanGroups so that the SpanGroup spans include the tokens that indicate a bib's number. It will also use the SpanGroup span boxes, which essentially consolidate the Grobid boxes: Grobid boxes are drawn per line, whereas the SpanGroup span boxes are drawn per text block, which is closer to the output of the bib detector that Grobid is replacing here.
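To illustrate the consolidation idea, here is a rough sketch that merges per-line boxes into one enclosing box per page. It is not the mmda SpanGroup or Box API. LineBox, parse_coords, and consolidate are hypothetical names, and the coords string layout ("page,x,y,width,height" entries separated by ";") is an assumption about Grobid's coordinate attribute.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LineBox:
    """Hypothetical per-line box: top-left corner plus width and height."""
    page: int
    x: float
    y: float
    w: float
    h: float


def parse_coords(coords: str) -> List[LineBox]:
    """Parse an assumed 'page,x,y,w,h;page,x,y,w,h' coordinate string."""
    boxes = []
    for part in coords.split(";"):
        if not part.strip():
            continue
        page, x, y, w, h = part.split(",")
        boxes.append(LineBox(int(page), float(x), float(y), float(w), float(h)))
    return boxes


def consolidate(boxes: List[LineBox]) -> Dict[int, LineBox]:
    """Merge all line boxes on each page into a single enclosing box."""
    merged: Dict[int, LineBox] = {}
    for b in boxes:
        cur = merged.get(b.page)
        if cur is None:
            merged[b.page] = b
            continue
        left = min(cur.x, b.x)
        top = min(cur.y, b.y)
        right = max(cur.x + cur.w, b.x + b.w)
        bottom = max(cur.y + cur.h, b.y + b.h)
        merged[b.page] = LineBox(b.page, left, top, right - left, bottom - top)
    return merged


# Two lines of one bib entry on page 1 collapse into one block-level box.
lines = parse_coords("1,10.0,20.0,100.0,8.0;1,10.0,30.0,60.0,8.0")
block = consolidate(lines)[1]
```

Keeping one box per page means a bib entry that breaks across pages still ends up with a sensible box on each page.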

Next steps for this Grobid parser are to make it expose more of the Grobid output.
