BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Implement `ArticleParser` subclass to parse TEI XML files #373

Closed FrancescoCasalegno closed 3 years ago

FrancescoCasalegno commented 3 years ago

Scope

As discussed here, TEI XML is the output format produced by GROBID when parsing PDF inputs.

We must be able to load those PDFs into our database, so we need an `ArticleParser` subclass that can parse the TEI XML output of GROBID.
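To make the scope concrete, here is a minimal sketch of what such a parser could look like, using only the standard library. The class name `TeiXmlParserSketch`, its properties, and the sample document are hypothetical and do not reflect the actual `ArticleParser` interface. The XPath expressions assume the header/body layout that GROBID typically emits, which should be checked against real output.

```python
import xml.etree.ElementTree as ET

# TEI documents declare this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


class TeiXmlParserSketch:
    """Hypothetical parser; not the project's actual ArticleParser interface."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    @staticmethod
    def _text(element):
        # Join all nested text nodes and collapse runs of whitespace.
        return " ".join("".join(element.itertext()).split())

    @property
    def title(self):
        node = self.root.find(".//tei:titleStmt/tei:title", TEI_NS)
        return self._text(node) if node is not None else ""

    @property
    def abstract(self):
        # Assumed location of the abstract paragraphs in the TEI header.
        path = ".//tei:profileDesc/tei:abstract//tei:p"
        return [self._text(p) for p in self.root.findall(path, TEI_NS)]

    @property
    def paragraphs(self):
        """Yield (section heading, paragraph text) pairs from the body."""
        for div in self.root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
            head = div.find("tei:head", TEI_NS)
            section = self._text(head) if head is not None else ""
            for p in div.findall("tei:p", TEI_NS):
                yield section, self._text(p)


# Made-up sample document, only for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>A sample article</title></titleStmt></fileDesc>
    <profileDesc><abstract><div><p>Short abstract.</p></div></abstract></profileDesc>
  </teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First <hi>body</hi> paragraph.</p></div>
  </body></text>
</TEI>"""

parser = TeiXmlParserSketch(SAMPLE)
body = list(parser.paragraphs)
```

Every lookup passes the TEI namespace explicitly, because un-prefixed paths match nothing in a namespaced document.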

Alternative Solutions

The approach described above assumes that the ingestion of PDFs into our database operates in two separate steps. First, we use GROBID to convert PDF -> TEI XML. Second, we use our TeiXmlParser to parse the TEI XML.

An alternative approach would be to do all of this in a single parser, `PdfParser`, which would internally first call GROBID's PDF to TEI XML converter and then parse the resulting TEI XML.

I personally prefer to keep the two steps separate. First, it seems more consistent with the other parsers we implemented. Second, we may (not sure about this, though) have TEI XML files to parse that were not generated by GROBID.

But I am happy to change my mind if there are good reasons, so whoever tackles this PR can propose either solution.
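For whoever picks this up, the two designs can be compared side by side in a rough sketch. The class bodies below are hypothetical stand-ins that only extract the title. The `convert` hook and the `fake_convert` stub are assumptions that replace the real GROBID call so the snippet runs without a server.

```python
import xml.etree.ElementTree as ET
from typing import Callable

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


class TeiXmlParser:
    """Hypothetical stand-in: only extracts the title."""

    def __init__(self, tei_xml: str):
        self.root = ET.fromstring(tei_xml)

    @property
    def title(self) -> str:
        node = self.root.find(".//tei:titleStmt/tei:title", TEI_NS)
        return "".join(node.itertext()).strip() if node is not None else ""


class PdfParser:
    """Single-parser alternative: convert internally, then delegate."""

    def __init__(self, pdf_bytes: bytes, convert: Callable[[bytes], str]):
        # `convert` is a hypothetical hook standing in for the GROBID call.
        self._tei = TeiXmlParser(convert(pdf_bytes))

    @property
    def title(self) -> str:
        return self._tei.title


def fake_convert(pdf_bytes: bytes) -> str:
    """Stub converter so the sketch runs without a GROBID server."""
    return (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
        "<titleStmt><title>Converted article</title></titleStmt>"
        "</fileDesc></teiHeader></TEI>"
    )


# Two-step design: conversion and parsing are separate calls.
two_step_title = TeiXmlParser(fake_convert(b"%PDF-1.4")).title
# Single-parser design: conversion happens inside the parser.
one_step_title = PdfParser(b"%PDF-1.4", convert=fake_convert).title
```

In the two-step design, `TeiXmlParser` also works on TEI XML that did not come from GROBID. The single-parser design hides the intermediate format from the caller.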

FrancescoCasalegno commented 3 years ago

Potentially closed by #406?

We need to check this, since TEI XML may differ slightly from JATS XML.

EmilieDel commented 3 years ago

Three possibilities to address this ticket:

Our parse command `bbs database parse` goes from the raw format to the final format (JSON for now). We should maybe keep this property.

FrancescoCasalegno commented 3 years ago

@EmilieDel

Create a new command `bbs database convert_pdf` to convert from PDF to TEI XML.

Sounds like a good idea. If we all agree, can you please create an Issue out of this and add it to the current Sprint?

Stannislav commented 3 years ago

@EmilieDel

Create a new command `bbs database convert_pdf` to convert from PDF to TEI XML.

Sounds like a good idea. If we all agree, can you please create an Issue out of this and add it to the current Sprint?

Just created #480