FrancescoCasalegno closed this issue 3 years ago
Potentially closed by #406?
We need to check this: TEI XML may be slightly different from JATS XML.
Three possibilities to address this ticket:
Create a new command bbs database convert_pdf to convert from PDF to TEI XML. Our parse command, bbs database parse, goes from the raw format to the final format (JSON for now). We should maybe keep this property.
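To make the conversion step concrete, here is a minimal sketch of what such a command could do under the hood. It assumes a GROBID server is already running and reachable over its REST service (to my understanding, the full-text endpoint is /api/processFulltextDocument, the file goes in the form field named input, and the default port is 8070). The function names encode_pdf_upload and convert_pdf are hypothetical, and how the real bbs command would wrap this is not decided here.

```python
import urllib.request
import uuid
from typing import Tuple


def encode_pdf_upload(pdf_bytes: bytes, field: str = "input") -> Tuple[bytes, str]:
    """Build a multipart/form-data body carrying a single PDF file.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="article.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_bytes: bytes, grobid_url: str = "http://localhost:8070") -> str:
    """Send a PDF to a running GROBID server and return the TEI XML as text.

    Assumption: grobid_url points at a reachable GROBID instance.
    """
    body, content_type = encode_pdf_upload(pdf_bytes)
    request = urllib.request.Request(
        f"{grobid_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With this split, convert_pdf only produces TEI XML, and bbs database parse keeps its property of going from a raw format to the final format.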
@EmilieDel
Create a new command bbs database convert_pdf to convert from PDF to TEI XML.
Sounds like a good idea. If we all agree, can you please create an Issue out of this and add it to the current Sprint?
Just created #480
Scope
As discussed here, TEI XML is the output format produced by GROBID when parsing PDF inputs. We must be able to load those PDFs into our databases, so we need an ArticleParser subclass that is able to parse the TEI XML outputs of GROBID.
Alternative Solutions
The approach described above assumes that the ingestion of PDFs into our database operates in two separate steps. First, we use GROBID to convert PDF -> TEI XML. Second, we use our TeiXmlParser to parse the TEI XML.
An alternative approach would be to do all of this in a single parser, PdfParser, which would internally first call GROBID's PDF to TEI XML converter and then parse that TEI XML output.
I personally tend to prefer keeping the two steps separate: first, because it seems more consistent with the other parsers we implemented, and second, because we may have (not sure about this, though) TEI XML files to parse that were not generated by GROBID.
But I am happy to change my mind if there are good reasons, so whoever tackles this PR can propose either solution.
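To compare the two options side by side, here is a rough sketch, not a proposal for the final API. The ArticleParser class below is a hypothetical stand-in (the real base class in our code has its own interface), the title and paragraphs members are only illustrative, and convert is an assumed callable that wraps the GROBID conversion. The element paths rely on the standard TEI namespace; the exact layout of GROBID output should be checked against real files.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Callable, Iterator

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}


class ArticleParser(ABC):
    """Hypothetical stand-in for the project's ArticleParser base class."""

    @property
    @abstractmethod
    def title(self) -> str:
        ...

    @abstractmethod
    def paragraphs(self) -> Iterator[str]:
        ...


class TeiXmlParser(ArticleParser):
    """Option 1: parse TEI XML only, whether or not GROBID produced it."""

    def __init__(self, tei_xml: str) -> None:
        self.root = ET.fromstring(tei_xml)

    @property
    def title(self) -> str:
        node = self.root.find(".//tei:titleStmt/tei:title", NS)
        return "".join(node.itertext()).strip() if node is not None else ""

    def paragraphs(self) -> Iterator[str]:
        body = self.root.find(".//tei:body", NS)
        if body is None:
            return
        # Walk every <p> element below <body>, flattening inline markup.
        for p in body.iter(f"{{{TEI_NS}}}p"):
            yield "".join(p.itertext()).strip()


class PdfParser(TeiXmlParser):
    """Option 2: a single parser that converts the PDF, then parses the result."""

    def __init__(self, pdf_bytes: bytes, convert: Callable[[bytes], str]) -> None:
        # `convert` is assumed to call GROBID and return TEI XML text.
        super().__init__(convert(pdf_bytes))


SAMPLE_TEI = (
    f'<TEI xmlns="{TEI_NS}"><teiHeader><fileDesc><titleStmt>'
    "<title>Sample article</title></titleStmt></fileDesc></teiHeader>"
    "<text><body><div><p>First paragraph.</p></div></body></text></TEI>"
)

# Two-step flow: conversion happens elsewhere, the parser only sees TEI XML.
two_step = TeiXmlParser(SAMPLE_TEI)

# Single-parser flow: a fake converter stands in for the GROBID call.
one_step = PdfParser(b"%PDF-1.4", convert=lambda _pdf: SAMPLE_TEI)
```

In this sketch PdfParser is only a thin wrapper around TeiXmlParser, so starting with the two-step design would not prevent adding the single-parser variant later.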