TEIC / Stylesheets

TEI XSL Stylesheets
GROBID TEI to bibtex #289

Open stzellerhoff opened 7 years ago

stzellerhoff commented 7 years ago

Hi,

I am using a Dockerized version of GROBID to extract references from scientific PDFs. The available output format is TEI (direct BibTeX output is not possible with the Docker version). Converting it with the teitobibtex script produces incorrect BibTeX files. Does anyone know how to solve this problem? Thank you!

Stephan

martindholmes commented 7 years ago

Could you provide samples of the TEI and the bibtex output, and describe the features which are incorrect?

stzellerhoff commented 7 years ago

Hi!

GROBID TEI output:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML">

<text>
    <front/>
    <body/>
    <back>
        <div>
            <listBibl>
Endocardial and epicardial radiofrequency ablation of ventricular tachycardia associated with dilated cardiomyopathy: the importance of low-voltage scars KSoejima WgStevenson JlSapp ApSelwyn GCouper LmEpstein J Am Coll Cardiol 43
            </listBibl>
        </div>
    </back>
</text>

BibTeX result:

@article{b0, title={{Endocardial and epicardial radiofrequency ablation of ventricular tachycardia associated with dilated cardiomyopathy: the importance of low-voltage scars}}, author={{KSoejima} and {WgStevenson} and {JlSapp} and {ApSelwyn} and {GCouper} and {LmEpstein}}, journal={{J Am Coll Cardiol}}43, year={} }

Author forenames and surnames are merged; issue, pages, and publication year are empty. Thank you!

stuartyeates commented 7 years ago

Looks like the name problem is in https://github.com/TEIC/Stylesheets/blob/dev/bibtex/convertbib.xsl about line 176. It's looking for a tei:author/tei:surname and is presented with a tei:author/tei:persName/tei:surname instead.

It should be pretty straightforward to add conditional clauses for that (and the same for editor above), but I'm not in front of a suitable machine to code and test that right now.
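To make the mismatch concrete, here is a minimal Python sketch of the conditional lookup described above. It is not the stylesheet's code: the sample XML is a hypothetical reconstruction of the two nestings (the markup in the pasted sample was lost), and `author_names` and `bibtex_author_field` are illustrative names only.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: the first author uses the nested form
# (author/persName/surname), the second the flat form (author/surname).
SAMPLE = """
<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <author>
      <persName><forename type="first">K</forename><surname>Soejima</surname></persName>
    </author>
    <author>
      <forename>W</forename><surname>Stevenson</surname>
    </author>
  </analytic>
</biblStruct>
"""

def author_names(bibl):
    """Return 'Surname, Forenames' strings for either author nesting."""
    names = []
    for author in bibl.iterfind(".//tei:author", TEI_NS):
        # Use the nested persName when present, otherwise the author itself.
        holder = author.find("tei:persName", TEI_NS)
        if holder is None:
            holder = author
        surname = holder.findtext("tei:surname", default="", namespaces=TEI_NS).strip()
        forenames = [f.text.strip() for f in holder.iterfind("tei:forename", TEI_NS) if f.text]
        given = " ".join(forenames)
        names.append(f"{surname}, {given}" if given else surname)
    return names

def bibtex_author_field(bibl):
    """Join the names with ' and ', as BibTeX expects."""
    return "author={" + " and ".join(author_names(bibl)) + "}"

print(bibtex_author_field(ET.fromstring(SAMPLE)))
# author={Soejima, K and Stevenson, W}
```

The same fallback (try `persName` first, then the parent element) would apply to `editor`; in XSLT it would presumably become an extra match or `xsl:choose` branch rather than a Python `if`.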

The date is encoded purely as an attribute rather than as XML text, which is not really the TEI way, but could be handled about line 86 of the same file.
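A rough Python sketch of that date handling, assuming the attribute in question is `when` on `tei:date` (the thread does not name it) and using placeholder years that do not come from the sample. `year_of` is a hypothetical helper, not something in the stylesheets.

```python
import re
import xml.etree.ElementTree as ET

def year_of(date_el):
    """Return the first four-digit year from a tei:date element, or ''."""
    if date_el is None:
        return ""
    # Prefer the element's text; fall back to the @when attribute
    # when the element is empty.
    value = (date_el.text or "").strip() or date_el.get("when", "")
    match = re.search(r"\d{4}", value)
    return match.group(0) if match else ""

# Placeholder dates for illustration only.
attr_only = ET.fromstring(
    '<date xmlns="http://www.tei-c.org/ns/1.0" type="published" when="2001-05"/>'
)
text_only = ET.fromstring('<date xmlns="http://www.tei-c.org/ns/1.0">1999</date>')

print("year={" + year_of(attr_only) + "}")  # year={2001}
print("year={" + year_of(text_only) + "}")  # year={1999}
```

Checking the text first keeps existing text-based input working unchanged, and the attribute is consulted only when the element is empty.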

cheers stuart

stzellerhoff commented 7 years ago

Hi!

I gave it a try but could not get the output right, probably because I don't know exactly how to...