OpenArabicPE / convert_tei-to-bibliographic-data

Generate bibliographic (MODS, BibTeX, CSV, JSON) data for <div>s in TEI XML files using <biblStruct> as an intermediary format

Trying to convert from TEI <bibl> to <biblStruct> results in XSLTParseError #1

Closed cboulanger closed 1 month ago

cboulanger commented 2 months ago

Using the XSLTs in this repo, I am trying to convert TEI documents containing <bibl> elements to <biblStruct> for further processing.

Here is an example of my data:

<?xml version="1.0" ?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>10.1111_1467-6478.00057</title>
      </titleStmt>
      <publicationStmt>
        <publisher>mpilhlt</publisher>
      </publicationStmt>
      <sourceDesc>
        <p>10.1111_1467-6478.00057</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The article text is not part of this document</p>
    </body>
    <note n="1" type="footnote" place="bottom">
      <bibl>
        <author>
          <persName>
            <forename>A.</forename>
            <surname>Phillips</surname>
          </persName>
        </author>
        , ‘
        <title level="a">Citizenship and Feminist Politics</title>
        ’ in
        <title level="m">Citizenship</title>
        , ed.
        <editor>
          <persName>
            <forename>G.</forename>
            <surname>Andrews</surname>
          </persName>
        </editor>
        (
        <date>1991</date>
        )
        <biblScope unit="pp">77</biblScope>
        .
      </bibl>
    </note>
    <!-- snip -->
  </text>
</TEI>

I am using the following python code:

from lxml import etree
import glob
import os  # needed for os.path.basename below

def apply_xslt(xslt_path, xml_input_path, xml_output_path):
    try:
        xslt = etree.parse(xslt_path)
        xml = etree.parse(xml_input_path)
        # Compiling the stylesheet is the step that raises XSLTParseError
        transformer = etree.XSLT(xslt)
        new_xml = transformer(xml)
        with open(xml_output_path, 'w', encoding='utf-8') as f:
            # The result tree has to be serialized to text before writing
            f.write(str(new_xml))
    except etree.XSLTParseError as e:
        print(f"Error parsing XSLT file at {xslt_path}: {e}")

for input_path in glob.glob('tei/*.xml'):
    base_name = os.path.basename(input_path)
    apply_xslt('lib/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_functions.xsl',
               input_path, f'tei-biblstruct/{base_name}')

Running the code, I get

Error parsing XSLT file at lib/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_functions.xsl: Failed to compile predicate

When I open the file in my IDE, it displays many errors.

Also, the imports

    <xsl:include href="parameters.xsl"/>
    <xsl:import href="../../authority-files/xslt/functions.xsl"/>

are not resolved: the second one seems to be an invalid path outside the repo. The first one is also flagged as non-resolvable, although it should resolve if interpreted as a relative path.

Any idea what could be the problem?

cboulanger commented 2 months ago

Sorry, wrong XSLT file: it should of course be convert_tei-to-bibliographic-data-master/xslt/convert_tei-to-biblstruct_articles.xsl

Now the error is: Error parsing XSLT file at lib/convert_tei-to-bibliographic-data-master/xslt/convert_tei-to-biblstruct_articles.xsl: xsl:when : could not compile test expression '@type = ('section', 'item')'
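
The test expression in that message compares `@type` against a parenthesised sequence, which is XPath 2.0 syntax. lxml wraps libxslt, which implements XSLT 1.0 only, so this kind of expression cannot compile there. A minimal sketch of a pre-check that reads the version a stylesheet declares before handing it to lxml follows; the function names are hypothetical, and it assumes the stylesheet declares its version honestly on the root element:

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

def xslt_version_from_text(stylesheet_text):
    """Return the version attribute declared on the stylesheet root (hypothetical helper)."""
    root = ET.fromstring(stylesheet_text)
    # xsl:stylesheet and xsl:transform are the two permitted root element names
    if root.tag not in (f"{{{XSL_NS}}}stylesheet", f"{{{XSL_NS}}}transform"):
        raise ValueError("root element is not xsl:stylesheet or xsl:transform")
    return root.get("version")

def needs_xslt2_processor(stylesheet_text):
    """True if the declared version is above 1.0, i.e. beyond what libxslt/lxml implements."""
    version = xslt_version_from_text(stylesheet_text)
    return version is not None and float(version) > 1.0

def needs_xslt2_processor_for_file(path):
    """Same check for a stylesheet on disk; reads the file as bytes so the XML declaration is honoured."""
    with open(path, "rb") as f:
        return needs_xslt2_processor(f.read())
```

If the check returns True, an XSLT 2.0 or 3.0 processor such as Saxon is the safer choice. The check only looks at the declared version, so a stylesheet that declares 1.0 but uses 2.0 constructs would still slip through.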

cboulanger commented 2 months ago

I understand now that this repo depends on other repos of yours, such as https://github.com/OpenArabicPE/authority-files. It would be great if you added the list of dependencies to the documentation. When I run the demo script, I get: I/O error reported by XML parser processing file:/<snip>/convert-anystyle-data/xslt-calendar-conversion/functions/date-functions.xsl. Caused by java.io.FileNotFoundException: /<snip>/convert-anystyle-data/xslt-calendar-conversion/functions/date-functions.xsl (No such file or directory). Where do I find xslt-calendar-conversion?

cboulanger commented 2 months ago

From our conversation on Mastodon, for the record:

I've tried it directly with saxon and the xslt hosted on GitHub, but no luck yet:

$saxon -s:"tei/10.1111_1467-6478.00057.xml" -xsl:"https://openarabicpe.github.io/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl"
Error on line 6 column 88 of functions.xsl:
  XTSE0165  I/O error reported by XML parser processing
  https://openarabicpe.github.io/../xslt-calendar-conversion/functions/date-functions.xsl.
  Caused by java.io.IOException: Server returned HTTP response code: 400 for URL:
  https://openarabicpe.github.io/../xslt-calendar-conversion/functions/date-functions.xsl
I/O error reported by XML parser processing https://openarabicpe.github.io/../xslt-calendar-conversion/functions/date-functions.xsl

cboulanger commented 2 months ago

I updated the schema of the source TEI. An example is here. The error remains, unfortunately.

cboulanger commented 1 month ago

After your fixes, I got it to work using Saxon, as documented here: https://gitlab.gwdg.de/boulanger/experiments/-/blob/main/convert-anystyle-data/tei-to-bibformats.ipynb. If I understand correctly, lxml doesn't work because the stylesheets use XSLT 2.0 features.
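
For readers who want to stay in Python, a rough sketch of calling Saxon through `subprocess` follows. It is not taken from the linked notebook. It assumes a `saxon` launcher is available on the PATH (the thread above uses a `$saxon` shell variable instead), and the function names are hypothetical; `-s:`, `-xsl:` and `-o:` are Saxon's command-line options for source, stylesheet and output.

```python
import subprocess

def saxon_command(source_path, stylesheet, output_path, executable="saxon"):
    """Build the argument list for one Saxon run (hypothetical helper).

    `executable` is an assumption: adjust it to however Saxon is launched locally,
    for example a wrapper script or a `java -jar` invocation.
    """
    return [
        executable,
        f"-s:{source_path}",   # source XML document
        f"-xsl:{stylesheet}",  # stylesheet path or URL
        f"-o:{output_path}",   # where the result is written
    ]

def run_saxon(source_path, stylesheet, output_path, executable="saxon"):
    """Run the transformation; raises CalledProcessError if Saxon exits with an error."""
    subprocess.run(
        saxon_command(source_path, stylesheet, output_path, executable),
        check=True,
    )
```

A call would look like `run_saxon("tei/10.1111_1467-6478.00057.xml", "lib/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl", "tei-biblstruct/10.1111_1467-6478.00057.xml")`, where the local stylesheet path is an assumption based on the paths used earlier in this thread.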