dracor-org / fredracor

French Drama Corpus

Refactor transformation process #3

Closed cmil closed 3 years ago

cmil commented 3 years ago

This is a major refactoring of the Théâtre Classique to DraCor (tc2dracor) transformation process. The goals are to decouple retrieving the sources from the actual transformation and to make the overall process more flexible. The significant changes are:

  1. The transformation script no longer deals with obtaining the original sources. This part has been moved to the theatre-classique repo.
  2. The XQuery code (tc2dracor.xq) has been reduced to transforming a single source document into the DraCor TEI structure. Most of the control flow has been moved to the bash script (tc2dracor), which gained command line options for ease of use. This turned out to be a huge performance boost, so much so that I can now transform the entire corpus on a single database thread in less than 4 minutes!
  3. The validation part has been moved to a separate script (validate) that can actually be used to validate other corpora as well.
  4. As a next step in improving the transformation result, the author information has been consolidated and matched to the various spellings and forms found in the source documents.
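
To illustrate the split described in item 2, here is a rough sketch of the shape I mean: the shell owns the loop, and the transformation step only ever sees one document. This is not the actual tc2dracor code. The function names `transform_one` and `transform_all` are made up for illustration, and `cp` merely stands in for the call that runs tc2dracor.xq on a single file.

```shell
#!/bin/sh
# Sketch only: function names and directory layout are hypothetical,
# not the actual tc2dracor implementation.
set -eu

# Stand-in for the single-document transformation step. In the real
# setup this is where tc2dracor.xq would be run on one source file;
# here the file is simply copied.
transform_one() {
  cp "$1" "$2"
}

# Control flow lives in the shell: iterate over the source directory
# and transform each document independently of the others.
transform_all() {
  ta_src_dir="$1"
  ta_out_dir="$2"
  mkdir -p "$ta_out_dir"
  for ta_src in "$ta_src_dir"/*.xml; do
    [ -e "$ta_src" ] || continue   # skip when the glob matches nothing
    transform_one "$ta_src" "$ta_out_dir/$(basename "$ta_src")"
  done
}

# Example call with hypothetical directories:
#   transform_all sources tei
```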

cmil commented 3 years ago

Let's keep the refactoring branch so that the links in the PR description and possibly elsewhere keep working.