The working files and data are housed in m-k-manuscript-data.

The Manuscript class represents a Python version of BnF Ms 640. It contains a list of Entry objects, which hold the raw XML data from each entry along with some other data such as categories, title, ID, and properties.

When Manuscript is instantiated, it reads a folder full of folios and processes it into its component entries, each of which becomes an Entry object. The repository also includes a script that builds the Manuscript object from all the folders in the ms-xml directory of your local m-k-manuscript-data repository and then writes the derivative forms and the entry-metadata table back to that repository.
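As a rough illustration of the structure described above, here is a minimal sketch. It is not the repository's actual implementation: the class and field names (EntrySketch, ManuscriptSketch, identity, categories) are assumptions chosen for illustration, and only the nested lookup by version and entry ID mirrors the usage shown later in this README.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntrySketch:
    """Hypothetical stand-in for an Entry: raw XML plus a little metadata."""
    identity: str
    title: str
    xml: str
    categories: List[str] = field(default_factory=list)


@dataclass
class ManuscriptSketch:
    """Hypothetical stand-in for Manuscript: entries grouped by version, then ID."""
    entries: Dict[str, Dict[str, EntrySketch]] = field(default_factory=dict)

    def add(self, version: str, entry: EntrySketch) -> None:
        # Create the per-version dictionary on first use, then store by ID.
        self.entries.setdefault(version, {})[entry.identity] = entry


ms = ManuscriptSketch()
ms.add("tl", EntrySketch("p005r_2", "Example title", "<entry>...</entry>", ["example"]))
print(ms.entries["tl"]["p005r_2"].title)
```

Keeping entries in a dictionary keyed by ID makes single-entry lookups direct, while iterating over `.items()` still supports whole-manuscript queries.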

The derivative files/folders are:

Note for TXT versions:


These are detailed instructions for how to use the script. For simpler steps for updating derivatives in m-k-manuscript-data, see the README in that repo.

  1. Update m-k-manuscript-data master branch:

    1. cd to m-k-manuscript-data directory
    2. git fetch
    3. git pull to make sure you are up to date
    4. Checkout a branch: git checkout -b [name of branch] -- though you will be running code in the manuscript-object directory, its output will be written to the m-k-manuscript-data directory (i.e., the changes will be made in that directory)
  2. Navigate to your local manuscript-object directory: cd to manuscript-object directory
  3. Make sure you're on the correct (usually master) branch by typing git status. If you're not on the correct branch, type git checkout [BRANCH_NAME] to switch to it (add -b only if you need to create a new branch).
  4. git fetch
  5. git pull
  6. Run the script. Detailed instructions are below, but specific tasks are listed here:
    • To regenerate all the derivative files from originals: python3
    • To test without generating any derivative files: python3 -d
    • To only generate specific derivatives: python3 [--all-folios] [--entries] [--metadata] [--txt], without the brackets
    • To generate a derivative and write its output to a folder of your choice: python3 <DERIVATIVE TAG> [PATH/TO/FOLDER], without the brackets
    • e.g.: python3 --metadata ./test-metadata/ will write entry-metadata.csv to the test-metadata/ directory instead of the default, which is the metadata/ directory in your local m-k-manuscript-data repo
    • NOTE: as of January 6th, 2021, this will break if you try to provide a path to a folder that does not exist.
    • To show the help message: python3 -h
usage: [-h] [-d] [-v] [-b] [-a [ALL_FOLIOS]] [-m [METADATA]] [-t [TXT]] [-e [ENTRIES]] [path]

Generate and update derivative files from original ms-xml folios.

positional arguments:
  path                  Path to m-k-manuscript-data directory. Defaults to the sibling of your current directory.

optional arguments:
  -h, --help            show this help message and exit
  -d, --dry-run         Generate as usual, but do not write derivatives.
  -v, --verbose         Write verbose generation progress to stdout.
  -b, --bypass          Bypass user y/n confirmation. Useful for automation.
  -a [ALL_FOLIOS], --all-folios [ALL_FOLIOS]
                        Update allFolios derivative files. Disables generation of other derivatives unless those are
                        also specified. Optional argument: folder path to which to write derivative files.
  -m [METADATA], --metadata [METADATA]
                        Update metadata derivative files. Disables generation of other derivatives unless those are
                        also specified. Optional argument: folder path to which to write derivative files.
  -t [TXT], --txt [TXT]
                        Update ms-txt derivative files. Disables generation of other derivatives unless those are also
                        specified. Optional argument: folder path to which to write derivative files.
  -e [ENTRIES], --entries [ENTRIES]
                        Update entries derivative files. Disables generation of other derivatives unless those are
                        also specified. Optional argument: folder path to which to write derivative files.
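
The help text above suggests that each derivative flag takes an optional folder argument and that naming any flag restricts generation to the derivatives named. A minimal argparse sketch of that behavior follows. It assumes nothing about the script's real internals; build_parser and selected_derivatives are hypothetical names.

```python
import argparse

DERIVATIVES = ["all_folios", "metadata", "txt", "entries"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Generate and update derivative files from original ms-xml folios."
    )
    parser.add_argument("-d", "--dry-run", action="store_true")
    for short, name in zip("amte", DERIVATIVES):
        # nargs="?" makes the folder optional: flag absent -> None,
        # flag alone -> True, flag with a value -> that folder path.
        parser.add_argument(
            f"-{short}", f"--{name.replace('_', '-')}",
            nargs="?", const=True, default=None, dest=name,
        )
    parser.add_argument("path", nargs="?", default=None)
    return parser


def selected_derivatives(args: argparse.Namespace) -> dict:
    """Map each requested derivative to True or to a custom output folder."""
    chosen = {n: getattr(args, n) for n in DERIVATIVES if getattr(args, n) is not None}
    if not chosen:
        # No derivative flags given: generate everything to default locations.
        chosen = {n: True for n in DERIVATIVES}
    return chosen


args = build_parser().parse_args(["--metadata", "./test-metadata/"])
print(selected_derivatives(args))
```

In this sketch, a value of True stands for "write to the default location", and a string stands for a user-supplied folder.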


Note: if you have multiple versions of Python 3 installed, specify the version you want when running commands, e.g., python3.7 instead of python3.

  1. Install Python (version 3.7+)

  2. Install Pip

  3. Install Pipenv via python3 -m pip install pipenv

  1. Clone the repositories into separate folders in the same directory:
    git clone
    git clone

    E.g., after running these commands in a folder called, for example, 'mkp', you should see:


  1. Run python3 -m pipenv install to install dependencies to the pipenv shell. If you get a version error, try python3 -m pipenv install --python [VERSION], where [VERSION] is the version of Python you just installed (e.g. 3.7.4).

  2. To enter the pipenv shell, run python3 -m pipenv shell. To exit, press ^D or type exit. Inside the pipenv shell, all outside dependencies for the repository are installed.

Helpful hint: If you just want to run a specific command (e.g. run a file) without entering the shell, use python3 -m pipenv run [COMMAND]. If you find yourself doing this often, consider adding an alias, e.g. so you can simply write: pipenv run python3

Interacting with the Manuscript object in Python

If you are a little savvy with Python, you can interact directly with the Manuscript in a Python interpreter. Open up the Python interpreter, Jupyter Notebook, or IPython in the manuscript-object directory and enter the following:

> from manuscript import *
> m = Manuscript(utils.ms_xml_path)

Now the Manuscript is held in memory with the variable name m. You can look at a particular entry like this:

> e = m.entries['tl']['p005r_2']

And you can inspect various aspects about it:

There are also several functions which are useful when interacting with entries:

> find_terms(e.xml, "env")
 'pierced door of a closed room',

With a bit of Python, you can make complex queries about the manuscript this way.

> for id, entry in m.entries['tl'].items():
    if len(find_terms(entry.xml, "env")) > 0:
        print(id)

Just like that, you get a list of all the entries with environment tags in them!
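
For readers curious what a tag search of this kind involves, here is a minimal sketch using only the standard library. The function find_terms_sketch is a hypothetical stand-in, not the repository's find_terms, and the sample XML is invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_terms_sketch(xml: str, tag: str) -> List[str]:
    """Return the whitespace-normalized text of every <tag> element in xml."""
    root = ET.fromstring(xml)
    terms = []
    for element in root.iter(tag):
        # itertext() also gathers text from nested child elements.
        text = " ".join("".join(element.itertext()).split())
        if text:
            terms.append(text)
    return terms


# Invented sample markup, for illustration only.
sample = "<entry><ab>Keep it in a <env>closed room</env> near the <env>ditch</env>.</ab></entry>"
print(find_terms_sketch(sample, "env"))  # prints ['closed room', 'ditch']
```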

If we store some data in a list, we can plot the number of env tag occurrences by entry:

> import matplotlib.pyplot as plt
> ids = []
> n_terms = []
> for id, entry in m.entries['tl'].items():
    terms = find_terms(entry.xml, "env")
    if len(terms) > 0:
        ids.append(id)
        n_terms.append(len(terms))
> plt.scatter(ids, n_terms)

scatter plot

With a little extra formatting, you have a visualization of roughly where env tags appear in the manuscript!

We see that entry 17r_1 has a ton of environment tags. Why is this?

> e = m.entries['tl']['p017r_1']
> e.title
On the gunner
> find_terms(e.xml, "env")
['ditch casemates',
 'private houses',
 'small towns',
 'fortresses of little importance',
 'edge of the ditch',
 'garrets topped with a tower',
 'house or elsewhere',

So this is an entry discussing how gunners interacted with various environments in order to defend or attack them! It looks pretty long. How many characters is it?

> len(e.text)

Looks like a big number, but how much is that in context?

> lengths = [len(entry.text) for entry in m.entries["tl"].values()]
> average = sum(lengths) / len(lengths)
> average

Wow! Compared to the average, this entry is super long! But that doesn't tell us anything about the actual distribution.

> import math
> sd = math.sqrt(sum((x - average)**2 for x in lengths) / len(lengths))
> sd

Unsurprisingly, we have a pretty significant standard deviation.
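
The manual calculation above is the population standard deviation, which Python's standard library also provides as statistics.pstdev. A small check on invented sample lengths (not real manuscript data):

```python
import math
import statistics

# Invented sample lengths, for illustration only.
sample_lengths = [120, 340, 560, 980, 14000]

average = sum(sample_lengths) / len(sample_lengths)
manual_sd = math.sqrt(
    sum((x - average) ** 2 for x in sample_lengths) / len(sample_lengths)
)

# pstdev divides by N (population); stdev would divide by N - 1 (sample).
library_sd = statistics.pstdev(sample_lengths)

print(math.isclose(manual_sd, library_sd))  # prints True
```

Because the entries are the whole manuscript rather than a sample drawn from it, the population form is a reasonable choice here.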

> len(e.text) / sd

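So the length of entry 17r_1 is about 13 times the standard deviation of entry lengths! (Strictly speaking, its distance from the average is (len(e.text) - average) / sd, which is slightly smaller.)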
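
A common way to express distance from the average is the z-score, which subtracts the mean before dividing by the standard deviation. A minimal sketch on invented numbers follows; z_score is a hypothetical helper, and the values are not real manuscript data.

```python
import statistics
from typing import Sequence


def z_score(value: float, population: Sequence[float]) -> float:
    """Number of population standard deviations between value and the mean."""
    mean = statistics.mean(population)
    sd = statistics.pstdev(population)
    return (value - mean) / sd


# Invented values with a mean of 5 and a population standard deviation of 2.
sample_lengths = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(9, sample_lengths))  # prints 2.0
```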

It's very easy to go from here to a simple histogram showing the length distribution:

> plt.hist(lengths, bins=100)
> plt.axvline(average, color="orange")
> for x in range(1,14):
    plt.axvline(average + x*sd, color="purple", linewidth=0.5)


The orange line is the mean; the purple lines are standard deviations. That tiny blue blip around 14000 must be entry 17r_1.

Admittedly, this sort of statistic is not terribly informative on this kind of dataset, but possibilities abound. Holding the entries in memory as a Python object makes interacting with the manuscript simple and powerful.