
When in Rome

'When in Rome' brings together the world's functional harmonic analyses in encoded formats into a single, consistent repository. This enables musicians and developers to interact with that great body of work at scale, with minimal overheads.

In total, there are now approximately 2,000 analyses of 1,500 distinct works.

Additionally, 'When in Rome' provides code for working with these corpora, building on the music21 library for music analysis.

Is it for me?

This is best thought of as primarily a corpus of analyses, which secondarily provides code for working with them and includes the score where possible. That is, the focus is on the analyses. There is a great deal we can do with those analyses alone. Clearly, certain questions also require analysis-source alignment. We do our best to cater for that by including the score wherever possible, and as reliably aligned as possible (as anyone in the field knows, this is a significant challenge).

Maybe yes ...

'When in Rome' data is also used in external research projects and apps including the:

"Dezrann" app for visualising score-analysis and more (coming soon)
"Open Music Theory" (OMT) textbook's harmony anthology
"RAWL" app
"Sibelius" (Avid) notation software (by permission, coming soon)
"TiLiA" (TimeLineAnnotator) (coming soon)
"VIMU" (Visual Musicology) app via AugmentedNet

Are you using 'When in Rome' in a public-facing project? Let us know!

Maybe no ...

We're proud of how useful this is. All the same, it might not serve your needs. Might we suggest that if you're looking for:

scores only in a permissive licence, then try the OpenScore Lieder Corpus (1,300 songs, CC0 licence).
a small corpus of perfectly aligned scores + analyses (i.e., your priority is alignment, not overall size or diversity of content) then a single (not meta-) corpus like one of the DCML corpora listed below.

Corpus Directory Structure

Overall

<genre>/<composer>/<set>/<movement>/<files>

<genre>: A top level classification of the works by approximate genre or repertoire. As most corpora are prepared in relation to this categorisation, this top level division also reflects something of the corpora's origins. (For the avoidance of doubt, every analysis includes an attribution.)
<composer>: composer's name in the form Last,_First.
<set>: extended work (e.g. a song cycle or piano sonata) where applicable. Stand-alone scores are placed in a set called _ (i.e. a single underscore) for the sake of consistency.
<movement>: name and/or number of the movement. In the case of a piano sonata, folder names are generally number-only: e.g. 1. Most songs include both the name of the song and its position in the set (e.g. 1_Nach_Süden)
<files>: See the following sub-sections.

The Key Modulations and Tonicizations corpus is a slight exception: we preserve the organisation of that corpus by author, title, and example number, e.g., Corpus/Textbooks/Aldwell,_Edward/Harmony_and_Voice_Leading/2a/. So the <genre> is Textbooks, the <composer> is the author, the <set> is the title, and the <movement> is the example number. We find this more logical than re-organisation by composer.
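The layout above can be unpacked mechanically. The following is a minimal sketch, not part of this repo's code: the names `CorpusEntry` and `parse_corpus_path` are hypothetical, and the root folder name `Corpus` is taken from the example path above.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class CorpusEntry(NamedTuple):
    """The four levels of <genre>/<composer>/<set>/<movement>."""
    genre: str
    composer: str
    set_name: str
    movement: str


def parse_corpus_path(path: str, root: str = "Corpus") -> CorpusEntry:
    """Split a movement folder path into its four levels (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    if root in parts:
        # Drop everything up to and including the root folder.
        parts = parts[parts.index(root) + 1:]
    if len(parts) < 4:
        raise ValueError(f"Expected at least 4 levels below {root!r}: {path}")
    return CorpusEntry(*parts[:4])


entry = parse_corpus_path(
    "Corpus/Textbooks/Aldwell,_Edward/Harmony_and_Voice_Leading/2a/"
)
# Composer folders use the Last,_First form, so they split on ",_".
last_name, first_name = entry.composer.split(",_", 1)
print(entry.genre, last_name, first_name, entry.set_name, entry.movement)
```

For the textbook example, the four fields come out as the genre (Textbooks), the author, the title, and the example number, matching the mapping described above.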

All folders include:

score.mxl or a remote.json file including links to external score files
- What: score.mxl is a copy of the score in the compressed musicXML format. This is provided for all new scores, as well as all originating elsewhere
  where that original is in a format which music21 cannot parse.
- How to use: Open in any software for music notation (e.g., MuseScore).
- Where there is no local score.mxl, there is a remote.json instead. Please note:
- This file points to an externally hosted score in a format which music21 can parse.
- This is designed to prevent duplication and automatically include source updates.
- Note that MuseScore files are included in a local conversion (.mxl) rather than remote.
  - This is because music21 cannot parse them and conversion requires the mscore package (see Code.updates_and_checks.convert_musescore_score_corpus).
- For downloading a local copy of remote files, see Code.updates_and_checks.remote_scores and the argument convert_and_write_local. Read those docs for details and warnings.
- Please check and observe the licence of all scores, especially those hosted externally.
analysis.txt
- What: A human analysis in plain text.
- How to use: Open in any text editor. You can also use these analyses as a kind of template for your own, by creating a copy and editing only the moments you disagree with.
analysis_automatic.rntxt
- What: An automatic analysis made by AugmentedNet - a machine learning architecture which, in turn, is built on this meta-corpus' data.
- How to use: In exactly the same way as a human analysis, e.g., as a template (same format, same parsing routines).
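Because each folder holds either a local score.mxl or a remote.json, code that walks the corpus usually needs to check for both. Here is a rough sketch of that check using only the standard library. The function name `locate_score` is hypothetical, and the demo writes a placeholder remote.json because the real key names are not described here.

```python
import json
import tempfile
from pathlib import Path


def locate_score(folder):
    """Return ("local", path), ("remote", parsed_json) or ("missing", None)."""
    folder = Path(folder)
    local = folder / "score.mxl"
    if local.exists():
        return "local", local
    remote = folder / "remote.json"
    if remote.exists():
        with remote.open(encoding="utf-8") as handle:
            return "remote", json.load(handle)
    return "missing", None


# Demo on a throwaway folder. The JSON content is a placeholder,
# not the real structure of a remote.json file in this corpus.
with tempfile.TemporaryDirectory() as tmp:
    placeholder = {"placeholder": "value"}
    (Path(tmp) / "remote.json").write_text(json.dumps(placeholder), encoding="utf-8")
    kind, payload = locate_score(tmp)

print(kind, payload)
```

For actually fetching remote scores, the repo's own Code.updates_and_checks.remote_scores (mentioned above) is the place to look. This sketch only distinguishes the two cases.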

Some folders include:

remote.json files
- What: this provides additional information about remote content including paths to external scores as discussed above.
- Additionally, we take the opportunity to provide metadata including composer name and one or more sets of catalogue information (Opus and/or equivalent).
analysis_<analyst>.txt
- What: An alternative analysis. This takes one of two forms:
- A copy of an original analysis exactly as converted for cases where significant changes have been made to that analysis. See, for example, this edit of this "original"
- A second analysis of the same work. The 'TAVERN' dataset includes pairs of analyses of the same work. In order to ensure there is exactly one analysis.txt throughout, we name the pair analysis.txt (note not analysis_A.txt) and analysis_B.txt.
- We likewise organise cases of two separate corpora of analyses of the same music this way.
  - The set which is complete takes precedence for the analysis.txt name.
- How to use: All such text files can be opened in the normal way. "Original conversions" serve as a point of reference for full disclosure on the conversion process.
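The naming scheme for analysis files makes it straightforward to tell them apart programmatically. The sketch below is an illustration only: `analysis_label` and the labels "primary" and "automatic" are made-up names, not part of the repo's code.

```python
def analysis_label(filename):
    """Classify a file name by the analysis naming scheme described above.

    Returns "primary" for analysis.txt, "automatic" for
    analysis_automatic.rntxt, the <analyst> part of analysis_<analyst>.txt,
    or None for anything else.
    """
    if filename == "analysis.txt":
        return "primary"
    if filename == "analysis_automatic.rntxt":
        return "automatic"
    prefix, suffix = "analysis_", ".txt"
    if filename.startswith(prefix) and filename.endswith(suffix):
        return filename[len(prefix):-len(suffix)]
    return None


names = ["analysis.txt", "analysis_B.txt", "analysis_automatic.rntxt", "score.mxl"]
labels = {name: analysis_label(name) for name in names}
print(labels)
```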

Optional extra files (not included but easy to generate):

This repo. includes code and clear instructions for creating any or all of the following additional files for the whole meta-corpus, or for a specific sub-corpus.

The example folder contains all of these files for one example score: Clara Schumann's Lieder, Op.12, No.4, 'Liebst du um Schönheit'. Most of the variants derive from the options for pitch class profile generation, creating files in the form: profiles_<and_features_>by_<segmentation_type>.<format>

<and_features_> (optional) includes harmonic feature information. See notes at Code/Pitch_profiles/chord_features.py
<segmentation_type> options group by moments of change to the chord, key, or measure.
<format> options are .arff, .csv, .json, and .tsv.
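To see how many variants that pattern implies, the options can be enumerated. This is a sketch under assumptions: the segmentation spellings `chord`, `key`, and `measure` are inferred from the description above rather than checked against the generated files.

```python
from itertools import product

FEATURES = ("", "and_features_")              # the optional <and_features_> part
SEGMENTATIONS = ("chord", "key", "measure")   # assumed spellings
FORMATS = ("arff", "csv", "json", "tsv")


def profile_filenames():
    """List every name of the form profiles_<and_features_>by_<segmentation_type>.<format>."""
    return [
        f"profiles_{features}by_{segmentation}.{extension}"
        for features, segmentation, extension in product(FEATURES, SEGMENTATIONS, FORMATS)
    ]


names = profile_filenames()
print(len(names))
print(names[0], names[-1])
```

Two feature settings, three segmentation types, and four formats give 24 possible files per score, which is why they are generated on demand rather than stored for every entry.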

Apart from these, the example folder also contains the files which are included in all folders by default (see above) as well as others that can likewise be generated across the meta-corpus:

analysis_on_score.mxl: the analysis rendered in musical notation alongside the score (as an additional 'part').
feedback_on_analysis.txt: automatically generated feedback on any analysis complete with an overall rating. Useful for proofreading. See Code/romanUmpire.py for more details on what it can and can't do.
<Keys_or_chords>_and_distributions.tsv: pitch class distributions for each range delimited by a single key or chord. See notes at Code/Pitch_profiles/get_distributions.py
slices.tsv and/or slices_with_analysis.tsv: a tabular representation of the score in 'slices' - vertical cross-sections of the score, with one entry for each change of pitch. This is useful for various tasks, both human (at-a-glance checks) and automatic (much quicker to load and process than parsing musicXML). The columns from left to right set out the:
- Offset from the start (a time stamp measured in terms of quarter notes),
- Measure number,
- Beat,
- Beat 'Strength' (from relative metrical position),
- Length (also measured in quarter notes),
- Pitches,
- and where the analysis is included, also Key, Chord
template.txt: a proto-analysis text file with only the metadata, time signatures, measures, and measure equality ranges as a template - i.e. all the information you need from the score with space to enter your own analysis from scratch.

This is clearly too much to include for every entry. Use the example folder to see the options and try them out before you commit to a corpus-wide generation.
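As an illustration of why the slices files are quick to load, here is a minimal reader based on the column order listed above. It is a sketch under assumptions: the column labels are invented for this example, the file is assumed to have no header row, and the two sample rows are made up rather than taken from the corpus.

```python
import csv
import io

# Labels invented here, following the column order described above.
COLUMNS = ["offset", "measure", "beat", "beat_strength", "length", "pitches", "key", "chord"]


def read_slices(handle):
    """Read tab-separated slices into a list of dicts keyed by COLUMNS.

    zip() stops at the shorter sequence, so a plain slices.tsv
    (without Key and Chord) simply yields dicts with six entries.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if fields:
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Made-up sample rows in the assumed layout.
sample = (
    "0.0\t1\t1.0\t1.0\t1.0\t['C4', 'E4', 'G4']\tC\tI\n"
    "1.0\t1\t2.0\t0.25\t1.0\t['F4', 'A4', 'C5']\tC\tIV\n"
)
rows = read_slices(io.StringIO(sample))
print(rows[0]["chord"], rows[1]["chord"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, e.g. `open(path, newline="", encoding="utf-8")`.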

Corpus Overview

This corpus involves the combination of new analyses with conversions of those originating elsewhere.

Corpora originating elsewhere

Converted from other formats:

the DCMLab's standard (conversion code here):
- Beethoven string quartets (complete, 16 string quartets, 70 movements): originating from the 'ABC' corpus.
- Mozart Piano Sonatas (complete, 18 sonatas): originating from 'The Annotated Mozart Sonatas' corpus.
- Several collections including the Chopin Mazurkas (56 works): originating from DCML's 'romantic_piano_corpus'.
krn format (with thanks to @napulen):
- 27 sets of keyboard variations by Mozart and Beethoven, from the 'TAVERN' project (Devaney et al., ISMIR 2015)
- Haydn Op. 20 String Quartets: Complete annotations of Haydn's Op. 20 (6 string quartets, 24 movements), from the
  MTG dataset
- Key Modulations and Tonicizations: Modulation examples annotated from five music theory textbooks. Published in Nápoles López et al. 2020.
other:
- Beethoven Piano Sonata (complete first movements, 32 movements), from Tsung-Ping Chen and Li Su's 'BPS-FH' dataset, ISMIR 2018.

Analyses originally in the 'RomanText' format (no conversion needed), analysed by Dmitri Tymoczko and colleagues, and forming part of the supplementary material to Tymoczko's forthcoming "TAOM", include:

Monteverdi madrigals: Complete scores and analyses for books 3–5 of the Monteverdi madrigals (48 works) also to be seen in this part
of the music21 corpus (but updated since that version).
Bach Chorales: 371 chorales, of which a subset of 20 was first released on music21.
Several further collections including a second set of analyses for most of the Chopin Mazurkas

Mixed sources

Several corpora have full or partial coverage from more than one source. The most complex case is the Beethoven Piano Sonata collection, for which there are 3 external corpora, all of them incomplete:

64 movements from DCML's 'romantic_piano_corpus'.
36 movements from Dmitri Tymoczko's TAOM collection
32 movements (complete first movements) as converted from the
'BPS-FH' dataset, ISMIR 2018.

There is not yet a single source for this collection. Are you tempted to attempt that? Do get in touch.

New corpora by MG and colleagues

Bach Preludes: Complete preludes from the first book of Bach's Well Tempered Clavier (24 analyses)
Ground bass works by Bach and Purcell.
Nineteenth-century songs: A sample of songs from the OpenScore / 'Scores of Scores' lieder corpus
(mirroring the public-facing score collection hosted here), including analyses for the complete Winterreise and
Schwanengesang cycles (Schubert),
Dichterliebe (Schumann),
and many of the songs by women composers that constitute a key part of and motivation for that collection.

Code and Lists

For developers, please see the individual code files for details of what they do and how.

Run code scripts from the repo's base directory (When-in-Rome) using the format:

>>> python3 -m Code.<name_of_file>

For example, this is the syntax for processing one score (feedback, slices, etc.):

>>> python3 -m Code.updates_and_checks --process_one_score OpenScore-LiederCorpus/Bonis,_Mel/_/Allons_prier!
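Some corpus paths contain characters that a shell treats specially (the `!` in the example above, for instance), so it can be safer to build the command as a list and let Python handle the quoting. The following is a sketch only: `build_command` is a hypothetical helper, and the composer folder is written in the Last,_First form described earlier.

```python
import shlex


def build_command(module, *args):
    """Assemble a `python3 -m Code.<module>` call as an argument list."""
    return ["python3", "-m", f"Code.{module}", *args]


command = build_command(
    "updates_and_checks",
    "--process_one_score",
    "OpenScore-LiederCorpus/Bonis,_Mel/_/Allons_prier!",
)

# A shell-safe one-liner for copy and paste (shlex.join needs Python 3.8+).
shell_line = shlex.join(command)
print(shell_line)

# To run it from the repo's base directory, something like:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting altogether, since each argument reaches the script exactly as written.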

Briefly, this repo. includes:

The Roman Umpire for providing automatic 'feedback' files. It takes in a harmonic analysis and the corresponding score to assess how well they match. Working in Harmony is an initial attempt at an interactive app for making use of this code online (no downloads, coding, or dependencies).
Anthology for retrieving instances of specific chords and progressions from the analyses.
Pitch_profiles for producing the profile and feature information discussed above.

Here are a couple of examples of what all that can lead to:

A histogram of augmented chord usage in the lieder corpus ...

... and a histogram of fifth progression types across corpora:

Licence, Citation, Contribution

Licence

New content in this repository, including the new analyses, code, and the conversion (specifically) of existing analyses is available under the CC BY-SA licence (a free culture licence) except by arrangement. Please get in touch with requests for special permission.

For analyses that originated elsewhere and have been converted into the format used here, please refer to the original source for licence. Links are provided to those original sources throughout the repository including the itemised list above and within every analysis.txt file.

These external licences vary. As far as we can tell, all the content here is either original to this repo, or properly credited and fair to use in this way. If you think you see an issue, please let us know. Again, if you are simply looking for scores in a maximally permissive licence, then head to the OpenScore collections, which are notable for using CC0.

For research and other public-facing projects making use of this work, please cite or otherwise acknowledge one or more of the papers listed below as appropriate to your project.

Citation

Here's the best way to cite the code and/or corpus:

@article{gotham_when_2023,
    title = {When in {Rome}: a meta-corpus of functional harmony},
    shorttitle = {When in {Rome}},
    journal = {Transactions of the International Society for Music Information Retrieval},
    author = {Gotham, Mark and Micchi, Gianluca and Nápoles-López, Néstor and Sailor, Malcolm},
    year = {2023},
}

Alternatively, depending on the specific context, it may be appropriate to cite one of the papers using this data and functionality:

Syntax and Contributing

As the papers attest, harmonic analysis is fundamentally, necessarily, and intentionally a reductive act that includes a good degree of subjective reading. As such, these analyses are not in any sense 'definitive', to the exclusion of other possibilities. Quite the opposite: part of the point of having a representation format like this is to enable the recording of variant readings. Please feel free to re-analyse these works by using the existing analysis as a template and changing the parts you disagree with.

For minor changes, consider integrating your edits into the existing file using the variant (var) option that rntxt provides. E.g. m1 I b2 IV followed by a new line with m1var1 I b2 ii6
For more thoroughly divergent analyses, a new file may be warranted. In that case, perhaps credit the original analyst too in the format - Analyst: [Your name] after [their name]
For any cases of clear errors, please submit a pull request with the correction.
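To make the variant syntax concrete, here is a toy reader for lines like the two quoted above. It is a deliberately simplified sketch, not the music21 RomanText parser: `parse_measure_line` is a hypothetical name, and it handles only a measure label, optional `var` number, beat markers such as `b2`, and chord figures.

```python
import re

MEASURE_RE = re.compile(r"^m(\d+)(?:var(\d+))?\s+(.*)$")
BEAT_RE = re.compile(r"b\d+(?:\.\d+)?")  # "b2" is a beat; "bII" is a chord figure


def parse_measure_line(line):
    """Return (measure, variant_or_None, [(beat, figure), ...]) for a simple line.

    Simplification: any token that is not a beat marker is treated as a
    chord figure, so key indications and other RomanText features are
    not handled here.
    """
    match = MEASURE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Not a measure line: {line!r}")
    measure = int(match.group(1))
    variant = int(match.group(2)) if match.group(2) else None
    beat = 1.0  # chords before any beat marker fall on beat 1
    chords = []
    for token in match.group(3).split():
        if BEAT_RE.fullmatch(token):
            beat = float(token[1:])
        else:
            chords.append((beat, token))
    return measure, variant, chords


original = parse_measure_line("m1 I b2 IV")
variant = parse_measure_line("m1var1 I b2 ii6")
print(original)
print(variant)
```

Both lines describe measure 1. The second carries variant number 1 and differs only in the chord on beat 2, which is the kind of local disagreement the var option is meant to record.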

For more details of the RomanText format used to encode analyses here, see:

the technical specification paper, or
the relevant corners of music21's
- code,
- module reference, or
- (if in doubt) user guide
this repository's own "quick start" guide to writing in RomanText.
