The repository contains scripts and documentation for managing the multiple data sources for ALTLab's Plains Cree dictionary, which can be viewed online here. This repository does not (and should not) contain the actual data. That data is stored in the private ALTLab repo under crk/dicts.
The database uses the Data Format for Digital Linguistics (DaFoDiL) as its underlying data format, a set of recommendations for storing linguistic data in JSON.
ALTLab's dictionary database is / will be aggregated from the following sources, listed here by their abbreviations:
CW
MD
AECD (or AE or ED)
DLC
Also check out the Plains Cree Grammar Pages.
At a high level, the process for aggregating the sources is as follows:
The database is located in the private ALTLab repo at crk/dicts/database.ndjson. This repo includes the following JavaScript utilities for working with the database, both located in lib/utilities.
readNDJSON.js: Reads all the entries from the database (or any NDJSON file) into memory and returns a Promise that resolves to an Array of the entries for further querying and manipulation.
writeNDJSON.js: Accepts an Array of database entries (or any JavaScript Objects) and saves it to the specified path as an NDJSON file.
To build and/or update the database, follow the steps below. Each of these steps can be performed independently of the others. You can also rebuild the entire database with a single command (see the end of this section).
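As background for the steps that follow: NDJSON stores one JSON object per line, which is what the two utilities described above read and write. The snippet below is a minimal sketch of that behaviour, not the repository's own code. The function names (parseNDJSON, stringifyNDJSON, readEntries, writeEntries) are hypothetical and are not the exports of readNDJSON.js or writeNDJSON.js.

```javascript
// Sketch only: illustrates the NDJSON format, not the repo's utilities.
// All function names here are hypothetical.
const { readFile, writeFile } = require('fs').promises;

// Parse NDJSON text: one JSON object per non-empty line.
function parseNDJSON(text) {
  return text
    .split(/\r?\n/)
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line));
}

// Serialize an Array of Objects as one JSON document per line.
function stringifyNDJSON(entries) {
  return entries.map(entry => JSON.stringify(entry)).join('\n') + '\n';
}

// Read a whole NDJSON file into memory; resolves to an Array of entries.
async function readEntries(filePath) {
  return parseNDJSON(await readFile(filePath, 'utf8'));
}

// Write an Array of entries to the given path as NDJSON.
async function writeEntries(filePath, entries) {
  await writeFile(filePath, stringifyNDJSON(entries), 'utf8');
}
```

Because every entry sits on its own line, a database in this format can be inspected with ordinary line-based tools, and a change to one entry shows up as a one-line change in a diff.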
Download the original data sources. These are stored in the private ALTLab repo in crk/dicts. Do not commit these files to git.
altlab.tsv
Wolvengrey.toolbox
Maskwacis.tsv
Install Node.js. This will allow you to run the JavaScript scripts used by this project. Note that the Node installation includes the npm package manager, which allows you to install Node packages.
Install the dependencies for this repo: npm install.
Convert each data source by running node bin/convert-*.js <inputPath> <outputPath>, where * stands for the abbreviation of the data source, e.g. convert-CW data/Wolvengrey.toolbox data/CW.ndjson.
You can also convert individual data sources by running the conversion scripts as modules. Each conversion script is located in lib/convert/{ABBR}.js, where {ABBR} is the abbreviation for the data source. Each module exports a function which takes two arguments: the path to the data source and optionally the path where you would like the converted data saved (this should have a .ndjson extension). Each module returns an array of the converted entries as well.
Import each data source into the dictionary database with node bin/import-*.js <sourcePath> <databasePath>, where * stands for the abbreviation of the data source, <sourcePath> is the path to the individual source database, and <databasePath> is the path to the combined ALTLab database.
You can also import individual data sources by running the import scripts as modules. Each import script is located in /lib/import/{ABBR}.js, where {ABBR} is the abbreviation for the data source.
Entries from individual sources are not imported as main entries in the ALTLab database. Instead they are stored as subentries (using the dataSources field). The import script merely matches entries from individual sources to a main entry, or creates a main entry if none exists. An aggregation script then does the work of combining information from each of the subentries into a main entry (see the next step).
Each import step prints a table to the console, showing how many entries from the original data source were unmatched.
When importing the Maskwacîs database, you can add an -r or --report flag to output a list of unmatched entries to a file. The flag takes the file path as its argument.
Aggregate the data from the individual data sources: node bin/aggregate.js <inputPath> <outputPath> (the output path can be the same as the input path; this will overwrite the original).
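The division of labour between the import and aggregation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the logic of the actual scripts. Only the dataSources field name comes from this README. The lemma and definitions fields, keying dataSources by source abbreviation, and matching on an exact lemma string are all assumptions made for the example, and the function names are hypothetical.

```javascript
// Sketch only. `dataSources` is the one field name taken from the README;
// `lemma`, `definitions`, and exact-string matching are assumptions.

// Import step: attach each source entry to a main entry as a subentry,
// creating the main entry if no match exists. Returns the number of
// source entries that had no existing main entry (the "unmatched" count).
function importSource(database, sourceAbbr, sourceEntries) {
  const byLemma = new Map(database.map(entry => [entry.lemma, entry]));
  let unmatched = 0;

  for (const sourceEntry of sourceEntries) {
    let main = byLemma.get(sourceEntry.lemma);
    if (!main) {
      unmatched += 1;
      main = { lemma: sourceEntry.lemma, dataSources: {} };
      byLemma.set(main.lemma, main);
      database.push(main);
    }
    // The source entry is stored untouched; nothing is merged yet.
    main.dataSources[sourceAbbr] = sourceEntry;
  }
  return unmatched;
}

// Aggregation step: rebuild main-entry fields from the subentries.
function aggregate(database) {
  for (const main of database) {
    const definitions = new Set();
    for (const subentry of Object.values(main.dataSources)) {
      for (const definition of subentry.definitions || []) {
        definitions.add(definition);
      }
    }
    main.definitions = [...definitions];
  }
  return database;
}
```

Keeping each source's entry intact under dataSources means the aggregation can be rerun at any time without re-importing, which fits the note above that the aggregate output path may be the same as the input path.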
For convenience, you can perform all the above steps with a single command in the terminal: npm run build (or yarn build). In order for this command to work, you will need each of the following files to be present in the /data directory, with these exact filenames:
ALTLab.tsv
Maskwacis.tsv
Wolvengrey.toolbox
The database will be written to data/database.ndjson.
You can also run this script as a JavaScript module. It is located in lib/buildDatabase.js.
rm src/crkeng/db/db.sqlite3
pipenv shell
./crkeng-manage migrate
./crkeng-manage importjsondict {path/to/database.importjson}
--incremental
--no-translate-wordforms
./crkeng-manage runserver
./crkeng-manage buildtestimportjson --full-importjson {path/to/database.importjson}
pipenv run test
home/morphodict/altlab) or your user directory. (It can't be copied directly to its final destination because you must assume the morphodict user in order to have write access to the morphodict/ directory.)
sudo -i -u morphodict
/opt/morphodict/home/morphodict/src/crkeng/resources/dictionary/crkeng_dictionary.importjson by copying it from the private ALTLab repo located at /opt/morphodict/home/altlab/crk/dicts.
cd /opt/morphodict/home/morphodict/src/crkeng/resources/dictionary
docker ps | grep crkeng (docker ps lists docker processes)
docker exec -it --user=morphodict {containerID} ./crkeng-manage importjsondict --purge --incremental {path/to/database}
morphodict user is required to write changes.
src/crkeng/resources/dictionary/crkeng_dictionary.importjson or some variation thereof.
Tests for this repository are written using Mocha + Chai. The tests check that the conversion scripts are working properly, and test for known edge cases. There is one test suite for each conversion script (and some other miscellaneous unit tests as well), located alongside that script in lib with the extension .test.js. You can run the entire test suite with npm test.