cldf-datasets / tangclassifiers

CLDF dataset derived from Tang and Her's "Quantitative typological data on classifiers and plural markers" from 2019
Creative Commons Attribution 4.0 International

Quick question on using this dataset in clld #8

Closed marctang closed 4 years ago

marctang commented 4 years ago

If I may use the opportunity to ask for more advice :-P, here it is: I am also planning to deploy some other databases with the clld framework. Now, thanks to you guys, I will first be able to format these databases according to the CLDF format myself and add them to cldf-datasets.

For using clld, I followed the explanation online (https://clld.readthedocs.io/en/latest/tutorial.html#populating-the-database) and I can get the default example to work. However, I am less sure about how I could, for example, feed the tangclassifiers CLDF dataset to it and deploy it online with clld. Sorry in advance if it is a stupid question @@

xrotwang commented 4 years ago

I just recently added a bit more support for feeding clld apps from CLDF data. There's now a command clld create:

$ clld create -h
usage: clld create [-h] [-f] [--quiet] outdir [variables [variables ...]]

Create the skeleton for a clld app project.

Variables:
- directory_name: The name of the project directory. This will also be used as name
  of the python package.
- cldf_module: If the app data is initialized from a CLDF dataset, specify the CLDF
  module this dataset conforms to (Wordlist|StructureDataset|Dictionary|Generic).
  Leave empty otherwise.
  Note that this requires passing an `--cldf` option to `clld initdb`.
- mpg: Specify "y" if the app is served from MPG servers, and thus needs to fulfill
  certain legal obligations (n|y).

positional arguments:
  outdir       Output directory. The last path segment will be used as default
               value for the 'directory_name' variable.
  variables    If run non-interactively, defaults for the template variables
               can be passed in as 'key=value'-formatted arguments (default:
               None)

optional arguments:
  -h, --help   show this help message and exit
  -f, --force  Overwrite an existing project directory (default: False)
  --quiet      Run non-interactively, i.e. do not prompt for template variable
               input. (default: False)

which will create the skeleton for a clld app. To load the data into the db, you'd run clld initdb:

$ clld initdb -h
usage: clld initdb [-h] [--prime-cache-only] [--cldf CLDF]
                   [--concepticon CONCEPTICON]
                   [--concepticon-version CONCEPTICON_VERSION]
                   [--glottolog GLOTTOLOG]
                   [--glottolog-version GLOTTOLOG_VERSION]
                   config-uri

positional arguments:
  config-uri            ini file providing app config

optional arguments:
  -h, --help            show this help message and exit
  --prime-cache-only
  --cldf CLDF
  --concepticon CONCEPTICON
                        Path to repository clone of Concepticon data (default:
                        None)
  --concepticon-version CONCEPTICON_VERSION
                        Version of Concepticon data to checkout (default:
                        None)
  --glottolog GLOTTOLOG
                        Path to repository clone of Glottolog data (default:
                        None)
  --glottolog-version GLOTTOLOG_VERSION
                        Version of Glottolog data to checkout (default: None)
xrotwang commented 4 years ago

For simpler datasets - perhaps your classifier data - an online app could also be deployed via datasette-cldf - which would make hosting somewhat simpler because there's no need for a database server.

marctang commented 4 years ago

Thanks for the suggestions! I had a look at datasette. Since the database will have quite a lot of features to be added (WALS style), we would prefer to use the clld structure from the start, which will make the expansion easier later on.

A few quick confirmation-questions :-P, if I take tangclassifiers as an example:

marctang commented 4 years ago

Another comment on the side, which I have already made to quite a few of your colleagues: the structure of CLDF and clld is really nice and clean! Many thanks for making it available and nicely structured, and for supporting the deployment of data on it :-)

LinguList commented 4 years ago

> Another comment on the side, which I have already made to quite a few of your colleagues: the structure of CLDF and clld is really nice and clean! Many thanks for making it available and nicely structured, and for supporting the deployment of data on it :-)

Nice to hear that. We need more people to spread the word, so we can teach each other how to apply it. That's why it is also important for us to help those interested in these initial conversions: so they can then teach others.

marctang commented 4 years ago

Well noted :-)! I will do my best to spread the word and the method.

xrotwang commented 4 years ago

@marctang regarding the clld initdb call: the --cldf option should point to the metadata file, i.e. cldf/StructureDataset-metadata.json.
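
A quick way to double-check that path before running clld initdb (just a sketch, using pycldf directly):

from pycldf import Dataset

# Point this at the same metadata file you pass via --cldf.
ds = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
print(ds.module)  # should print 'StructureDataset'
ds.validate()     # complains if the data does not conform to the CLDF spec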

marctang commented 4 years ago

@xrotwang Thanks for the reply :-)

So, I activated a virtual environment and used 'clld create tangclassifiers' to create the skeleton; then, from clld/tangclassifiers, I tried the following:

clld initdb development.ini --cldf ~/Desktop/GitHub/tangclassifiers/cldf/StructureDataset-metadata.json 

which gave the following output, telling me that the path to Glottolog is required:

INFO    dropping sqlite:/db.sqlite
INFO    creating sqlite:/db.sqlite
Traceback (most recent call last):
  File "/home/marctang/.venv/bin/clld", line 11, in <module>
    load_entry_point('clld', 'console_scripts', 'clld')()
  File "/home/marctang/clld/src/clld/__main__.py", line 26, in main
    return args.main(args) or 0
  File "/home/marctang/clld/src/clld/commands/initdb.py", line 66, in run
    args.initializedb.main(args)
  File "/home/marctang/clld/tangclassifiers/tangclassifiers/scripts/initializedb.py", line 29, in main
    assert args.glottolog, 'The --glottolog option is required!'
AssertionError: The --glottolog option is required!

So, I cloned Glottolog's GitHub repository (https://github.com/glottolog/glottolog) and tried the following:

clld initdb development.ini --cldf ~/Desktop/GitHub/tangclassifiers/cldf/StructureDataset-metadata.json --glottolog ~/Desktop/GitHub/glottolog/

This gives the following error. I guess I got something wrong in the settings or forgot to do something, e.g. should I download the repository from Zenodo instead of GitHub, or change something in the .py files? Could you point me in the right direction? Thanks!

INFO    dropping sqlite:/db.sqlite
INFO    creating sqlite:/db.sqlite
Traceback (most recent call last):
  File "/home/marctang/.venv/bin/clld", line 11, in <module>
    load_entry_point('clld', 'console_scripts', 'clld')()
  File "/home/marctang/clld/src/clld/__main__.py", line 26, in main
    return args.main(args) or 0
  File "/home/marctang/clld/src/clld/commands/initdb.py", line 66, in run
    args.initializedb.main(args)
  File "/home/marctang/clld/tangclassifiers/tangclassifiers/scripts/initializedb.py", line 84, in main
    key=lambda v: (v['parameterReference'], v['id'])),
  File "/home/marctang/clld/tangclassifiers/tangclassifiers/scripts/initializedb.py", line 20, in iteritems
    cmap = {cldf[t, col].name: col for col in cols}
  File "/home/marctang/clld/tangclassifiers/tangclassifiers/scripts/initializedb.py", line 20, in <dictcomp>
    cmap = {cldf[t, col].name: col for col in cols}
  File "/home/marctang/.venv/lib/python3.6/site-packages/pycldf/dataset.py", line 565, in __getitem__
    raise KeyError(table)
KeyError: 'CodeTable'
xrotwang commented 4 years ago

Ah, ok. The problem lies in the lines of scripts/initializedb.py created from this template code: https://github.com/clld/clld/blob/03e465c00bfddf1ac3e363d9db2e44609debc116/src/clld/project_template/%7B%7Bcookiecutter.directory_name%7D%7D/%7B%7Bcookiecutter.directory_name%7D%7D/scripts/initializedb.py#L111-L128

It assumes a StructureDataset with a Codes component. There are two ways around this:

I think - while seemingly overkill - the first option makes more sense - and is more transparent. It also allows WALS-like display of feature values on a map with distinctly colored dots, etc.

Of course, the function in initializedb.py could figure out if there is a CodeTable before trying to read it - but that's Python code created from a template, and such code is a bit difficult to write, test and debug, so it should better stay simple. Since there typically is no way around customizing initializedb.py at some point, I thought leaving the default simple - but sometimes, as here, dysfunctional - was an acceptable decision. What do you think? Too much of a turn-off?
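
Just to illustrate what such a guard would look like (a rough sketch, not the actual template code):

from pycldf import Dataset

cldf = Dataset.from_metadata('cldf/StructureDataset-metadata.json')

# pycldf raises KeyError for components a dataset doesn't have -
# that's exactly the error in the traceback above.
try:
    code_table = cldf['CodeTable']
except KeyError:
    code_table = None

if code_table is None:
    # no predefined codes: domain elements would have to be derived from the values
    pass
else:
    # read the codes and create domain elements from them
    pass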

xrotwang commented 4 years ago

@marctang I'll put together a PR tomorrow, adding a CodeTable here. Then import into the app should work.

marctang commented 4 years ago

@xrotwang Thanks again for the explanation :-) I agree that adding the CodeTable would be more transparent and better for future development too. Let me know if there is anything I can help with. I could potentially add that table with R, but it is probably better if everything is done with the same code based on your PR tomorrow. Thanks again for your help!

xrotwang commented 4 years ago

@marctang just added a CodeTable (see https://github.com/cldf-datasets/tangclassifiers/commit/de6e87bda980ee77423d87723f9fba376c6f01ec#diff-d9858e8d7e38dec098e24d087bd3c536). With this output, clld initdb works (on my end). Will put together a recipe in the CLDF cookbook on how to do that.
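
For other datasets, the gist of it with plain pycldf is roughly the following (a sketch with made-up rows, not the code from the commit above):

from pycldf import StructureDataset

ds = StructureDataset.in_dir('cldf')
ds.add_component('ParameterTable')
ds.add_component('CodeTable')

ds.write(
    ParameterTable=[
        {'ID': 'sortalclassifier', 'Name': 'sortal classifier'},
    ],
    CodeTable=[
        # one code per possible value of each parameter
        {'ID': 'sortalclassifier-yes', 'Parameter_ID': 'sortalclassifier', 'Name': 'yes'},
        {'ID': 'sortalclassifier-no', 'Parameter_ID': 'sortalclassifier', 'Name': 'no'},
    ],
    ValueTable=[
        # values then link to the codes via Code_ID
        {'ID': '1', 'Language_ID': 'ainu1240', 'Parameter_ID': 'sortalclassifier',
         'Value': 'yes', 'Code_ID': 'sortalclassifier-yes'},
    ],
)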

xrotwang commented 4 years ago

@marctang regarding CLDF and R: @SimonGreenhill has an R library for reading CLDF. Fleshing this out to cover the full scope of functionality of pycldf might be cool. But to be honest, I'd prefer that even R users learn the bit of Python necessary to read and write files, since for these things Python seems to be the more mature language (e.g. regarding Unicode, paths on different OSs, etc.).

marctang commented 4 years ago

@xrotwang Super, thanks! Well noted for the R library. I will do what you suggest and get familiar with both Python and R for this. Thanks also for adding the CodeTable; like this I'll be able to add it for other datasets in the future too :-) I just ran clld initdb, served the app with pserve --reload development.ini, and it works! Yay!

Two more questions though @@. Sorry again for being annoying; I'll pay it back by doing my best to teach other people how to do this, so that you don't get swarmed by the same questions.

1) For CLDF: the feature values seem to be duplicated. For both features (sortal classifier and morphosyntactic plural), the value became that of the sortal classifier, e.g. in the original data, Ainu and Abun have sortalclassifiers = yes and morphosyntacticplural = no, but in values.csv they have sortalclassifiers = yes and morphosyntacticplural = yes. When I checked the details in values.csv, it seems that the morphosyntactic plural feature has exactly the same values as the sortal classifier feature.

2) For clld: in the app deployed locally, everything works except the source part, where the references do not show up. Do you have any suggestions as to where I should look? For now, what I did was clone https://github.com/clld/clld.git, follow the tutorial steps to create the skeleton, fill it with clld initdb, and serve it with pserve --reload development.ini. I did not modify the other files in the repository.

Screenshot from 2020-06-11 09-07-58

and when I click on an individual one, I get:

Screenshot from 2020-06-11 09-08-16

xrotwang commented 4 years ago

@marctang re CLDF: Good catch! Will check and add some consistency checking (that's what test.py is for, which is particularly important for larger datasets where eye-balling isn't an option anymore).
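
Such a check could look roughly like this (a hypothetical pytest sketch reading the released CLDF with pycldf - the parameter IDs are assumed, and it is not the actual test.py):

from collections import defaultdict

from pycldf import Dataset


def test_features_are_not_duplicated():
    ds = Dataset.from_metadata('cldf/StructureDataset-metadata.json')

    # map each parameter to its language -> value assignment
    by_param = defaultdict(dict)
    for row in ds.iter_rows('ValueTable', 'languageReference', 'parameterReference', 'value'):
        by_param[row['parameterReference']][row['languageReference']] = row['value']

    # if both features carry identical values for every language,
    # something went wrong in cmd_makecldf
    assert by_param['sortalclassifier'] != by_param['morphosyntacticplural']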

xrotwang commented 4 years ago

@marctang re sources in clld app: See https://github.com/clld/clld/issues/212

xrotwang commented 4 years ago

@marctang At https://github.com/cldf-datasets/tangclassifiers/commit/5a7387860bb17a803bc372bc236d1d7cdbf41729#diff-b284a28710cce90d9d9be3a7f4cabc8e you can see an example how you'd do some consistency checking for the CLDF data. This is mainly to prevent regressions introduced by fiddling with the code in cmd_makecldf.

https://github.com/cldf-datasets/tangclassifiers/commit/5a7387860bb17a803bc372bc236d1d7cdbf41729#diff-354f30a63fb0907d4ad57269548329e3 also hooks these tests up with travis, i.e. whenever a change is pushed to the repository, the committer will get an email notification if tests no longer pass.

marctang commented 4 years ago

Thanks! The output is correct now! Well noted also for the examples of checking the data. For the sources in clld, I replied in the other issue you opened, so I'll close the current issue.

xrotwang commented 4 years ago

@marctang If you find the time, you could review https://github.com/cldf/cookbook/blob/master/recipes/clld/README.md - which should be rather similar to what you did to get started with clld.

marctang commented 4 years ago

Awesome! That's indeed quite similar and even faster in a way since it is directly from the CLDF online data! Thanks!