mk-file to download additional data

cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus

https://doreco.info/


mk-file to download additional data #4

Closed FredericBlum closed 1 year ago

FredericBlum commented 2 years ago

I am wondering what the best way might be to add the annotations that are restricted by the ND license. @xrotwang Is there an easy way to create a mk-file that downloads the respective files and converts them into CLDF once the cldfbench workflow is done? This will probably be more relevant for my study than for this CLDF dataset, as we cannot publish this data as CLDF.

FredericBlum commented 2 years ago

The following languages have a restrictive license:

Cabécar
hoocak
light warlpiri
nisvai
northern alta
urum
warlpiri
yucatec maya
yuracaré

xrotwang commented 2 years ago

I think this would be the job of pydoreco. So what I'm imagining is a cldfbench-enabled repo containing a list of relevant corpora - e.g. in etc/corpora.csv - and for your study you could

fork this repo
adapt etc/corpora.csv, adding the restricted corpora
run cldfbench makecldf locally and do your analyses with the resulting CLDF.
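
The "adapt etc/corpora.csv" step could look roughly like the sketch below. It is only an illustration: the column names `ID` and `License`, the corpus identifiers, and the helper functions are all hypothetical, and the real file layout may differ.

```python
import csv
import io

# Hypothetical content of etc/corpora.csv; the real columns may differ.
CORPORA_CSV = """ID,License
corpus_a,CC BY 4.0
corpus_b,CC BY-NC 4.0
"""


def read_corpora(text):
    """Return the rows of a corpora list as a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def add_corpus(rows, corpus_id, license_name):
    """Return a new list with one more corpus appended, e.g. a restricted one in a fork."""
    return rows + [{"ID": corpus_id, "License": license_name}]


corpora = read_corpora(CORPORA_CSV)
# In a fork, a restricted corpus is added to the list before running makecldf locally.
corpora = add_corpus(corpora, "corpus_c", "CC BY-NC-ND 4.0")
print([row["ID"] for row in corpora])
```

With a list like this, the dataset code only needs to iterate over the rows, so a fork differs from the public repository in its data file rather than in its code.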

FredericBlum commented 2 years ago

Do I understand correctly from the code in #10 that the data is never uploaded to GitHub, so we could just remove the if-clause about the ND license? As you suggested in the previous answer, cldfbench makecldf will have to be run locally anyway (also due to the file size of the raw and CLDF tables). But we could still have the metadata and the code ready, without having to include two different setups for the different licenses.

xrotwang commented 2 years ago

No, I would want to upload the CLDF data to GitHub, but this will likely require zipping a couple of files. Once this has been implemented/documented, we should only add the unzipped paths to .gitignore. So, no, we couldn't include the ND annotations in this scenario. But as you say, there could be some sort of switch that allows running the CLDF creation locally with all annotations included.
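
The zipping step could be sketched with the standard library as below. This is not how cldfbench or this repository does it; the file name `phones.csv` and the helper `zip_file` are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_file(path):
    """Write `path` into a sibling .zip archive and return the archive path."""
    path = Path(path)
    archive = path.with_name(path.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store only the file name, not the full directory structure.
        zf.write(path, arcname=path.name)
    return archive


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "phones.csv"  # hypothetical large CLDF table
    table.write_text("ID,Name\n1,a\n", encoding="utf-8")
    archive = zip_file(table)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(archive.name, names)
```

In such a setup the archive would be committed, while the path of the unzipped table would be listed in .gitignore.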

FredericBlum commented 1 year ago

Could this just be an if-clause within the lexibank script that can be switched on with some variable?

If there is a concrete way I can support this, I'll happily do that.

xrotwang commented 1 year ago

The problem is that the call interface for the makecldf command is controlled by cldfbench. So, the next best thing may be an environment setting, i.e. calling

export DORECO_FULL=1
cldfbench makecldf ...

and checking in cldfbench_doreco.py

import os

...
    if os.environ.get('DORECO_FULL') == '1':
        ...
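
A slightly fuller sketch of how such an environment switch might gate the restricted corpora. Only the `DORECO_FULL` check comes from the snippet above. The helper names, the license strings, and the "contains ND" heuristic are assumptions for illustration, not the code in cldfbench_doreco.py.

```python
import os


def full_build_requested(environ=None):
    """True when DORECO_FULL is set to '1' in the given (or the process) environment."""
    environ = os.environ if environ is None else environ
    return environ.get("DORECO_FULL") == "1"


def is_nd_license(license_name):
    """Heuristic (assumption): a license is No-Derivatives if 'ND' is one of its name parts."""
    return "ND" in license_name.upper().replace("-", " ").split()


def select_corpora(corpora, environ=None):
    """Keep all corpora in a full build, otherwise drop the ND-licensed ones."""
    full = full_build_requested(environ)
    return [cid for cid, lic in corpora if full or not is_nd_license(lic)]


# Hypothetical corpus list as (identifier, license) pairs.
corpora = [("corpus_a", "CC BY 4.0"), ("corpus_b", "CC BY-NC-ND 4.0")]
print(select_corpora(corpora, environ={}))
print(select_corpora(corpora, environ={"DORECO_FULL": "1"}))
```

Passing the environment as an argument keeps the check testable without touching the real process environment. The default build stays the publishable one, and the full build has to be requested explicitly.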

FredericBlum commented 1 year ago

We could also set the variable directly in the cldfbench script, right? We did that for the dictionary and wordlist conversion in other repositories. It works exactly the same, but we wouldn't have to set the switch in the environment every time we run cldfbench.

xrotwang commented 1 year ago

You mean interactively, i.e. prompting the user for input? Yes, that's an option, too. May be a case for https://github.com/clld/clldutils/blob/c7293255d516995d06fae07124f7d81731ace815/src/clldutils/clilib.py#L177
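
For illustration, an interactive switch could be sketched with the standard library alone. This does not reproduce the linked clldutils helper. The function name `confirm`, its parameters, and the question text are assumptions.

```python
def confirm(question, default=False, ask=input):
    """Ask a yes/no question and return a bool; empty input returns `default`.

    `ask` is injectable so the prompt can be replaced in tests or batch runs.
    """
    suffix = " [Y/n] " if default else " [y/N] "
    answer = ask(question + suffix).strip().lower()
    if not answer:
        return default
    return answer in ("y", "yes")


# Simulated answers instead of real keyboard input:
include_nd = confirm("Include ND-licensed annotations?", ask=lambda prompt: "y")
skipped = confirm("Include ND-licensed annotations?", ask=lambda prompt: "")
print(include_nd, skipped)
```

Defaulting to "no" means an unattended run produces the publishable dataset without the restricted annotations.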

xrotwang commented 1 year ago

See https://github.com/cldf-datasets/doreco/blob/main/cldfbench_doreco.py#L80-L112