dictionaria / pydictionaria

Apache License 2.0

Refactor processing into cldfbench makecldf #24

Closed (xrotwang closed this issue 3 years ago)

johenglisch commented 3 years ago

Hm, I am trying to wrap my head around the command-line interface for pydictionaria. Because of the decentralised CLDFbench approach a lot of the functionality needs to be rethought:

Also, if the *-intern repo does not contain any actual data, it might be worth thinking about getting rid of the submission vs submission-internal split in the folder structure:

+- contributions.json   # submission metadata
+- datasets/            # data downloaded from Github/Zenodo
+- etc/                 # any submission data not part of the cldfbench (whatever that might be)
   +- submission-id-1/
   +- submission-id-2/
   +- …

Then, moving a submission from --internal to published could be done by just flicking a boolean, e.g.:

{
    "contributions": [
        {
            "sid": "tseltal",
            "number": 10,
            "doi": "10.5281/zenodo.3668865",
            "repo": "https://github.com/dictionaria/tseltal",
            "published": true
        },
        …
    ]
}
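
To illustrate the "flick a boolean" idea, here is a minimal sketch of how tooling might read such a file and select or toggle published submissions. It assumes the JSON layout proposed above; the helper names (`load_contributions`, `published`, `set_published`) and the second entry, `example-draft`, are hypothetical and only there for illustration.

```python
import json

# Sample data following the layout proposed above.
# The "example-draft" entry is made up for illustration.
EXAMPLE = """
{
    "contributions": [
        {
            "sid": "tseltal",
            "number": 10,
            "doi": "10.5281/zenodo.3668865",
            "repo": "https://github.com/dictionaria/tseltal",
            "published": true
        },
        {
            "sid": "example-draft",
            "number": 99,
            "published": false
        }
    ]
}
"""


def load_contributions(text):
    """Parse the JSON text and return the list of contribution records."""
    return json.loads(text)["contributions"]


def published(contributions):
    """Return only the records flagged as published (missing flag counts as False)."""
    return [c for c in contributions if c.get("published", False)]


def set_published(contributions, sid, value=True):
    """Flip the 'published' flag of the record with the given submission id."""
    for contribution in contributions:
        if contribution["sid"] == sid:
            contribution["published"] = value
            return contribution
    raise KeyError(sid)


contributions = load_contributions(EXAMPLE)
public_ids = [c["sid"] for c in published(contributions)]
```

With this shape, publishing a submission is a one-field change, and anything that only cares about public data can filter on the flag.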

xrotwang commented 3 years ago

I'd just remove check and ls, and replace new and process as you say. Then refactor the rest as cldfbench subcommands contributed by pydictionaria, via the cldfbench.commands entry point.
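
As a rough sketch of what such a contributed subcommand could look like: the module below assumes the convention that a command module exposes `register(parser)` to declare its arguments and `run(args)` to do the work, and that the package holding such modules is advertised under the `cldfbench.commands` entry point. The command name `release`, its options, and the `main` helper are hypothetical; the cldfbench documentation is the place to check the exact conventions.

```python
import argparse

# Assumed packaging side (setup.py), shown as a comment because it is
# configuration rather than runnable code:
#
#     entry_points={
#         'cldfbench.commands': [
#             'dictionaria=pydictionaria.commands',
#         ],
#     }


def register(parser):
    """Declare the arguments of the hypothetical 'release' subcommand."""
    parser.add_argument("sid", help="submission id, e.g. 'tseltal'")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        default=False,
        help="only report what would be done",
    )


def run(args):
    """Carry out the command; this sketch only builds a status message."""
    action = "would release" if args.dry_run else "releasing"
    return "{} {}".format(action, args.sid)


def main(argv):
    """Stand-in for the host CLI so the module can be tried without cldfbench."""
    parser = argparse.ArgumentParser(prog="release")
    register(parser)
    return run(parser.parse_args(argv))
```

Keeping argument declaration and execution in two small functions means the same module can be driven by the host CLI or, as in `main`, by a plain `argparse` parser in tests.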

xrotwang commented 3 years ago

In particular for the concepticon integration, a cldfbench subcommand will make accessing the Concepticon data simpler.

IrenH commented 3 years ago

I think all of the functionality that check had is now included in makecldf and then some, isn't it? cldf.log and examples.log are far superior to the original 'check' output. I am just not sure whether the media lookup has also been included in the cldf.log, but I think so...? (missing XYZ)

johenglisch commented 3 years ago

So after this, there will be no dictionaria command anymore and the pydictionaria code won't even know what a dictionaria-intern is (because that will only be relevant to the webapp)? I kinda like the sound of that. (^^)

@IrenH Yes, iirc the processing code will just dump any unknown filenames into the cldf.log.

xrotwang commented 3 years ago

Yes, no more pydictionaria CLI. And maybe, at least as far as loading the web app is concerned, dictionaria-intern could be replaced by a simple JSON file that may even live in clld/dictionaria.

IrenH commented 3 years ago

With this whole new workflow, are we still using private repositories for new submissions? Each dictionary is in its own repository now, right? So we just set it from private to public once it is published? Or how will this work? Thanks!

xrotwang commented 3 years ago

Yes. Upon submission, a new, cldfbench-ready, private repository will be created and then made public upon publication.

IrenH commented 3 years ago

that sounds good!