Factor out pyclts - Githubissues

xrotwang commented 4 years ago

So while we are at making breaking changes elsewhere, why not

factor out pyclts to its own repository, trying to cleanly separate data and code
move these repositories into their own github org

LinguList commented 4 years ago

We discussed this, didn't we? The problem is that the code and data are specifically intertwined here, so finding a way to circumvent this is almost impossible.

tresoldi commented 4 years ago

I would be in favor, separating data and code makes a lot of sense for CLTS (not sure if it "deserves" its own organization, however). It could certainly attract more phonologists/phoneticians.

However, as @LinguList just pointed, the code and data are somewhat co-dependent and would take some time to separate them. Should we have this as a low priority?

LinguList commented 4 years ago

I guess it depends on how fast we have a working example here.

As pyclts is deeply intertwined in lexibank, and I also use it in other applicattions, it is a great advantage of not having to configer the path where the package lies, which is one of the advantages of not separating code and data.

I'd say: if somebody can provide an example for how this could be achieved, one could review that and see how well it works, but I would not want to loose some of the functionality we have there by now, since specifically the adding of new transcription datasets is quite tedious, and the code written there should not be thrown away when changing the structure here.

xrotwang commented 4 years ago

I'll try to come up with a solution. While having data in the package may be simpler in applications, the fact that CLTS is different when compared to Glottolog and Concepticon makes it more complicated as well :)

I also think it's a bit mis-leading if code-bugfix-releases litter the data release timeline.

tresoldi commented 4 years ago

By the way, maybe we should separate not only code from data, but app from library as well? In terms of repository, I mean.

SimonGreenhill commented 4 years ago

Note also that the code in ./app/ is py2 only so that should be updated if it's being refactored.

LinguList commented 4 years ago

Ah, sorry, that's dead code, it can be savely removed, all the py2 code, as it was just used for a cgi server that only runs py2.6

xrotwang commented 4 years ago

So here's the state of affairs:

GitHub org is here: https://github.com/cldf-clts
pyclts is here: https://github.com/cldf-clts/pyclts (wrong REAMDE.md at the moment, will fix)
data is here: https://github.com/cldf-clts/clts

What I did:

move the data directories from https://github.com/cldf/clts/tree/master/src/pyclts to https://github.com/cldf-clts/clts/tree/master/pkg
copy the remaining directories either to pyclts or clts
rewrote the tests such that they only test the code (and a mock data repository at https://github.com/cldf-clts/pyclts/tree/master/tests/repos)
the tests which actually tested the data can now be run with clts --repos ../clts/ test. Running this command currently leads to
```
WARNING lˠ    Sound not generated 
WARNING ᴀ̃/ã̱    Not matching codepoints
WARNING ï    Sound not an alias
```

the commands to create app and package data still exist and do the right thing now.

running clts make_app results in the following diff, probably due to newer lingpy:

diff --git a/pkg/soundclasses/lingpy.tsv b/pkg/soundclasses/lingpy.tsv index db29f90..c2467c3 100644
--- a/pkg/soundclasses/lingpy.tsv
+++ b/pkg/soundclasses/lingpy.tsv
...
-rounded close with-frication central vowel     ʯ       0       0       0       0
0       0
+rounded close with-frication central vowel     ʯ       Y       V       7       V
i       #c86464^M
...

The convenience of basically importing TranscriptionSystem('bipa') is gone, of course, but it isn't too hard now, either:
```
>>> from clts import CLTS
>>> CLTS('path/to/clts').bipa
```

Overall, I still like this setup a lot more. Everything seems a lot more where it belongs:

pyclts/tests actually test the code, not the repository data
the convolution of using pyclts to create parts of pyclts is gone
the double purpose of pyclts to curate clts as well as provide read access to it is still there, but that's ok, and can be expected since it's the same for the other catalogs as well.

LinguList commented 4 years ago

I'll need to have a closer look at everything later, and I am generally okay with that, but I'd say that it is extremely important to find a way to globally fix the path problems, as this is now making it already very difficult to present the code in tutorials (also for concepticon).

either I tell users to git-clone from scratch, which is getting repetitive, and make sure things are somewhere next to each other
or I have to fix the path, which is also very problematic for tutorials, or for writing a paper that uses this code, and which would have to submit all of it

So the general advantage of installing clts and having a tool set for handling with linguistic data for coding is definitely not there at the moment.

While I agree with the advantages of separating the two, I'd suggest to make a compatibility package on top, say, one command that allows me to make an init file, as we do now for cldf-bench, and stores this, so we can, upon installation or first run of pyclts, ask the users to provide their repo, and pyclts would call it and raise an error otherwise.

If we have this initialization-library or script generic enough, we could then also use it for pyconcepticon and pyglottolog (specifically pyconcepticon gives me a difficult time now, I started copy-pasting parts of it to people who wanted to work with it as they did not see that they had to change the path). And we could use that script then also in cldf-bench, checking if paths are properly set.

So to summarize: I felt a major inconveniency since the pyconcepticon was introduced, but I think it can be addressed, and I think it should be addressed, specifically as I saw many users (also in our group) ending up confused. And I'd argue that we should not leave it to cldf-bench but have a light-weight init-organizer on top (that could also be used from lingpy, where we also have some paths and data).

LinguList commented 4 years ago

Ah, in some sense, you can say this is similar to nltk and nltk_data now: users can do something with nltk, but in fact, they need to decide also what data they want.

LinguList commented 4 years ago

Or can we add a little bit of code to concepticon-data, to clts-data, and to glottolog-data, so that it could be shared on pip as well (maybe in condensed form), so that upon installation, it would handle the path problem? I could then import pyclts and also clts_data, which is, nothing more than a path to the data?

xrotwang commented 4 years ago

I'm not sure I like the compat-package idea. To me it seems as if most of the path problems come from the fact that so far we tried to hide the fact that an API needs a path at initialization. So the convenience of inferring default locations, config files, etc. led to intransparence. Since there are many ways of trying to be smart about discovering default locations, and we tried a couple, things broke - not passing a path worked in some situations and didn't in others.

So my idea for a better world:

Explicit is better than implicit.
Make it known more aggressively that "using a catalog" means having a (potentially long-lived) clone of the repsitory locally.

Also, I think that with the convenience before, we basically traded "path problems" for "sync problems" - which don't show up as Exceptions, typically.

xrotwang commented 4 years ago

Regarding making catalog data installable via pip: I think that adds again a layer of obscurity. Also, catalog data will typically be used across multiple virtual environments. With the proposed method of discovery that would mean install the catalog for each env, making sure it's always installed from the same place on disk - so even more potential for confusion.

tresoldi commented 4 years ago

The difference is that nltk does not depend on nltk_data, while libraries like pyclts are effectively useless without the accompanying data (even though one could technically change the code and load only some personal IPA extension, without BIPA).

Maybe it is not the most elegant solution, but I still prefer to have a data or resources folder, install it alongside the code (having the package in setup.py, setting as not zip_safe, and the files in the manifest), and loading data transparently with some _RESOURCE_DIR=path.join(path.dirname(path.dirname(__file__)), "resources") magic.

LinguList commented 4 years ago

But the problem of explaining non-expert python users that they need to git-clone, download, and pip-install things at the same time is too much for many people, and it is also getting too much for me, as I had many of these requests in the past. So even if one is explicit, users may not necessarily see that. Also based on the tutorials I wrote in the past, it became quite difficult that one can no longer write a Python script that just runs, provided data are installed, but that one has to explain that users should change path-names according to their machines there.

We may not need to install things via pip, I mean, the data, but then we need to think of adding functionality as they offer for nltk_data, where the data is downloaded directly, i.e., users don't need to run through git clone.

LinguList commented 4 years ago

Without making it a bit easier to manage data, and to store where the data is in an init file, I see the follwoing problems:

tutorials become tedious to write, as scripts don't work without users changing, and users don't like changing when they run a tutorial the first time
code for replicability needs to either submit all data every time, to keep relative paths possible, or it will loose replicability
we have an increase in emails form people who fail using the tools

And all are cases that actually happen, and quite a lot.

So if we imagine a simple package that downloads the data and stores the paths, pointing to the latest releases, so it would download the data the first time for the users, and remember where they were put, this may already be of help.

In fact, what we have in cldf-bench is largely doing the trick, and I ended up using from pylexibank.__main__ import configure and then pulled concepticon and glottolog from there, since it was so convenient, but one would like the same functionality independent of cldf-bench/lexibank.

xrotwang commented 4 years ago

I'm not sure such a package will be simple. As explained here, such code has the potential of masking/hiding problems beyond our control - such as missing git executables, problems with internet connectivity, etc. If this happens, we add even more confusion. I would have hoped that

telling people to either clone or download a GitHub repository and
expecting them to find their way around their local file system

would be a reasonable minimal requirement.

LinguList commented 4 years ago

It is a major hurdle that prevents people from coding, I am afraid, and makes it more difficult for us, as we will have to keep explaining what people should do, etc. And the article-submission-with-code is a similar case.

In fact, if one says: "assuming you have set up everything correctly [instructions here ...]" I'd just like to avoid that I have to write a "YOUR PATH HERE" into my python scripts.

So if a user downloads clts-data, why not add a simple setup.py that registers a path to the repository? Or using the cldf-bench-configuration and making it independent of cldf-bench?

chrzyki commented 4 years ago

I'm fully with @xrotwang here. I think that this particular issue (code/data separation and accessibility) are something we've got to handle with good tutorials and documentation, as well as examples and cookbooks. Moving this pain into just another Python library is in my opinion just asking for more trouble further down the road.

Re "YOUR PATH HERE": There are multiple ways we can solve this, e.g. configuration files (see pylexibank for script, notebook, and tutorial environments) or cli arguments.

xrotwang commented 4 years ago

Ok, so what exactly can we expect? From what I understand, we cannot expect people to clone repositories. If so, how should people get the catalog data? I'd guess, if we cut git out of the picture, we should do the same with GitHub, thus direct people to download released catalog versions from Zenodo. Correct? If we limit the requirements for some sort of package to

list catalog versions on Zenodo
download a catalog version from Zenodo
make this download known to processing software (presumably using some config file (to be platform agnostic))

then I'd be somewhat optimistic that we can do that.

But the price that comes with it is that we disconnect the catalogs from git, which will make it harder for people to actually collaborate - which at least for Glottolog I would hope for.

SimonGreenhill commented 4 years ago

Personally, I've found running out of a git clone leads to syncing issues (do I have the latest commit, right branch, etc?). I can remember to git pull; git checkout master but I suspect most of our collaborators will find that complicated.

Sorry for jumping in here, but I like the idea of first-run (or an explicit .install(version=x) command) to download the version into an app dir. The user just then needs to know what version of the data they're using, which is good. I'd rather they were explicit about the version of the data rather than the path to the data :)

xrotwang commented 4 years ago

@SimonGreenhill so that sounds like a "yes" to my comment above? Some tool that

knows what's released (by looking up Zenodo)
knows how to download that stuff into appdirs.user_config_dir.cldf / <catalog> / <version> - maybe?
writes a config file with these paths in addition?

xrotwang commented 4 years ago

@SimonGreenhill I'd be sort-of ok with that. The price - again - is that people won't be set up to collaborate on the catalogs - and what could be worse - people will use a different tool than we (or at least I).

SimonGreenhill commented 4 years ago

Yes, I think that's the easiest solution??

I've had a few troubles recently with collaborators not knowing which version of the data they're using ("I downloaded glottolog/dplace a few months ago, has it been updated?", "I clicked download on the website. Where do I put it?"). Something like this would mean they have to explicitly ask for version x, so they must know what version they're using, but the mechanics of that are abstracted from them (and I'd rather our collaborators outside the core team are using explicit released version rather than git revision xyz).

It is annoying that they won't be able to PR as easily, but perhaps a 'dev' version could run a git clone (but that adds more complexity back). We could also flag in README and messages e.g. "These data are $Package $Version from $URL. To add or correct please see $URL".

xrotwang commented 4 years ago

Ok. Say we go with this idea, I'd propose that cldfbench would be this tool. Admittedly, it will pull in quite a few dependencies not needed for this particular task (although we can probably make a couple of these optional), but having yet another tool seems overkill.

xrotwang commented 4 years ago

This would also mean that clldutils.apilib.API would gain a factory method like

@classmethod
def from_name_and_version(cls, name, version):
    ...
    return cls(data_path)

which looks up a cldfbench config file, I guess.

xrotwang commented 4 years ago

And I guess the Catalog class in https://github.com/cldf/cldfcatalog should then also be able to deal with exports/Zenodo downloads of the catalog data - and not expect a git repository that can be queried for metadata using git describe, git tags and git remote.

SimonGreenhill commented 4 years ago

I think it makes sense for cldfbench to have it as an 'install catalog' idea. The Zenodo API looks flakey (most options incl. list releases are 'under development', so it may be easier to use github in which case the catalog could just run git clone on a specific tag?

xrotwang commented 4 years ago

@SimonGreenhill I'm confused now. Didn't we want to cut git out for simplicity? I don't think running git clone from GitPython is the solution to do this - we'd still have to deal with "is git installed?" issues - just obscured by two layers of Python code.

Regarding Zenodo: I think the OAI-PMH interface they offer for communities is stable - and this would be all we need. We'd make sure to add catalog versions to a specific community, then go through OAI-PMH to discover these.

LinguList commented 4 years ago

Hi all, I would like to say: I am fine with git, and users are also nowadays, I think, but I am not fine with having all our personal ways to remember paths to concepticon and pyclts: we need a general config, so I can write code, and, e.g., Mei-Shin is pulling it, and applying it, without changing the path in the script. Yet that concrete and maybe seemingly silly problem is the biggest burden for me.

LinguList commented 4 years ago

One of the reasons why lexibank-repos are so convenient is that we can work on a repo, modify it, and then acccess it again, without having the data in the same place. Again: a functionality that is shipeed with the lexibank-setup.py scripts. I can ask people to follow instructions to install sagartst, and then once I assume that (them following tutorials etc. on how to do it), I can send them snippets with code that work on any folder of their directory, without having to adjust paths.

That's my major issue.

I think the easiest way to solve it is to add a setup.py to concepticon-data, clts-data, etc. so I have access to it's path. I mean: we do the same with cldf-datasets, right? Why not do it with our catalogs?

tresoldi commented 4 years ago

Just to make my suggestion clearer with an example:

put all data in a resources folder in the same git repository
in setup.py, add that directory as one of the packages to be included in the wheel along with the code (there are many ways of doing it, but the dumbest is to just manually list, i.e., packages=["pyclts", "resources"]) -- this also implies that the package won't be zip safe, which can be set with zip_safe=False
add all the files to the MANIFEST.in (just an include resources/* line)
in the code, functions would read and return the data, either internally or exposing to the user, as in this pseudo-code (here only reading vowels):

import csv
from os import path

def read_ts_vowels(name):
  filename = path.join(path.dirname(path.dirname(__file__)), "resources", name, "vowels.tsv")
  with open(filename) as csvfile:
    reader = csv.DictReader(csvfile, delimiter="\t")
    data = [row for row in reader]

  return data

The functions above would be called internally, so that the API does not change (i.e., things like bipa = TranscriptionSystem("bipa") would still work), and if people want access to them they just need a pyclts.read_ts_vowels("bipa")

This is what I use in most of my projects, such as the alteruphono sound changer https://github.com/tresoldi/alteruphono . I find it easier because code and data are separated but in the same repository, when releasing a wheel it would be installed automatically with pip, allowing easier branching/versioning as well as testing.

LinguList commented 4 years ago

Please check, for convenience, this tutorial on clts I wrote one month ago. My code says:

from pyclts import TranscriptionSystem
bipa = TranscriptionSystem('bipa')
sound = bipa['ts']
for i, (k, v) in enumerate(sorted(sound.featuredict.items())):
    print('{0:5} | {1:22} | {2:10}'.format(i+1, k, v or '-'))

Now all I want to keep is that I never have to put in my tutorials or code I write with people together, a direct path from my system or a relative path. So I am fine with a solution like:

from pyclts import TranscriptionSystem
from cltsdata import data_path as clts_path
bipa = TranscriptionSystem(clts_path('bipa'))
sound = bipa['ts']
for i, (k, v) in enumerate(sorted(sound.featuredict.items())):
    print('{0:5} | {1:22} | {2:10}'.format(i+1, k, v or '-'))

But I think we need at least this functionality for all of our catalogs.

LinguList commented 4 years ago

So, one more attempt, to be precise, @chrzyki said:

Re "YOUR PATH HERE": There are multiple ways we can solve this, e.g. configuration files (see pylexibank for script, notebook, and tutorial environments) or cli arguments.

And this is exactly, where I want ONE solution that handles this for all catalogs, so we can use it when teaching and when collaborating on our tools.

xrotwang commented 4 years ago

@tresoldi distributing data in python packages would work with CLTS, but I wouldn't bet on pypi to host 0.5 GB python packages for the Glottolog data. It also means different sets of instructions depending on the platform you are using (pypi won't work for R), more complicated releases (no release when pypi is down), etc.

xrotwang commented 4 years ago

@SimonGreenhill @LinguList So, maybe, having cldfbench config take a stab at running git clone would be the cheapest first stab at a solution? Maybe I'm overestimating the problems - I don't know how ubiquitous git is nowadays, and internet tends to be fast everywhere.

This would then make appdirs.user_config_dir('cldfbench') / catalog.ini the standard location to look up such paths from all packages, pyclts, pyglottolog, etc.?

xrotwang commented 4 years ago

Maybe make it appdirs.user_config_dir('cldf') / catalog.ini?

xrotwang commented 4 years ago

Looking at https://github.com/gitpython-developers/GitPython/issues/26 , GitPython seems to work reasonably well with Windows - and I guess MacOS won't be too problematic. So, @SimonGreenhill do you think we can/should expect users to have git installed? I like this solution, since it provides for a smoother learning curve: Once you know a enough about git, just consult the config file for locations of your clones and be ready to contribute to glottolog - not quite - because you won't have a fork ... Anyway, I like it.

tresoldi commented 4 years ago

@xrotwang I was proposing it for pyclts, with the issue of intertwined code & data pointed by @LinguList -- it can be hard enough to explain transcription systems and trascription data, adding a path configuration would make it even worse for beginners. As an API approach (not passing a clts_path or similar), I would still favor such solution, even with data downloaded manually or automatically.

No releases when pypi is down does not seem a problem to me, but you are completely right about the interoperability issues with R, as we'd make harder to actually reuse data. I didn't consider it. But again, either we keep all the data in an independent repository that needs to be cloned somehow, or periodically release the data somewhere so it can be download when needed by the Python and R libraries (hopefully in an automatic way, similar to the nltk approach).

xrotwang commented 4 years ago

@tresoldi yes, I agree that considering pyclts in isolation looks somewhat differently. But the power-users of all these tools will be us, and most of us are involved in the curation of more than this one dataset/catalog. And to me, homogeneity of these datasets and tools is mission-critical. If you come back to pyclts after a couple of months (as I did this weekend), and what you know boils down to "it's different" and "it's intertwined", that's not a good situation. So, I'd ask to sacrifice a bit of end user experience in order to keep us sane.

LinguList commented 4 years ago

The path statement, as long as it is configurable, and we have a tutorial that runs users through it, so I can point them to it, is for me the most burning issue, as its absence does not only break UI, but also current collaboration (we're using the pylexibank.__main__.configure for the time being in our group). But I guess we start agreeing on a solution here, right?

tresoldi commented 4 years ago

@xrotwang to a more practical matter, couldn't we just build the wheels for the data as normal Python packages (like a pyclts_data) and host them somewhere else then pypi? With the immediate first alternative of GitHub itself, even if it's half a Gb.

Having the data as normal packages, with version number, dependencies, etc., would worry me less about problems for debugging cases where the code and the data are out of sync, or people manually sharing zip files with the entire data, and the installation should be easier (as pip would and should take care of downloading and installing everything).

xrotwang commented 4 years ago

@tresoldi I'm really not too keen on providing yet another (in this case even platform-specific) release version of our catalogs, in addition to the one on Zenodo, the one on the GitHub release page and a clone with the proper tag checked out. I see that it would be helpful, if pyclts could declare "works with data better than v1.3", but then I also think being very conservative with backwards compatibility for the data is good anyway, and that makes this point less relevant.

@LinguList Here's my (slightly biased) understanding of what agreement could look like:

We expect users to have git installed.
We expect users to install cldfbench (hopefully with stripped down requirements).

Then they'd run cldfbench config to

either point to clones or create new ones in appdirs.user_config_dir('cldf') / catalogs,
register these paths in appdirs.user_config_dir('cldf') / catalog.ini.

Other packages would consult this config file for default locations. To make this as simple as possible, I'd propose to add code to do this to cldfcatalog, which is very lightweight and can just be added as dependency to all packages. (I wouldn't want to add a separate CLI to cldfcatalog, though, maybe a bit of documentation how to write such a config file by hand.)

That leaves the question of how to handle different versions of the data. With cldfcatalog functionality it's easy to do stuff like:

with Catalog('clts', 'v1.0'):
    ...

but the question is

should this be done outside of instantiating the particular API - e.g. in a wrapper script?
```
with Catalog('clts', 'v1.0') as cat:
  clts = pyclts.CLTS(cat.dir)
```

or should the API know how to do this, e.g..

with pyclts.CLTS(version='4.0') as clts:
  ...

The first option would be simpler and compatible with current versions of the packages. The second option would have the advantage that the package knows which version of data it deals with - so could possibly adapt to differences.

tresoldi commented 4 years ago

+1 for the first option, of instantiating the API.

xrotwang commented 4 years ago

I think I'm for option 1 as well, since this would mean the API just expects a directory - wherever that came from - git export, git clone or unzipped zenodo download. Also, the APIs are typically used from other python code - the exception being user facing code like the concepticon CLI. But for this, future releases of pyconcepticon could just include a small wrapper - exposing a --version option and wrapping API instantiation into the with statement as above.

xrotwang commented 4 years ago

@SimonGreenhill @tresoldi @LinguList @chrzyki I propose to leave this issue now to @LinguList answering to https://github.com/cldf/clts/issues/137#issuecomment-544360176 and refer all further discussions to a PR I'm going to open on https://github.com/cldf/cldfbench

xrotwang commented 4 years ago

one last question, though, before I hack away: clones in appdirs.user_config_dir or appdirs.user_data_dir? (see https://pypi.org/project/appdirs/)

LinguList commented 4 years ago

I'd say: config. As these are config files, right?

xrotwang commented 4 years ago

Unless there are some expectations on some platforms that "config" must be small, I'd say config, too.

LinguList commented 4 years ago

As to #137: I am fine with this suggestion, just had a quick check, and think this will (provided the paths are solved, even increase convenience.

cldf-clts / clts-legacy

Factor out pyclts #137