acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

PyPi package #913

Open mbollmann opened 3 years ago

mbollmann commented 3 years ago

There was some discussion on whether we should turn our anthology library into a PyPI package. This would make it easier for people to use our Python interface to the Anthology, e.g., to build external tools or run analyses. It might even encourage people to contribute and add functionality to the library itself.

Requirements to achieve this (off the top of my head):

  1. A mechanism to download/update the Anthology XML data from within the Python package. Many Python packages download external data as part of their functionality (e.g., NLTK, torchtext), and I've personally used GitPython to do exactly this with the ACL Anthology for my recent Anthology analysis paper. I believe this is completely solvable.

  2. Proper documentation. If we want to promote our Python API in this way, we should have at least succinct, user-friendly documentation that gets people started on how to use it. I believe that might be a good thing to have anyway, to help future volunteers for the Anthology who might work on the Python API. I'd also be happy to help prepare it.

  3. Faster loading as discussed in #835 could be a major factor for usability. I have more ideas in this direction that I want to look into at some point, but maybe it's more of a "nice-to-have" than an actual blocker?
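
For illustration, the download/update mechanism from point 1 could look roughly like the sketch below. It is not the approach from the paper mentioned above: it shells out to the `git` command line instead of using GitPython, purely to stay dependency-free. The function name `sync_anthology_data`, the `run` parameter, and the repository URL are assumptions made for this example.

```python
import subprocess
from pathlib import Path

# Assumption: the XML data is fetched from the main GitHub repository.
ANTHOLOGY_REPO = "https://github.com/acl-org/acl-anthology.git"


def sync_anthology_data(target_dir, repo_url=ANTHOLOGY_REPO, run=subprocess.run):
    """Clone the repository on first use, otherwise fast-forward it.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without network access. Returns the path of the data/
    directory inside the checkout.
    """
    target = Path(target_dir)
    if (target / ".git").is_dir():
        # Existing checkout: only fetch and apply new commits.
        cmd = ["git", "-C", str(target), "pull", "--ff-only"]
    else:
        # First use: a shallow clone keeps the download small.
        cmd = ["git", "clone", "--depth", "1", repo_url, str(target)]
    run(cmd, check=True)
    return target / "data"
```

A caller would run `sync_anthology_data(some_cache_dir)` once before loading the XML; where that cache directory should live is left open here.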

Most importantly, I think it would be great to gauge the community's interest in this. If you'd be interested in and see value in working with Anthology data through a pip-installable library, give a thumbs up here!

akoehn commented 3 years ago

I think most of the work lies in proper versioning and releases. Right now code and data are automatically synchronized because they are in the same repository. However, we cannot guarantee that an old version of the library works with new data (and we should not try to change that), and we currently have no versioning at all.

Extracting the code into its own git repo and embedding it here creates a lot of overhead (speaking from experience with these setups in an academic setting) and I don't know how we would have version numbers & releases while keeping the code in here.

mbollmann commented 3 years ago

Great points, @akoehn.

Versioning would indeed require more thought. We could have a file in data/ indicating the minimum version of the library needed to work with it, so the library could warn its users when it's outdated and no longer compatible with the latest XML. But it'd certainly be more work.
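
A minimal sketch of that idea, assuming a plain-text marker file in data/ that holds a dotted version string. The file name `MIN_LIBRARY_VERSION`, the constant `LIBRARY_VERSION`, and both function names are hypothetical and chosen only for this example.

```python
import warnings
from pathlib import Path

LIBRARY_VERSION = "0.4.0"  # hypothetical version of the installed library


def parse_version(text):
    """Turn '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def check_compatibility(data_dir, library_version=LIBRARY_VERSION):
    """Warn and return False if the data requires a newer library."""
    marker = Path(data_dir) / "MIN_LIBRARY_VERSION"  # hypothetical file name
    if not marker.is_file():
        return True  # no marker present: assume the data is compatible
    required = marker.read_text(encoding="utf-8").strip()
    if parse_version(library_version) < parse_version(required):
        warnings.warn(
            f"Anthology data requires library version >= {required}, "
            f"but {library_version} is installed; please upgrade.",
            stacklevel=2,
        )
        return False
    return True
```

Comparing integer tuples rather than raw strings matters once a component reaches two digits, since "0.10.0" would otherwise sort before "0.9.0". This sketch only handles purely numeric versions; pre-release suffixes would need a real version parser.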

Conversely, though, you could say that the lack of versioning currently makes it less attractive for people to build on our API, since it could change at any moment without clear documentation. That's why I'm wondering how many people would even be interested in this, to see if it makes sense to think about this.

> Extracting the code into its own git repo and embedding it here creates a lot of overhead

Are you thinking of the git submodule approach here? I don't see a lot of problems with just adding our package to this repo's requirements.txt instead, but maybe I haven't fully thought this through.

> I don't know how we would have version numbers & releases while keeping the code in here.

I'm not sure what problems you foresee here; version numbers for the Python package could be kept in a subdirectory where the package lives (say lib/), and releases to PyPI could be triggered manually by us when appropriate.

akoehn commented 3 years ago

> Are you thinking of the git submodule approach here?

No, I meant another repo. The thing is that fixing a bug is straightforward now. With a separate repository, you would need to check out acl-anthology and the anthology code, make changes to the code, publish it locally (or otherwise make sure it is used by acl-anthology), test whether your fix worked, and repeat.

The easiest way would probably be to generate a PyPI package from the current setup, where the core anthology code base lives together with the library part in one repository and we don't have to think about versioning all the time.

mbollmann commented 10 months ago

There's a first usable version of a PyPI library now: https://pypi.org/project/acl-anthology-py/

I'm currently developing this in a separate repo, but I've thought about the versioning issues and think it should probably be moved into this repo, as keeping it in sync with the data format here (XML schema etc.) does seem like a headache otherwise. I don't see a big problem with having version numbers & releases within this repo, though.

Over the coming weeks, I'll prepare a feature branch here that merges in this library, so that we can continue the discussion here.