Helsinki-NLP / OPUS-ingest


chores: clean up repo #23

Closed SethFalco closed 1 year ago

SethFalco commented 1 year ago

While trying to find my way around the repo for https://github.com/Helsinki-NLP/OPUS-ingest/pull/22, I did a few chores.

Let me know if you'd like any of these changes split off into a separate commit or PR. I thought a single PR would be easier and less spammy for you than several smaller ones.

Documentation

Update the instructions for creating a new dataset.

Makefile

The rule on line 31 already clones the required repos, so there's no need to repeat the clone command in each target as well.

Python Dependencies

Switch from fast-mosestokenizer to opus-fast-mosestokenizer. This avoids the following error when installing dependencies in my environment, where I have Python 3.11.4 installed but fast-mosestokenizer requires Python < 3.11.

ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 3.1.1.post1 Requires-Python >=2.6, <3; 3.2.0 Requires-Python >=2.6, <3; 3.3.0 Requires-Python >=2.6, <3; 3.4.0 Requires-Python >=2.6, <3
ERROR: Could not find a version that satisfies the requirement fast-mosestokenizer (from versions: none)
ERROR: No matching distribution found for fast-mosestokenizer
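
To illustrate why the install is rejected, here is a minimal sketch of the version comparison, assuming the `>=3.7,<3.11` bounds shown in the log above. The helper name `satisfies_requires_python` is hypothetical, and this is a simplification for illustration, not pip's actual specifier matching.

```python
import sys


def satisfies_requires_python(version, lower=(3, 7), upper=(3, 11)):
    """Return True if `version` falls in the half-open range [lower, upper).

    `version` is a (major, minor, micro) tuple. The default bounds are an
    assumption taken from the `Requires-Python >=3.7,<3.11` entries in the log.
    """
    # Tuples compare element by element, so (3, 11, 4) sorts after (3, 11).
    return lower <= tuple(version[:3]) < upper


if __name__ == "__main__":
    current = tuple(sys.version_info[:3])
    print(current, satisfies_requires_python(current))
```

With these bounds, a 3.11.4 interpreter falls outside the range, so no release is considered installable.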

EditorConfig

Adds an .editorconfig file. Most code editors and IDEs support this file for setting workspace settings. Since your settings don't match mine, this makes it more convenient to work with your project without introducing inconsistencies in line endings, tabs/spaces, and indentation.

Reference: EditorConfig

Git Ignore

Adds .gitignore with venv/ which is for Python virtual environments.

Many Python users prefer to use venv to manage their environments, so packages don't conflict between projects or with their operating system. Users often name the environment directory venv as well, so it's safe to ignore this directory.

As a Debian 12 user myself, I create virtual environments for all projects to avoid conflicts with packages my distribution depends on.

Reference: Python documentation for virtual environments
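
For anyone unfamiliar with the workflow, here is a minimal sketch of creating such an environment with the standard library's `venv` module. The helper name `create_project_venv` is hypothetical, and the `venv` directory name is only the convention described above, not something the repo requires.

```python
import venv
from pathlib import Path


def create_project_venv(project_dir, with_pip=True):
    """Create a virtual environment in `<project_dir>/venv` and return its path.

    The directory name `venv` is an assumed convention; it matches the
    `venv/` entry added to .gitignore.
    """
    env_dir = Path(project_dir) / "venv"
    # venv.create is the stdlib convenience wrapper around venv.EnvBuilder.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


# Example: create_project_venv(".") creates ./venv in the current directory.
```

Because the resulting directory is machine-specific, it is ignored rather than committed.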

Misc