Let me know if you want me to split off any of the changes into its own commit or PR. I thought this might be easier and less spammy for you than creating multiple smaller PRs.
Documentation
Update how to create a new dataset.
Use paths that match the repo's current structure.
Don't specify the repo name in paths, since users can name the directory anything they like when cloning. It's best to specify paths relative to the repo's root, because that is what the project controls; it controls nothing outside of it.
Makefile
There is already a rule on line 31 that will clone the required repos. There's no need to have the clone command in each target as well.
Python Dependencies
Switch from fast-mosestokenizer to opus-fast-mosestokenizer.
This avoids the following error when installing dependencies in my environment, where Python 3.11.4 is installed but fast-mosestokenizer requires < 3.11:
ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 3.1.1.post1 Requires-Python >=2.6, <3; 3.2.0 Requires-Python >=2.6, <3; 3.3.0 Requires-Python >=2.6, <3; 3.4.0 Requires-Python >=2.6, <3
ERROR: Could not find a version that satisfies the requirement fast-mosestokenizer (from versions: none)
ERROR: No matching distribution found for fast-mosestokenizer
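To illustrate why the install fails, here is a minimal sketch of the range check implied by the `Requires-Python >=3.7,<3.11` text in the log. It is a simplified stand-in that compares version tuples, not pip's actual resolver logic, and the function name `satisfies_requires_python` is hypothetical.

```python
import sys

# Bounds copied from the "Requires-Python >=3.7,<3.11" text in the log.
LOWER_INCLUSIVE = (3, 7)
UPPER_EXCLUSIVE = (3, 11)


def satisfies_requires_python(version):
    """Return True if a (major, minor, micro) tuple is >= 3.7 and < 3.11.

    Simplified assumption: only final releases are considered, so plain
    tuple comparison is enough (no pre-release or post-release handling).
    """
    return LOWER_INCLUSIVE <= tuple(version) < UPPER_EXCLUSIVE


if __name__ == "__main__":
    # A 3.10.x interpreter falls inside the range; 3.11.4 does not.
    for candidate in [(3, 10, 12), (3, 11, 4), tuple(sys.version_info[:3])]:
        print(candidate, satisfies_requires_python(candidate))
```

With Python 3.11.4 the check is false, which matches pip finding no installable version of the package.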
EditorConfig
Adds an .editorconfig file. Most code editors and IDEs support this file for workspace settings. Since your editor settings don't match mine, this makes it more convenient to work on the project without introducing inconsistencies in line endings, tabs/spaces, and indentation.
Git Ignore
Adds a .gitignore containing venv/, the directory commonly used for Python virtual environments.
Many Python users prefer venv to manage their environments, so that packages don't conflict between projects or with their operating system. venv is a common name for that directory, so it's safe to ignore it.
As a Debian 12 user myself, I create virtual environments for all projects to avoid conflicts with packages my distribution depends on.
While trying to find my way around the repo for https://github.com/Helsinki-NLP/OPUS-ingest/pull/22, I did a few chores.
Misc
European Medicinces Agency → European Medicines Agency