Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Cache behaviour #74

Closed: thfrkielikone closed this issue 2 months ago

thfrkielikone commented 4 months ago

This is not much of a breaking bug, but I'd like to ask whether this is intended behaviour. If I have opusfilter installed as root and then want to use it as a non-root user, I get the following behaviour:

WARNING:opusfilter.opusfilter:Output directory not specified. Writing files to current directory.
INFO:opusfilter.opusfilter:Running step 1: opus_read
The following files are available for downloading:

Traceback (most recent call last):
  File "/usr/local/bin/opusfilter", line 31, in <module>
    of.execute_steps(overwrite=args.overwrite, last=args.last)
  File "/usr/local/lib/python3.12/site-packages/opusfilter/opusfilter.py", line 224, in execute_steps
    self._run_step(step, num + 1, overwrite)
  File "/usr/local/lib/python3.12/site-packages/opusfilter/opusfilter.py", line 289, in _run_step
    self.step_functions[step['type']](parameters, overwrite=overwrite)
  File "/usr/local/lib/python3.12/site-packages/opusfilter/opusfilter.py", line 327, in read_from_opus
    opus_reader = OpusRead(
                  ^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/opustools/opus_read.py", line 196, in __init__
    moses_names = self.of_handler.open_moses_files()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/opustools/opus_file_handler.py", line 45, in open_moses_files
    self.download_files()
  File "/usr/local/lib/python3.12/site-packages/opustools/opus_file_handler.py", line 33, in download_files
    og = OpusGet(**arguments)
         ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/opustools/opus_get.py", line 40, in __init__
    with open(DB_FILE, 'wb') as outfile:
         ^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.12/site-packages/opustools/opusdata.db'

I can provide the .yml, but I don't think it is relevant here. My gripe is that it tries to create the opusdata.db file in a system location. Would something like a dot-directory (~/.local, ~/.cache, ...) be better here? Would that break some use case?

I am running this in a docker+singularity setup where, because of the container nature of the thing, installing as root makes sense (but running as root doesn't, because the singularity env doesn't allow it). I understand that the equivalent location when using a venv would neatly be inside the venv, and that is what I am going to try next for my own purposes. Still, would it be neater if the db file were stored somewhere else?

svirpioj commented 3 months ago

This is apparently also an issue in OpusTools. I don't know exactly why that database file is opened in write mode; maybe @miau1 can comment on this? From a quick look at the code, opus_get seems to have some DB-related options, but opus_read (used by OpusFilter) does not.

If the file needs to be writable, lib doesn't sound like a proper place for it, even inside a venv. A customizable location with a sensible default like ~/.OpusTools/opusdata.db would be better.
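To illustrate the idea of a customizable location with a per-user default, here is a minimal sketch. It is not OpusTools code: the environment variable name `OPUSTOOLS_DB_FILE` and both function names are made up for this example, and only the default path `~/.OpusTools/opusdata.db` comes from the suggestion above.

```python
import os
from pathlib import Path

# Hypothetical environment variable name, used only for this illustration.
ENV_VAR = "OPUSTOOLS_DB_FILE"


def resolve_db_path(explicit=None, env=None):
    """Pick the database path: explicit argument, then env var, then default."""
    env = os.environ if env is None else env
    chosen = explicit or env.get(ENV_VAR)
    if chosen:
        return Path(chosen).expanduser()
    # Per-user default, so a root-owned site-packages is never written to.
    return Path.home() / ".OpusTools" / "opusdata.db"


def ensure_writable_parent(path):
    """Create the parent directory if needed and report whether it is writable."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return os.access(path.parent, os.W_OK)
```

With this kind of lookup, a package installed as root inside a container would still write its database under the home directory of whichever user runs it, unless that user points it elsewhere.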

miau1 commented 3 months ago

The database is opened in write mode to uncompress it the first time opus_get is used. But the database is actually needed only when using the --local_db option, so now the file is uncompressed only the first time the --local_db option is used. Additionally, the default location of the db file is now ~/.OpusTools/opusdata.db. Both of these changes are in the latest version of OpusTools, which is also on PyPI.
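The "uncompress only on first use" behaviour described above can be sketched roughly as follows. This is an illustration, not the actual OpusTools implementation: the function names, the `use_local_db` parameter, and the assumption that the bundled database is gzip-compressed are all hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def ensure_db(compressed, target):
    """Uncompress `compressed` to `target` once; later calls reuse the file."""
    target = Path(target)
    if target.exists():
        return False  # already uncompressed, nothing to write
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumes a gzip archive; the real file format may differ.
    with gzip.open(compressed, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return True


def open_database(compressed, target, use_local_db=False):
    """Touch the filesystem only when the local database is requested."""
    if not use_local_db:
        return None
    ensure_db(compressed, target)
    return Path(target)
```

The point of the design is that a run which never asks for the local database performs no writes at all, so the permission error in the traceback above cannot occur on that path.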

thfrkielikone commented 2 months ago

Thank you very much for solving this small gripe.