explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.17k stars 4.4k forks source link

Example projects not cross-OS #6957

Open BramVanroy opened 3 years ago

BramVanroy commented 3 years ago

How to reproduce the behaviour

I was planning to have a look at the example projects and quickly found that these are very Linux oriented. The commands are all Linux commands (mv, mkdir, etc.). Of course, there are ways around (WSL, other CLIs), but generally, these commands do not work well on Windows. This is not really a bug (it is expected) but I wasn't sure how to tag this differently.

Preferably, all commands in script are cross-platform so that they can be run on any platform. These should be able to be replaced by Python -c commands to ensure cross-compatibility. I am aware that these are intended as example projects, but especially for new users it would be great if these "just work" cross-platform. Something like the following should work I think (untested).

python -c "from pathlib import Path; Path('corpus/${vars.treebank}').mkdir(exist_ok=True)"
python -c "from pathlib import Path; Path('corpus/${vars.treebank}/${vars.train_name}.spacy').rename('corpus/${vars.treebank}/train.spacy)'"
python -c "from pathlib import Path; Path('corpus/${vars.treebank}/${vars.dev_name}.spacy').rename('corpus/${vars.treebank}/dev.spacy)'"
python -c "from pathlib import Path; Path('corpus/${vars.treebank}/${vars.test_name}.spacy').rename('corpus/${vars.treebank}/test.spacy)'"

Info about spaCy

svlandeg commented 3 years ago

Hm, I see your point, though I think most Windows users will work in a CLI environment that does support these operations? At least this is why we haven't caught this before: I'm probably the only Windows-user on the team and the projects work just fine for me :p How does one live without mv and mkdir? ;p

BramVanroy commented 3 years ago

Well, I switch. For programming, I use PowerShell 7 for GPU stuff, and WSL for other stuff. PowerShell has these commands, but for instance for rm it will throw an error when you use rm -rf.

I came across this error when I ran a command from within the terminal inside PyCharm. I guess I never swapped it out for PowerShell. 🤓

That being said, I still think that examples should speak to a general audience and especially new people or not-so-technical people will run into problems with this. And yes, I do think that not-so-technical people will try their hand at training their own model. Many of my colleagues or people in digital humanities want to use all these cool things, but they do not have a lot of experience or background with it. I'd argue that we should make examples as broad and accessible as possible, but I understand that there are many counter-arguments to this.

We can leave this open, and when I find the time I can look into doing a cross-platform pull request?

svlandeg commented 3 years ago

Sure! PRs would definitely be appreciated :-)

svlandeg commented 3 years ago

The first PR to start resolving this, https://github.com/explosion/projects/pull/42, fixes the textcat_goemotions project. More PRs are always welcome ;-)

milan101 commented 3 years ago

I'm having the same issue with the textcat_geomotions project. Getting the following error:

NotFoundError: [E970] Can not execute command 'mkdir -p training/cnn'. Do you have 'mkdir' installed?

Other details:

Happy to test something if required.

# Name Version Build Channel _pytorch_select 0.1 cpu_0 aiohttp 3.7.4 py38h294d835_0 conda-forge async-timeout 3.0.1 py_1000 conda-forge attrs 20.3.0 pyhd3deb0d_0 conda-forge blas 1.0 mkl conda-forge boto 2.49.0 py_0 conda-forge boto3 1.17.18 pyhd8ed1ab_0 conda-forge botocore 1.20.18 pyhd8ed1ab_0 conda-forge brotlipy 0.7.0 py38h294d835_1001 conda-forge bz2file 0.98 py_0 conda-forge ca-certificates 2020.12.5 h5b45459_0 conda-forge cachetools 4.2.1 pyhd8ed1ab_0 conda-forge catalogue 2.0.1 py38haa244fe_2 conda-forge certifi 2020.12.5 py38haa244fe_1 conda-forge cffi 1.14.5 py38hd8c33c5_0 conda-forge chardet 4.0.0 py38haa244fe_1 conda-forge click 7.1.2 pyh9f0ad1d_0 conda-forge colorama 0.4.4 pyh9f0ad1d_0 conda-forge cryptography 3.4.4 py38hb7941b4_0 conda-forge cudatoolkit 11.2.1 h7a7aa70_8 conda-forge cymem 2.0.5 py38h885f38d_1 conda-forge cython-blis 0.7.4 py38h347fdf6_0 conda-forge dataclasses 0.7 pyhb2cacf7_7 conda-forge google-api-core 1.25.1 pyh44b312d_0 conda-forge google-auth 1.24.0 pyhd3deb0d_0 conda-forge google-cloud-core 1.5.0 pyhd3deb0d_0 conda-forge google-cloud-storage 1.19.0 py_0 conda-forge google-crc32c 1.1.2 py38h554a69a_0 conda-forge google-resumable-media 1.2.0 pyhd3deb0d_0 conda-forge googleapis-common-protos 1.52.0 py38h5b57dd5_1 conda-forge grpcio 1.36.0 py38he5377a8_0 conda-forge idna 2.10 pyh9f0ad1d_0 conda-forge intel-openmp 2020.3 h57928b3_311 conda-forge jinja2 2.11.3 pyh44b312d_0 conda-forge jmespath 0.10.0 pyh9f0ad1d_0 conda-forge libblas 3.9.0 8_mkl conda-forge libcblas 3.9.0 8_mkl conda-forge libcrc32c 1.1.1 h0e60522_2 conda-forge liblapack 3.9.0 8_mkl conda-forge libprotobuf 3.15.3 h7755175_0 conda-forge markupsafe 1.1.1 py38h294d835_3 conda-forge mkdir 2020.12.3 pypi_0 pypi mkl 2020.4 hb70f87d_311 conda-forge mkl-service 2.3.0 py38h1e8a9f7_2 conda-forge multidict 5.1.0 py38h294d835_1 conda-forge murmurhash 1.0.5 py38h885f38d_0 conda-forge ninja 1.10.2 h5362a0b_0 conda-forge numpy 1.20.1 py38h0cc643e_0 conda-forge openssl 1.1.1j h8ffe710_0 conda-forge packaging 20.9 pyh44b312d_0 conda-forge pathy 0.4.0 pyhd8ed1ab_0 conda-forge pip 21.0.1 pyhd8ed1ab_0 conda-forge preshed 3.0.5 py38h885f38d_0 conda-forge protobuf 3.15.3 py38h885f38d_0 conda-forge pyasn1 0.4.8 py_0 conda-forge pyasn1-modules 0.2.7 py_0 conda-forge pycparser 2.20 pyh9f0ad1d_2 conda-forge pydantic 1.7.3 py38h294d835_1 conda-forge pyopenssl 20.0.1 pyhd8ed1ab_0 conda-forge pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge pysocks 1.7.1 py38haa244fe_3 conda-forge python 3.8.8 h7840368_0_cpython conda-forge python-dateutil 2.8.1 py_0 conda-forge python_abi 3.8 1_cp38 conda-forge pytorch 1.4.0 cpu_py38ha775e86_0 pytz 2021.1 pyhd8ed1ab_0 conda-forge requests 2.25.1 pyhd3deb0d_0 conda-forge rsa 4.7.2 pyh44b312d_0 conda-forge s3transfer 0.3.4 pyhd8ed1ab_0 conda-forge setuptools 49.6.0 py38haa244fe_3 conda-forge shellingham 1.4.0 pyh44b312d_0 conda-forge six 1.15.0 pyh9f0ad1d_0 conda-forge smart_open 2.2.1 pyh9f0ad1d_0 conda-forge spacy 3.0.3 py38h1266d08_0 conda-forge spacy-legacy 3.0.1 pyhd8ed1ab_0 conda-forge sqlite 3.34.0 h8ffe710_0 conda-forge srsly 2.4.0 py38h885f38d_2 conda-forge thinc 8.0.1 py38h1266d08_1 conda-forge tqdm 4.58.0 pyhd8ed1ab_0 conda-forge typer 0.3.2 pyhd8ed1ab_0 conda-forge typing-extensions 3.7.4.3 0 conda-forge typing_extensions 3.7.4.3 py_0 conda-forge urllib3 1.26.3 pyhd8ed1ab_0 conda-forge values 2020.12.3 pypi_0 pypi vc 14.2 hb210afc_3 conda-forge vs2015_runtime 14.28.29325 h5e1d092_3 conda-forge wasabi 0.8.2 pyh44b312d_0 conda-forge wheel 0.36.2 pyhd3deb0d_0 conda-forge win_inet_pton 1.1.0 py38haa244fe_2 conda-forge wincertstore 0.2 py38haa244fe_1006 conda-forge yarl 1.6.3 py38h294d835_1 conda-forge zlib 1.2.11 h62dcd97_1010 conda-forge
svlandeg commented 3 years ago

@milan101 : that specific case should be fixed by the PR I mentioned, explosion/projects#42 ;-)

ctufts commented 3 years ago

@svlandeg I realized my original PR explosion/projects#42 isn't working as originally thought. The python -c command I added in doesn't actually create the directory, but that line is not necessary as the spacy train command will [create the specified output directory if it doesn't exist](https://github.com/explosion/spaCy/blob/af9d98440714085ceee0c3ae0b92afcc63a87024/spacy/cli/train.py#:~:text=if%20output_path%20is%20not%20None%20and%20not%20output_path.exists()%3A). I figured this out while trying to implement a similar fix for other projects that contain linux cli commands such as the ud_benchmark.

This is an issue with ud_benchmark because the convert command checks for the existence of the output directory and exits), where the train cli command does.

It seems running anything via python -c in the yml runs into very strange string parsing issues and fails silently. In the output below you can see the excessive number of quotations added by the yml parsing for python -c "import os; os.makedirs(os.path.join('training', '${vars.config}'))" :

Running command: 'C:\Users\ctufts\Anaconda3\envs\spacyProjectsBug\python.exe' -c '"import os; os.makedirs(os.path.join('"'"'training'"'"', '"'"'cnn'"'"'))"'
Running command: 'C:\Users\ctufts\Anaconda3\envs\spacyProjectsBug\python.exe' -m spacy train ./configs/cnn.cfg -o training/cnn --gpu-id -1
✔ Created output directory: training\cnn

This is why I was able to get the textcat_geomotions project running on windows after my previous PR: the line I added failed to create the directory silently, but was then created in the train cli step.

I can resolve this by either: 1) adding helper scripts (python) for the os commands (mkdir, mv, etc) and replace the mkdir and mv references in the project.yml 's instead of running via python -c 2) altering the cli functions to operate the same way spacy train does, where the output directory is created if it doesn't already exist.

I'm assuming solution 1 is probably less invasive, but figured I'd see if you had any thought on this before I proceed.

svlandeg commented 3 years ago

Hi @ctufts !

Yea, I think you're right that there's an issue here. I think option 1 would be the easiest for now - so just add some preprocessing script that takes care of setting up IO instead of using these Linux commands?

A PR would definitely be welcome!

JBOE22175 commented 3 years ago

Hi, I have the same problem on Windows with "tagger_parser_ud". I am not aware how to rewrite the Unix-scripts to valid windows scripts. Can someone help? e.g. - "mkdir -p corpus/${vars.treebank}"

I tried: - "mkdir corpus\\${vars.treebank}"

which results in: Running command: mkdir 'corpus\UD_English-EWT' This does not work.

svlandeg commented 3 years ago

As discussed on https://github.com/explosion/projects/pull/69: I think we should do an effort to try and avoid these platform-specific commands as much as possible, by either removing them when unnecessary (e.g. spacy train doesn't require an upfront mkdir command for the output as it'll create the dir if it doesn't exist), or by implementing some custom (pre)processing scripts that set up the working directory or delete intermediate files.

polm commented 2 years ago

Related to this, some projects (the tagger_parser ones in pipelines) remove directories that were checked out git repos using rm -rf, but this fails in Windows because some of the files are marked read-only. There are a couple of workarounds for this, like using chmod on the whole directory before deleting it, but I'm not sure if there's a more idiomatic way to do it.