Open BramVanroy opened 3 years ago
Hm, I see your point, though I think most Windows users will work in a CLI environment that does support these operations? At least this is why we haven't caught this before: I'm probably the only Windows-user on the team and the projects work just fine for me :p How does one live without mv
and mkdir
? ;p
Well, I switch. For programming, I use PowerShell 7 for GPU stuff, and WSL for other stuff. PowerShell has these commands, but for instance for rm
it will throw an error when you use rm -rf
.
I came across this error when I ran a command from within the terminal inside PyCharm. I guess I never swapped it out for PowerShell. 🤓
That being said, I still think that examples should speak to a general audience and especially new people or not-so-technical people will run into problems with this. And yes, I do think that not-so-technical people will try their hand at training their own model. Many of my colleagues or people in digital humanities want to use all these cool things, but they do not have a lot of experience or background with it. I'd argue that we should make examples as broad and accessible as possible, but I understand that there are many counter-arguments to this.
We can leave this open, and when I find the time I can look into doing a cross-platform pull request?
Sure! PRs would definitely be appreciated :-)
The first PR to start resolving this, https://github.com/explosion/projects/pull/42, fixes the textcat_goemotions
project. More PRs are always welcome ;-)
I'm having the same issue with the textcat_geomotions project. Getting the following error:
NotFoundError: [E970] Can not execute command 'mkdir -p training/cnn'. Do you have 'mkdir' installed?
Other details:
Happy to test something if required.
@milan101 : that specific case should be fixed by the PR I mentioned, explosion/projects#42 ;-)
@svlandeg I realized my original PR explosion/projects#42 isn't working as originally thought. The python -c
command I added in doesn't actually create the directory, but that line is not necessary as the spacy train
command will [create the specified output directory if it doesn't exist](https://github.com/explosion/spaCy/blob/af9d98440714085ceee0c3ae0b92afcc63a87024/spacy/cli/train.py#:~:text=if%20output_path%20is%20not%20None%20and%20not%20output_path.exists()%3A). I figured this out while trying to implement a similar fix for other projects that contain linux cli commands such as the ud_benchmark
.
This is an issue with ud_benchmark
because the convert
command checks for the existence of the output directory and exits), where the train
cli command does.
It seems running anything via python -c
in the yml runs into very strange string parsing issues and fails silently. In the output below you can see the excessive number of quotations added by the yml parsing for python -c "import os; os.makedirs(os.path.join('training', '${vars.config}'))"
:
Running command: 'C:\Users\ctufts\Anaconda3\envs\spacyProjectsBug\python.exe' -c '"import os; os.makedirs(os.path.join('"'"'training'"'"', '"'"'cnn'"'"'))"'
Running command: 'C:\Users\ctufts\Anaconda3\envs\spacyProjectsBug\python.exe' -m spacy train ./configs/cnn.cfg -o training/cnn --gpu-id -1
✔ Created output directory: training\cnn
This is why I was able to get the textcat_geomotions project running on windows after my previous PR: the line I added failed to create the directory silently, but was then created in the train
cli step.
I can resolve this by either:
1) adding helper scripts (python) for the os commands (mkdir, mv, etc) and replace the mkdir and mv references in the project.yml
's instead of running via python -c
2) altering the cli functions to operate the same way spacy train
does, where the output directory is created if it doesn't already exist.
I'm assuming solution 1 is probably less invasive, but figured I'd see if you had any thought on this before I proceed.
Hi @ctufts !
Yea, I think you're right that there's an issue here. I think option 1 would be the easiest for now - so just add some preprocessing script that takes care of setting up IO instead of using these Linux commands?
A PR would definitely be welcome!
Hi,
I have the same problem on Windows with "tagger_parser_ud". I am not aware how to rewrite the Unix-scripts to valid windows scripts. Can someone help?
e.g.
- "mkdir -p corpus/${vars.treebank}"
I tried:
- "mkdir corpus\\${vars.treebank}"
which results in: Running command: mkdir 'corpus\UD_English-EWT'
This does not work.
As discussed on https://github.com/explosion/projects/pull/69: I think we should do an effort to try and avoid these platform-specific commands as much as possible, by either removing them when unnecessary (e.g. spacy train
doesn't require an upfront mkdir
command for the output as it'll create the dir if it doesn't exist), or by implementing some custom (pre)processing scripts that set up the working directory or delete intermediate files.
Related to this, some projects (the tagger_parser
ones in pipelines
) remove directories that were checked out git repos using rm -rf
, but this fails in Windows because some of the files are marked read-only. There are a couple of workarounds for this, like using chmod
on the whole directory before deleting it, but I'm not sure if there's a more idiomatic way to do it.
How to reproduce the behaviour
I was planning to have a look at the example projects and quickly found that these are very Linux oriented. The commands are all Linux commands (
mv
,mkdir
, etc.). Of course, there are ways around (WSL, other CLIs), but generally, these commands do not work well on Windows. This is not really a bug (it is expected) but I wasn't sure how to tag this differently.Preferably, all commands in
script
are cross-platform so that they can be run on any platform. These should be able to be replaced by Python-c
commands to ensure cross-compatibility. I am aware that these are intended as example projects, but especially for new users it would be great if these "just work" cross-platform. Something like the following should work I think (untested).Info about spaCy