Closed KennethEnevoldsen closed 3 years ago
If you want to use assets, I think you'd have to handle this with a custom download script and treat it as a private asset: https://spacy.io/usage/projects#data-asets-private
I think you could also skip assets entirely and start with steps that just write to corpus (or wherever), and it would be up to you to confirm the checksums in your custom download scripts as necessary. spacy project would then track the checksums for the specified outputs, like corpus/train.spacy, for the following steps in the project. It should be pretty flexible?
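As a rough illustration of the "custom download script that confirms its own checksums" idea, here is a minimal sketch. The helper names (file_md5, write_and_verify) and the output path are hypothetical, the payload stands in for whatever your real download returns, and the use of MD5 is an assumption made to match the checksums used for assets; none of this is spaCy API.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_md5(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_and_verify(
    data: bytes, output_path: Path, expected_md5: Optional[str] = None
) -> str:
    """Write downloaded bytes to output_path and optionally verify the checksum.

    Hypothetical helper: `data` stands in for the result of your own
    download logic (for example, an exported open-source dataset).
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_bytes(data)
    actual = file_md5(output_path)
    if expected_md5 is not None and actual != expected_md5:
        # Remove the bad file so later steps cannot pick it up by accident.
        output_path.unlink()
        raise ValueError(
            f"Checksum mismatch for {output_path}: "
            f"expected {expected_md5}, got {actual}"
        )
    return actual


if __name__ == "__main__":
    # Placeholder payload; replace with the bytes your download produces.
    payload = b'{"text": "example"}\n'
    checksum = write_and_verify(payload, Path("corpus/train.jsonl"))
    print(f"Wrote corpus/train.jsonl with md5 {checksum}")
```

A script like this could be run as an ordinary command step that lists the written file among its outputs, so the following steps only depend on the file being present and unchanged.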
Thanks. I simply wanted to make sure I didn't misunderstand the workflow.
Hi, I have been experimenting with the new project structure, and it is mostly a very convenient format. However, the options for assets seem to be limited to three: 1) place them manually, 2) git, 3) URL. One might want to use open-source datasets (e.g. Hugging Face's datasets) and simply have a script that downloads and writes those files to the assets directory. This does not seem possible in the current workflow?
(You can of course add a command, but it seems like the intention is to first fetch assets and then run commands or workflows.)
Which page or section is this issue related to?
https://spacy.io/usage/projects#directory
The work on version 3 looks very promising. Looking forward to using it more. Kenneth