explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Is it not possible to fetch assets using a script? #7186

Closed by KennethEnevoldsen 3 years ago

KennethEnevoldsen commented 3 years ago

Hi, I have been experimenting with the new project structure, and it is mostly a very convenient format. However, there seem to be only three options for assets: 1) place them manually, 2) git, 3) URL. One might want to use open-source datasets (e.g. Huggingface's datasets) and simply have a script that downloads those files and writes them to the assets directory. This does not seem possible in the current workflow?

(You can naturally add a command, but it seems like the intention is to first fetch assets and then run commands or workflows)

Which page or section is this issue related to?

https://spacy.io/usage/projects#directory

The work on version 3 looks very promising, and I'm looking forward to using it more. Kenneth

adrianeboyd commented 3 years ago

If you want to use assets, I think you'd have to handle this with a custom download script and treat it as a private asset: https://spacy.io/usage/projects#data-assets-private
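To make the private-asset route a bit more concrete, here is a minimal sketch of what such a custom download script might do: write the fetched bytes to the asset destination and report a checksum you could record next to that asset. It assumes the checksum recorded for an asset is an MD5 hex digest, which is my understanding but worth confirming against the docs. The helper names `md5_of_file` and `write_asset` are hypothetical, and how the bytes are obtained (Huggingface's datasets, an API, etc.) is left out on purpose.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with Path(path).open("rb") as f:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def write_asset(dest, data):
    """Write raw bytes to dest (hypothetical helper) and return the file's MD5."""
    dest = Path(dest)
    # Create the assets directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)
    return md5_of_file(dest)
```

Calling something like `write_asset("assets/train.jsonl", data)` (the path is only an example) would give you the digest to compare against whatever checksum you keep for that asset.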

I think you could also skip assets entirely and start with steps that just write to corpus (or wherever). It would then be up to you to confirm the checksums in your custom download scripts as necessary. spacy project would track the checksums for the specified outputs, like corpus/train.spacy, for the following steps in the project. It should be pretty flexible?
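As a rough sketch of that second option, a first step could be a small script that writes the downloaded records into the corpus directory and does its own integrity check. Everything here is illustrative: `write_jsonl` and `verify_sha256` are hypothetical helpers, SHA-256 is just one reasonable choice for a self-managed check, and the records would come from whatever source you use. A later step would still need to convert the JSONL file into the format your training command expects.

```python
import hashlib
import json
from pathlib import Path


def write_jsonl(records, dest):
    """Write an iterable of dicts to dest as JSON Lines (hypothetical helper)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with dest.open("w", encoding="utf8") as f:
        for record in records:
            # sort_keys keeps the output stable, so the checksum is reproducible.
            f.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")
    return dest


def verify_sha256(path, expected=None):
    """Return the file's SHA-256 hex digest; raise if it differs from expected."""
    actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    if expected is not None and actual != expected:
        raise ValueError(f"Checksum mismatch for {path}: {actual} != {expected}")
    return actual
```

The script would be registered as an ordinary command in the project, with the written file listed as one of its outputs so that the following steps can depend on it.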

KennethEnevoldsen commented 3 years ago

Thanks. I simply wanted to make sure I didn't misunderstand the workflow.
