erichen510 opened 2 years ago
Exactly which URL are you trying to retrieve?
Are you authenticated on gcloud?
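(For reference, gsutil uses the credentials from the gcloud CLI. A minimal check, assuming you have the Cloud SDK installed:)

```bash
# Log in with your Google account; gsutil will pick up these credentials.
gcloud auth login

# Confirm which account is currently active.
gcloud auth list
```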
The exact command is:
```bash
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_social_media/splits/social_train.jsonl social_train.json
```
How do I get authorization on gcloud? Am I supposed to join the project?
You are trying to access a non-open dataset. Where was this linked from?
The link is from https://github.com/NbAiLab/notram/blob/54aeb6b06799cb22119b5c22a24ac3720dd88c40/guides/configure_flax.md?plain=1#L247
I want to pretrain RoBERTa-large on the corpus. If I cannot get the JSONL file, where should I get the original corpus? I notice that https://huggingface.co/datasets/NbAiLab/NCC lists the dataset; could you tell me how to convert the original data to the JSON format required by run_mlm_flax_stream.py?
Sorry, the link is:

```bash
gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_official_short/norwegian_colossal_corpus_train.jsonl norwegian_colossal_corpus_train.json
```
Sorry. There is an internal link in this guide. You should replace this with whatever dataset you have available.
One alternative is of course the NCC (which was released after this tutorial was written).
There are several ways of training on this dataset. Assuming you are using Flax (since you are following the tutorial), a simple way is to specify --dataset_name NbAiLab/NCC
instead of a train and validation file, as sketched below. Another way is to clone the HuggingFace repo and copy/combine the files from it. NCC is already in JSON-lines format, but it is sharded and zipped; if you insist on having the files locally, they should be combined and unzipped (also sketched below).
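A minimal sketch of the streaming approach: the --dataset_name flag comes from the tutorial, but the output directory, config/tokenizer names, and hyperparameters below are placeholders, so check the script's --help for the exact options:

```bash
# Stream NCC directly from the Hub instead of pointing at local files.
python run_mlm_flax_stream.py \
    --output_dir ./roberta-large-ncc \
    --config_name roberta-large \
    --tokenizer_name roberta-large \
    --dataset_name NbAiLab/NCC \
    --max_seq_length 512 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-4
```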
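And a sketch of the local approach; the data/ directory and the train*/validation* shard patterns are assumptions here, so check the repo listing for the actual file names:

```bash
# Clone the dataset repo (the shards are tracked with git-lfs).
git lfs install
git clone https://huggingface.co/datasets/NbAiLab/NCC
cd NCC

# Combine the gzipped JSON-lines shards into single uncompressed files.
zcat data/train*.json.gz > ncc_train.jsonl
zcat data/validation*.json.gz > ncc_validation.jsonl
```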
Early next year, we will also place the NCC in an open gcloud bucket.
The error message is:

```
does not have storage.objects.list access to the Google Cloud Storage bucket.
```