NbAiLab / notram

Norwegian Transformer Model

Could not get the train json by gsutil #1

Open erichen510 opened 2 years ago

erichen510 commented 2 years ago

The error message is: does not have storage.objects.list access to the Google Cloud Storage bucket.

peregilk commented 2 years ago

Exactly what url are you trying to retrieve?

Are you authenticated on gcloud?
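If it is only authentication that is missing, something along these lines usually works (the project id is a placeholder, and note that this only helps if your account has actually been granted access to the bucket):

```bash
# Authenticate your user account with gcloud (opens a browser window).
gcloud auth login

# Verify which account is currently active.
gcloud auth list

# Optional: set a default project (replace with your own project id).
gcloud config set project YOUR_PROJECT_ID

# Check whether the active account can list the bucket at all.
gsutil ls gs://notram-west4-a/
```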

erichen510 commented 2 years ago

The exact command is:

gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_social_media/splits/social_train.jsonl social_train.json

How do I get authorization on gcloud? Am I supposed to join the project?

peregilk commented 2 years ago

You are trying to access a non-open dataset. Where was this linked from?

erichen510 commented 2 years ago

The link is from https://github.com/NbAiLab/notram/blob/54aeb6b06799cb22119b5c22a24ac3720dd88c40/guides/configure_flax.md?plain=1#L247

I want to pretrain a RoBERTa-large model on the corpus. If I cannot get the json file, where should I get the original corpus? I notice that https://huggingface.co/datasets/NbAiLab/NCC lists the datasets; could you tell me how to convert the original data into the json format required by run_mlm_flax_stream.py?

erichen510 commented 2 years ago

Sorry, the command from the guide is:

gsutil -m cp gs://notram-west4-a/pretrain_datasets/notram_v2_official_short/norwegian_colossal_corpus_train.jsonl norwegian_colossal_corpus_train.json

peregilk commented 2 years ago

Sorry, that is an internal link in the guide. You should replace it with whatever dataset you have available.

One alternative is of course the NCC (which was released after this tutorial was written).

There are several ways of training on this dataset. Assuming you are using Flax (since you are following the tutorial), a simple way is to specify dataset_name NbAiLab/NCC instead of a train and validation file. Another way is to clone the Hugging Face repo and copy/combine the files from it. The NCC is already in json format, but it is sharded and compressed. If you insist on having the data locally, the shards need to be combined and decompressed.
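For the first option, a rough sketch: only the dataset_name argument comes from the paragraph above; every other flag is a placeholder and must be adapted to whatever arguments your copy of run_mlm_flax_stream.py actually accepts, and to your own config, tokenizer and hardware.

```bash
# Sketch of streaming NCC directly from the Hugging Face Hub.
# Only --dataset_name is taken from the comment above; the remaining flags
# are illustrative placeholders.
python run_mlm_flax_stream.py \
    --dataset_name="NbAiLab/NCC" \
    --output_dir="./roberta-large-ncc" \
    --config_name="./roberta-large-config" \
    --tokenizer_name="./roberta-large-tokenizer" \
    --max_seq_length=512 \
    --per_device_train_batch_size=32 \
    --learning_rate=2e-4
```

For the second option, a sketch of cloning the repo and combining the shards locally; the shard file names below are an assumption, so check the actual layout of the dataset repo before running.

```bash
# Sketch of getting NCC locally. The glob below is a guess at the shard
# naming; verify it against the files in the dataset repo.
git lfs install
git clone https://huggingface.co/datasets/NbAiLab/NCC
cd NCC

# Concatenate and decompress the gzipped json-lines training shards into one file.
zcat data/*train*.json.gz > norwegian_colossal_corpus_train.json
```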

peregilk commented 2 years ago

Early next year, we will also place the NCC in an open gcloud bucket.