AI4Bharat / IndicBERT

Pretraining, fine-tuning and evaluation scripts for IndicBERT-v2 and IndicXTREME
https://ai4bharat.iitm.ac.in/language-understanding
MIT License

Memory Error: While Preprocessing Tokenizer for Urdu Language #4

Open HSultankhan opened 1 year ago

HSultankhan commented 1 year ago

Hello, I want to create a tokenizer for the Urdu language, and I ran this command:

(tpu_data) D:>python IndicBERT/tokenization/build_tokenizer.py --input "D:\IndicBERT\ur.txt" --output "D:\IndicBERT\output" --vocab_size 250000

[screenshot of the command output]
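(A possible workaround for the memory error, in case the trainer loads the whole corpus into RAM: train on a uniform random sample of lines instead of the full file. A minimal stdlib sketch using reservoir sampling — the sample size and the `ur_sample.txt` file name are just illustrative placeholders, not anything from the repo:)

```python
import random

def sample_lines(path, k, seed=0):
    """Reservoir-sample k lines from a large corpus without loading it all into RAM."""
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                # Replace an existing entry with probability k/(i+1)
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample

# Hypothetical usage: write a 1M-line sample that fits in memory,
# then point build_tokenizer.py at the smaller file instead of ur.txt.
# lines = sample_lines(r"D:\IndicBERT\ur.txt", 1_000_000)
# with open(r"D:\IndicBERT\ur_sample.txt", "w", encoding="utf-8") as f:
#     f.writelines(lines)
```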

After this, as per the instructions, I used this command:

(tpu_data) D:>python IndicBERT/process_data/create_mlm_data.py --input_file="D:\IndicBERT\ur.txt" --output_file="D:\IndicBERT\output" --input_file_type=monolingual --tokenizer="D:\IndicBERT\output\config.json"

[screenshots of the memory error]

This happened multiple times:

[screenshot of the memory error]
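(If the error comes from `create_mlm_data.py` reading the whole input file at once, a streaming, batched read keeps peak memory flat regardless of corpus size. A rough stdlib sketch — the batch size is a placeholder, and what happens per batch depends on the actual script:)

```python
def iter_batches(path, batch_size=10_000):
    """Yield lists of lines so only one batch is held in memory at a time."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

# Hypothetical usage:
# for batch in iter_batches(r"D:\IndicBERT\ur.txt"):
#     process(batch)  # tokenize / write MLM examples for this batch only
```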

Also, this whole pipeline is not using the GPU.
Here are my specs:

Processor: Intel i7-9700K @ 3.6 GHz
RAM: 32 GB
GPU: Nvidia GTX 1660 Ti (6 GB)

I actually have two questions:

1. How can I resolve this memory error? Is there a way to use the GPU, since this preprocessing is not utilizing it, or should I use Google Colab instead?

2. Since I only need a tokenizer for the Urdu language, will I have the tokenizer JSON file after the Preprocess Data step?
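(To check what the tokenizer step actually produced, one way is to list the JSON artifacts in the output directory and confirm they parse — the `config.json` name comes from the `--tokenizer` path above; any other file names are assumptions:)

```python
import json
from pathlib import Path

def check_tokenizer_dir(out_dir):
    """Return every JSON artifact in the tokenizer output dir, parsed."""
    found = {}
    for p in Path(out_dir).glob("*.json"):
        with open(p, encoding="utf-8") as f:
            found[p.name] = json.load(f)
    return found

# Hypothetical usage:
# artifacts = check_tokenizer_dir(r"D:\IndicBERT\output")
# print(sorted(artifacts))  # e.g. expect to see config.json here
```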