Open · HSultankhan opened 1 year ago
Hello, I want to create a tokenizer for the Urdu language, and I used this command:
(tpu_data) D:>python IndicBERT/tokenization/build_tokenizer.py --input "D:\IndicBERT\ur.txt" --output "D:\IndicBERT\output" --vocab_size 250000
After this, as per the instructions, I used this command:
(tpu_data) D:>python IndicBERT/process_data/create_mlm_data.py --input_file="D:\IndicBERT\ur.txt" --output_file="D:\IndicBERT\output" --input_file_type=monolingual --tokenizer="D:\IndicBERT\output\config.json"
I got a memory error, and this happened multiple times. Also, this whole pipeline is not using the GPU. Here are my specs:
Processor: i7-9700K (3.6 GHz), RAM: 32 GB, GPU: Nvidia GTX 1660 Ti (6 GB)
I actually have two questions:
First: how can I resolve this memory error? Is there a way to use the GPU, since this preprocessing is not utilizing it, or should I use Google Colab instead?
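For what it's worth, tokenizer training and MLM data creation are typically CPU-bound, so a GPU would not help here; memory errors in this kind of preprocessing usually come from loading the whole corpus at once. A minimal stdlib sketch of the usual workaround, streaming the file in fixed-size line batches (the batch size and helper name are my own assumptions, not part of the IndicBERT scripts):

```python
# Sketch: stream a large corpus in fixed-size line batches instead of
# reading the whole file into memory at once. batch_size is a tunable
# assumption; pick a value that fits in RAM.

def iter_batches(path, batch_size=10_000):
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                batch.append(line)
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:  # flush the final partial batch
        yield batch
```

Each batch can then be passed to the preprocessing step, so peak memory stays proportional to the batch size rather than the corpus size.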
Second: since I only require a tokenizer for the Urdu language, will I have the tokenizer JSON file after the "Preprocess Data" step?
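One way to check this yourself is to inspect what `build_tokenizer.py` actually wrote to the output directory. A small stdlib sketch (the helper name is mine; the only filename taken from the post is `config.json`, which is what the second command passes as `--tokenizer` — any other files in the directory are whatever the script happened to produce):

```python
import json
from pathlib import Path

def inspect_tokenizer_output(out_dir):
    """List the files the tokenizer build wrote and parse its config.json."""
    out = Path(out_dir)
    files = sorted(p.name for p in out.iterdir())
    # config.json is the file the post passes via --tokenizer.
    cfg = json.loads((out / "config.json").read_text(encoding="utf-8"))
    return files, cfg
```

Running this against `D:\IndicBERT\output` would show whether a standalone tokenizer file is already present after the build step, independent of the MLM data creation.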