NormXU / nougat-latex-ocr

Codebase for fine-tuning / evaluating nougat-based image2latex generation models
https://arxiv.org/abs/2308.13418
Apache License 2.0

Issue with num_samples #8

Closed by rprasad2 1 month ago

rprasad2 commented 1 month ago

When I run this script with an updated base.yaml:

python tools/train_experiment.py --config_file config/base.yaml --phase 'train'

Here is the issue:

/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-07-17 21:48:44 INFO root base_experiment.py:174 - device:cuda:0, is_master:True, device_ids:[0], is_distributed:False
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-07-17 21:48:48 INFO root donut_experiment.py:145 - init weight from pretrained model:facebook/nougat-base
2024-07-17 21:48:48 INFO root donut_experiment.py:152 - Number of parameter: 348.69M
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/nougat-latex-ocr/tools/train_experiment.py", line 53, in <module>
    main(args)
  File "/home/ubuntu/nougat-latex-ocr/tools/train_experiment.py", line 43, in main
    experiment_instance = getattr(experiment, get_experiment_name(args.experiment_name))(config)
  File "/home/ubuntu/nougat-latex-ocr/experiment/donut_experiment.py", line 30, in __init__
    self.init_dataset(config)
  File "/home/ubuntu/nougat-latex-ocr/experiment/donut_experiment.py", line 176, in init_dataset
    self.train_data_loader = self._get_data_loader_from_dataset(self.train_dataset,
  File "/home/ubuntu/nougat-latex-ocr/experiment/donut_experiment.py", line 220, in _get_data_loader_from_dataset
    data_loader = DataLoader(dataset,
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 350, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 143, in __init__
    raise ValueError(f"num_samples should be a positive integer value, but got num_samples={self.num_samples}")
ValueError: num_samples should be a positive integer value, but got num_samples=0

NormXU commented 1 month ago

@rprasad2 It looks like the dataloader cannot find any data at the given path. Please check the data path.
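
For anyone hitting the same error: the traceback shows RandomSampler receiving a dataset of length 0, so one way to confirm the diagnosis is to check the dataset length before the DataLoader is built. Below is a minimal sketch, assuming the dataset object supports len(). The helper name assert_non_empty and the path "data/train" are hypothetical and not part of this repository.

```python
from collections.abc import Sized


def assert_non_empty(dataset: Sized, data_path: str) -> int:
    """Return len(dataset), or raise a descriptive error if it is empty.

    Hypothetical helper, not part of nougat-latex-ocr. An empty dataset is
    what leads RandomSampler to report num_samples=0, so failing here
    points at the data path instead of at the sampler.
    """
    n = len(dataset)
    if n == 0:
        raise ValueError(
            f"dataset is empty: no samples were loaded from {data_path!r}; "
            "check the data path"
        )
    return n


if __name__ == "__main__":
    # Stand-in for a real dataset; any object with __len__ works.
    samples = ["a.png", "b.png"]
    print(assert_non_empty(samples, "data/train"))
```

Calling a check like this right before the DataLoader is constructed would turn the num_samples=0 message into one that names the path that produced no samples.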