By default, the full path to each sample image in a created dataset is the concatenation of the parent directory of the JSONL file, a subfolder named "arxiv", and the image path given in the JSONL file. For example, if the dataset path is "/path/to/exp_folder/train.jsonl" and the path to the first sample is "sample_paper/01.png", then the full sample path is "/path/to/exp_folder/arxiv/sample_paper/01.png".
However, this root subfolder name "arxiv" is not mentioned in the dataset creation tutorial in the README (which uses "path/paired/output" or "images" instead), so when I ran the train.py script with my samples in a subfolder called "folder_paired", I got an error.
This PR enables the user to choose any subfolder name as "datasets_root" in the training config file.
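To illustrate the behavior described here, below is a minimal sketch of the path resolution. It is not the actual code in nougat/utils/dataset.py: the helper name resolve_sample_path and its signature are hypothetical, and only the "datasets_root" name and the "arxiv" default come from this PR.

```python
from pathlib import PurePosixPath


def resolve_sample_path(dataset_path, sample_path, datasets_root="arxiv"):
    """Hypothetical helper: join the JSONL file's parent directory,
    the root subfolder, and the sample's relative image path."""
    return PurePosixPath(dataset_path).parent / datasets_root / sample_path


# Previous behavior: the root subfolder is always "arxiv".
default = resolve_sample_path(
    "/path/to/exp_folder/train.jsonl", "sample_paper/01.png"
)
print(default)  # /path/to/exp_folder/arxiv/sample_paper/01.png

# With this PR: the root subfolder is taken from "datasets_root".
custom = resolve_sample_path(
    "/path/to/exp_folder/train.jsonl",
    "sample_paper/01.png",
    datasets_root="folder_paired",
)
print(custom)  # /path/to/exp_folder/folder_paired/sample_paper/01.png
```

Keeping "arxiv" as the default in the sketch reflects the intent that existing setups keep working when "datasets_root" is not set.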
I'm wondering whether that is all that was implied by this TODO: https://github.com/facebookresearch/nougat/blob/47c77d70727558b4a2025005491ecb26ee97f523/nougat/utils/dataset.py#L227