zhlhlhlhl commented 2 months ago

When I run feature_extraction.py, it needs an argument '--dataset-root'. What does this stand for? I have the raw data path and a processed one containing a list of files like 'imagenet_train-001281-train.tar'. It seems '--dataset-root' should have a subfolder named 'images'. Should I unzip the data and store it in 'images'?

elicassion commented 2 months ago

Hi @zhlhlhlhl

Please refer to our dataset format note for the dataset structure. --dataset-root is the top-level directory that holds all your datasets, i.e., the <root_dir_to_hold_datasets> in the aforementioned note. (Sorry that the link to the dataset format note does not work in the main README! I will fix it soon.)

The tar files are good. You can move them to <root_dir_to_hold_datasets>/imagenet/images. If you don't have an images folder, you can create one and move your tar files into it. This folder holds the images of this dataset. You don't need to untar the files. The generated features will go to other folders at the same level as images. For example, ViT features will go to <root_dir_to_hold_datasets>/imagenet/google_vit-huge-patch14-224-in21k
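
As a rough sketch of the layout described above, the folder could be prepared like this. The helper name `prepare_dataset_dir` and any paths passed to it are placeholders for illustration, not part of the repository:

```python
import shutil
from pathlib import Path


def prepare_dataset_dir(root: str, dataset: str, tar_files: list[str]) -> Path:
    """Move tar shards into <root>/<dataset>/images without unpacking them.

    `root` plays the role of <root_dir_to_hold_datasets>, i.e. the value
    that would be passed as --dataset-root.
    """
    images_dir = Path(root) / dataset / "images"
    # Create <root>/<dataset>/images if it does not exist yet.
    images_dir.mkdir(parents=True, exist_ok=True)
    for tar in tar_files:
        # The shards stay as .tar files; they are only moved, not extracted.
        shutil.move(str(tar), str(images_dir / Path(tar).name))
    return images_dir
```

For example, `prepare_dataset_dir("/data", "imagenet", ["imagenet_train-001281-train.tar"])` would leave the shard at `/data/imagenet/images/imagenet_train-001281-train.tar`.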

zhlhlhlhl commented 2 months ago

Hi, thanks for the dataset format note; I had not found it before. Now I encounter another issue:

    Traceback (most recent call last):
      File "/opt/conda/envs/theia/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/opt/conda/envs/theia/lib/python3.10/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/workspace/theia/src/theia/scripts/preprocessing/feature_extraction.py", line 324, in feature_extractor
        raise e
      File "/workspace/theia/src/theia/scripts/preprocessing/feature_extraction.py", line 320, in feature_extractor
        tar_writer.write(sample)
      File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 356, in write
        obj = self.encoder(obj)
      File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 251, in g
        return encode_based_on_extension(sample, handlers)
      File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 220, in encode_based_on_extension
        return {
      File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 221, in <dictcomp>
        k: encode_based_on_extension1(v, k, handlers) for k, v in list(sample.items())
      File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 189, in encode_based_on_extension1
        raise ValueError("the values of metadata must be of string type")
    ValueError: the values of metadata must be of string type

I changed the code and added:

                            for key, value in sample.items():
                                if not isinstance(value, (str, bytes)):
                                    print(f"Converting key: {key}, value: {value}, type: {type(value)}")
                                    sample[key] = str(value)
                            try:
                                tar_writer.write(sample)

But it still raises that error. Do you know what the possible reason is? BTW, can I add you on WeChat, if you have an account?

elicassion commented 2 months ago

First, make sure you are using webdataset either from webdataset@git+https://github.com/elicassion/webdataset.git@elicassion/fix_shuffle_bug or at version 0.2.90.

Second, if the webdataset version is correct, please check the content of this sample variable. There could be something wrong with the content of sample. To be specific, there could be something wrong here: https://github.com/bdaiinstitute/theia/blob/d196266c5c255b1506fe2c14486bbea0a8207a09/src/theia/scripts/preprocessing/feature_extraction.py#L273

According to the error, I think a non-string value is given to key __key__, or other fields starting with _. This causes webdataset writer to think the value of this key belongs to metadata and is supposed to be str type. See https://github.com/webdataset/webdataset/blob/7e33c40825c5157a0c5c4def050d4692b491b68f/webdataset/writer.py#L187
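
To narrow this down, a small diagnostic along the following lines may help. It is only a sketch based on my reading of the linked writer.py check (keys starting with _ must have str values); `find_bad_metadata` is a hypothetical helper, not part of webdataset or theia:

```python
from typing import Any


def find_bad_metadata(sample: dict[str, Any]) -> list[str]:
    """Return the keys that start with "_" but whose values are not str.

    Assumption: the writer treats every key starting with "_" as metadata
    and rejects any non-str value for it, which matches the ValueError in
    the traceback above.
    """
    return [
        key
        for key, value in sample.items()
        if key.startswith("_") and not isinstance(value, str)
    ]
```

Calling it on the sample right before tar_writer.write(sample) and printing the result should show which key is being treated as metadata. Note that bytes values also count as non-string here, so a check that lets bytes through would not catch them.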

zhlhlhlhl commented 2 months ago

Thanks for your kind analysis. You're right. I use local teacher models, so the path is like "/workspace/models/facebook/dinov2-large". Because of model_names_legit = args.model.replace("/", "_"), model_names_legit begins with "_", which causes the later problems. I modified it to:

    model_names_legit = args.model.replace("/", "_")
    # avoid model_names_legit starting with "_"
    if model_names_legit.startswith("_"):
        model_names_legit = model_names_legit[1:]

Then it works. Thank you!
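
For anyone hitting the same thing, the fix can also be written as a small standalone function. This is just a sketch; `legit_model_name` is a name made up here, and it uses lstrip("_"), which removes every leading underscore rather than only the first one:

```python
def legit_model_name(model: str) -> str:
    """Turn a model name or local path into a name without "/" or a leading "_".

    A leading "_" would make the webdataset writer treat the key as metadata.
    """
    return model.replace("/", "_").lstrip("_")
```

With the local path above, `legit_model_name("/workspace/models/facebook/dinov2-large")` gives `"workspace_models_facebook_dinov2-large"`, while a hub-style name such as `"facebook/dinov2-large"` still becomes `"facebook_dinov2-large"`.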

elicassion commented 2 months ago

I didn't consider the local teacher model case. It's great that you figured it out. That's cool! I will close the issue.

bdaiinstitute / theia

dataset_root meaning #7
