Closed zhlhlhlhl closed 2 months ago
Hi @zhlhlhlhl

Please refer to our dataset format note for the dataset structure. `--dataset-root` is the top-level directory that holds all your datasets, i.e., the `<root_dir_to_hold_datasets>` in the aforementioned note. (Sorry that the link to the dataset format note does not work in the main README! I will fix it soon.)

The tar files are fine as they are. You can move them to `<root_dir_to_hold_datasets>/imagenet/images`. If you don't have an `images` folder, you can create one and move your tar files under it. This folder holds the images for this dataset; you don't need to untar the files. The generated features will go to the other folders at the same level as `images`. For example, ViT features will go to `<root_dir_to_hold_datasets>/imagenet/google_vit-huge-patch14-224-in21k`.
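The steps above can be sketched as follows. This is only an illustration; the paths are placeholders created with `tempfile` so the snippet runs anywhere, and you would substitute your real dataset root and raw-shard directory:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical locations -- substitute your real paths.
dataset_root = Path(tempfile.mkdtemp())  # stands in for <root_dir_to_hold_datasets>
raw_dir = Path(tempfile.mkdtemp())       # where your .tar shards currently live
(raw_dir / "imagenet_train-001281-train.tar").touch()  # pretend shard

# Create <root>/imagenet/images and move the shards there, untouched (no untarring).
images_dir = dataset_root / "imagenet" / "images"
images_dir.mkdir(parents=True, exist_ok=True)
for tar_file in raw_dir.glob("*.tar"):
    shutil.move(str(tar_file), images_dir / tar_file.name)

print(sorted(p.name for p in images_dir.iterdir()))
```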
Hi, the dataset format note is really nice; I had not found it before. Now I encounter another issue:
```
Traceback (most recent call last):
  File "/opt/conda/envs/theia/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/envs/theia/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/theia/src/theia/scripts/preprocessing/feature_extraction.py", line 324, in feature_extractor
    raise e
  File "/workspace/theia/src/theia/scripts/preprocessing/feature_extraction.py", line 320, in feature_extractor
    tar_writer.write(sample)
  File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 356, in write
    obj = self.encoder(obj)
  File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 251, in g
    return encode_based_on_extension(sample, handlers)
  File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 220, in encode_based_on_extension
    return {
  File "/opt/conda/envs/theia/lib/python3.10/site-packages/webdataset/writer.py", line 221, in
```
I changed the code and added:

```python
for key, value in sample.items():
    if not isinstance(value, (str, bytes)):
        print(f"Converting key: {key}, value: {value}, type: {type(value)}")
        sample[key] = str(value)
try:
    tar_writer.write(sample)
```

but it still raises that error. Do you know what the possible reason is? BTW, can I add you on WeChat, if you have one?
First, make sure you are using `webdataset` from `webdataset@git+https://github.com/elicassion/webdataset.git@elicassion/fix_shuffle_bug` or version 0.2.90.

Second, if the `webdataset` version is correct, please check the content of the `sample` variable; there could be something wrong with it. To be specific, the problem could be here: https://github.com/bdaiinstitute/theia/blob/d196266c5c255b1506fe2c14486bbea0a8207a09/src/theia/scripts/preprocessing/feature_extraction.py#L273

According to the error, I think a non-string value is given to the key `__key__`, or to another field starting with `_`. This makes the webdataset writer treat the value of that key as metadata, which is supposed to be of `str` type. See https://github.com/webdataset/webdataset/blob/7e33c40825c5157a0c5c4def050d4692b491b68f/webdataset/writer.py#L187
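This rule can be checked before writing. Below is a minimal sketch that mirrors the metadata convention described above (any key starting with `_`, including `__key__`, must already be a `str`); `validate_sample` is a hypothetical helper, not part of the webdataset API:

```python
def validate_sample(sample: dict) -> None:
    """Raise if a metadata field (key starting with "_") is not a str."""
    for key, value in sample.items():
        if key.startswith("_") and not isinstance(value, str):
            raise ValueError(
                f"metadata field {key!r} must be str, got {type(value).__name__}"
            )

good = {"__key__": "000001", "image.jpg": b"\xff\xd8"}
bad = {"__key__": 1, "image.jpg": b"\xff\xd8"}

validate_sample(good)  # passes silently
try:
    validate_sample(bad)  # non-str __key__ triggers the error
except ValueError as e:
    print(e)
```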
Thanks for your kind analysis. You're right: I use local teacher models, so the path looks like "/workspace/models/facebook/dinov2-large", and because of

```python
model_names_legit = args.model.replace("/", "_")
```

`model_names_legit` begins with `_`, which causes the later problems.
I modified it to:

```python
model_names_legit = args.model.replace("/", "_")
if model_names_legit.startswith("_"):
    model_names_legit = model_names_legit[1:]
```

Then it works. Thank you!
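The same fix can be written more compactly. This is just a sketch; `sanitize_model_name` is a hypothetical helper name, and `lstrip("_")` drops all leading underscores (not just one), which is equivalent for absolute local paths:

```python
def sanitize_model_name(model: str) -> str:
    # Replace path separators, then strip any leading underscore left by
    # an absolute local path such as "/workspace/models/facebook/dinov2-large".
    return model.replace("/", "_").lstrip("_")

print(sanitize_model_name("/workspace/models/facebook/dinov2-large"))
# -> workspace_models_facebook_dinov2-large
print(sanitize_model_name("facebook/dinov2-large"))
# -> facebook_dinov2-large
```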
I didn't consider the local teacher model case. It's great that you figured it out. That's cool! I will close the issue.
When I run feature_extraction.py, it needs an argument `--dataset-root`; what does this stand for? Right now I have the raw data path and the processed one containing a list of files like `imagenet_train-001281-train.tar`, and `--dataset-root` should have a subfolder named `images`. Should I untar the data and store it under `images/`?