yonomitt opened 1 year ago
The labels.tsv file can be found here: https://dagshub.com/DagsHub-Datasets/LAION-Aesthetics-V2-6.5plus/src/main/data/labels.tsv
It has 542,247 rows.
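As a quick sanity check, the row count can be verified locally. This is a sketch that assumes `labels.tsv` has already been downloaded from the link above into the working directory:

```python
import os

def count_rows(path):
    """Count the number of lines (rows) in a file."""
    with open(path) as f:
        return sum(1 for _ in f)

# 'labels.tsv' is assumed to have been downloaded from the link above
if os.path.exists('labels.tsv'):
    print(count_rows('labels.tsv'))
```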
The workaround was to batch the metadata uploads:
```python
from tqdm import tqdm

annotations_file = 'labels.tsv'

# Read all metadata rows up front
all_metadata = []
with open(annotations_file) as f:
    for row in tqdm(f.readlines()):
        image, caption, score = row.split('\t')[:3]
        all_metadata.append((image, {'caption': caption[:255], 'score': score.strip()}))

total = len(all_metadata)
batch = 1000

# Upload the metadata in batches of 1,000 points.
# `ds` is the datasource object created earlier (not shown in this snippet).
for start in tqdm(range(0, total, batch)):
    data = all_metadata[start:start + batch]
    with ds.metadata_context() as ctx:
        for image, metadata in data:
            ctx.update_metadata(image, metadata)
```
I've copied the batching into the metadata upload, uploading in batches of 5k points at a time. Hope that's good enough and we won't need any backend changes.
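The same idea can be sketched as a small generic helper. The names here are illustrative, not the actual DagsHub client API; `batch_size=5000` mirrors the 5k points mentioned above:

```python
def upload_in_batches(items, upload_fn, batch_size=5000):
    """Call upload_fn once per fixed-size slice of items.

    upload_fn is a stand-in for whatever performs the actual
    metadata upload for one batch.
    """
    items = list(items)
    for start in range(0, len(items), batch_size):
        upload_fn(items[start:start + batch_size])
```

Keeping each request bounded in size avoids the single long-running request that never returned in the original report.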
As a stress test, I have a repo with 542,247 images in it and wanted to add metadata to a data source. I ran the following code from a Jupyter notebook:
The first time I ran this, it never returned (I waited several hours). The second time, I got a 502: