Open kalashjindal opened 1 year ago
Added multithreading to download the images much faster.
import os
import threading

import pandas as pd
import requests

# Load the dress-pattern metadata from the CSV.
dress_patterns_df = pd.read_csv('dress_patterns.csv')
dress_patterns = dress_patterns_df.values

category = set(dress_patterns_df['category'])
print(category)

print(os.listdir())
# exist_ok=True lets the script be re-run without failing on existing folders.
os.makedirs('dataset_category', exist_ok=True)

for cat in category:
    print(cat)
    os.makedirs(os.path.join('dataset_category', cat), exist_ok=True)

print(os.listdir('dataset_category'))
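
def download_image(url, category, unit_id, i):
    """Download one image into dataset_category/<category>/<unit_id>.jpg."""
    try:
        r = requests.get(url, allow_redirects=True, timeout=30)
        r.raise_for_status()  # do not save HTTP error pages as .jpg files
        path = 'dataset_category/' + category + '/' + str(unit_id) + '.jpg'
        with open(path, 'wb') as f:
            f.write(r.content)
    except Exception:
        print('ERROR at: ', i)
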
threads = []
for i in range(len(dress_patterns)):
    if i % 5 == 0:
        print(i, '/', len(dress_patterns))
    pattern = dress_patterns[i]
    url = pattern[3]
    unit_id = pattern[0]
    category = pattern[1]
    thread = threading.Thread(target=download_image, args=(url, category, unit_id, i))
    threads.append(thread)
    thread.start()

    # limit the number of threads to 5
    if len(threads) == 5:
        for thread in threads:
            thread.join()
        threads = []

# wait for any remaining threads to finish
for thread in threads:
    thread.join()
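
The batching above waits for all five threads to finish before starting the next five, so one slow download stalls the whole batch. A possible alternative, shown here only as a sketch and not as part of this PR, is the standard library's `concurrent.futures.ThreadPoolExecutor`, which keeps up to `max_workers` downloads in flight at all times. The helper name `run_bounded` and its `jobs`/`worker` parameters are hypothetical; `worker` stands in for a function such as `download_image`, and for this sketch it is assumed to raise on failure rather than print.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_bounded(jobs, worker, max_workers=5):
    """Call worker(*job) for every job using at most max_workers threads.

    Returns a list of (job_index, exception) pairs for jobs that raised,
    sorted by job index. `run_bounded` is a hypothetical helper name.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the job it was created from.
        futures = {pool.submit(worker, *job): i for i, job in enumerate(jobs)}
        for future in as_completed(futures):
            try:
                future.result()  # re-raises any exception from the worker
            except Exception as exc:
                failures.append((futures[future], exc))
    return sorted(failures, key=lambda pair: pair[0])
```

With the data from the script, the jobs could be built as `[(p[3], p[1], p[0]) for p in dress_patterns]`, assuming the same column positions the loop above uses, and passed to a three-argument download function.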
The data CSV contains around 15k images, but the notebook uses only about 10k in total. The model was trained as a binary classifier, but the actual problem is multi-class. The create dataset step only creates the dataset category folder, so where does the dataset category test folder used in the notebooks come from? Reporting an accuracy above 95% without any other metrics to support it statistically is not convincing.
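
To illustrate the point about metrics, here is a minimal sketch of per-class precision, recall, and F1 for a multi-class problem, computed from true and predicted labels in plain Python. The function name `per_class_report` and the example labels are hypothetical and are not taken from the notebooks. Per-class numbers like these can reveal a class the model rarely gets right even when overall accuracy is high.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: {'precision', 'recall', 'f1'}} for each class.

    `per_class_report` is a hypothetical helper, not part of the repository.
    """
    labels = sorted(set(y_true) | set(y_pred))
    # Count how often each (true label, predicted label) pair occurs.
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(t, label)] for t in labels if t != label)
        fn = sum(pairs[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {'precision': precision, 'recall': recall, 'f1': f1}
    return report


# Made-up labels, for illustration only.
y_true = ['floral', 'floral', 'striped', 'plain']
y_pred = ['floral', 'striped', 'striped', 'plain']
for label, scores in per_class_report(y_true, y_pred).items():
    print(label, scores)
```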