Apress / computer-vision-projects-with-pytorch

Source Code for "Computer Vision Projects with PyTorch" by Akshay Kulkarni, Adarsha Shivananda, and Nitin Ranjan Sharma

Chapter 5: failure while creating embeddings for the whole dataset #3

Closed andysingal closed 1 year ago

andysingal commented 1 year ago

Hi, creating embeddings fails when I run it on the whole dataset.

I was using:

%%time
import swifter

# Apply embeddings to the entire dataset rather than a subset
df_embeddings = df

# Loop through the images to get embeddings
map_embeddings = df_embeddings['image'].swifter.apply(lambda img: vector_extraction(resnetmodel, img))

# Convert each embedding to a Series, giving a DataFrame with one column per dimension
df_embs = map_embeddings.apply(pd.Series)
print(df_embs.shape)
df_embs.head()

I get the error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/PIL/Image.py", line 3240, in open
    fp.seek(0)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'seek'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/swifter/swifter.py", line 310, in apply
    tmp_df = func(sample, *args, **kwds)
  File "<timed exec>", line 8, in <lambda>
  File "/tmp/ipykernel_28/1610022205.py", line 10, in vector_extraction
    img = Image.open(img_path(image_id)).convert('RGB')
  File "/opt/conda/lib/python3.10/site-packages/PIL/Image.py", line 3242, in open
    fp = io.BytesIO(fp.read())
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'read'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/magics/execution.py", line 1325, in time
    exec(code, glob, local_ns)
  File "<timed exec>", line 8, in <module>
  File "/opt/conda/lib/python3.10/site-packages/swifter/swifter.py", line 319, in apply
    timed = timeit.timeit(wrapped, number=N_REPEATS)
  File "/opt/conda/lib/python3.10/timeit.py", line 234, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/opt/conda/lib/python3.10/timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "/opt/conda/lib/python3.10/site-packages/swifter/swifter.py", line 227, in wrapped
    self._obj.iloc[self._SAMPLE_INDEX].apply(func, convert_dtype=convert_dtype, args=args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "<timed exec>", line 8, in <lambda>
  File "/tmp/ipykernel_28/1610022205.py", line 10, in vector_extraction
    img = Image.open(img_path(image_id)).convert('RGB')
  File "/tmp/ipykernel_28/2131791179.py", line 20, in img_path
    return DATASET_PATH+"images/"+img
TypeError: can only concatenate str (not "float") to str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 2105, in showtraceback
    stb = self.InteractiveTB.structured_traceback(
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1396, in structured_traceback
    return FormattedTB.structured_traceback(
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1287, in structured_traceback
    return VerboseTB.structured_traceback(
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1159, in structured_traceback
    formatted_exceptions += self.format_exception_as_a_whole(etype, evalue, etb, lines_of_context,
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1030, in format_exception_as_a_whole
    self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 1122, in get_records
    FrameInfo(
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/ultratb.py", line 766, in __init__
    ix = inspect.getsourcelines(frame)
  File "/opt/conda/lib/python3.10/inspect.py", line 1121, in getsourcelines
    lines, lnum = findsource(object)
  File "/opt/conda/lib/python3.10/inspect.py", line 958, in findsource
    raise OSError('could not get source code')
OSError: could not get source code

I even tried running on a GPU, but it still fails. Have you tried running it? Looking forward to hearing from you. Thanks, Ankush Singal

nitinranjansharma commented 1 year ago

I was not able to replicate the issue; it worked fine for me. Are the image files accessible? Can you run the function vector_extraction(resnetmodel, img) once on its own and check whether you are able to extract features first?
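
One way to act on this suggestion is to call the function on each row separately and record which rows raise, instead of letting a single failure abort the whole apply. The sketch below is only an illustration: find_failing_rows and fake_img_path are hypothetical names, and fake_img_path is a stand-in that mimics the string concatenation seen in the traceback rather than the notebook's real img_path.

```python
def find_failing_rows(items, func):
    """Call func on each (label, value) pair and collect the failures."""
    failures = []
    for label, value in items:
        try:
            func(value)
        except Exception as exc:  # broad on purpose: the goal is only to locate bad rows
            failures.append((label, value, exc))
    return failures


# Hypothetical stand-in for the notebook's path builder, which concatenates strings.
def fake_img_path(img):
    return "images/" + img


# A tiny sample with one missing value, keyed by row label.
sample = {0: "1163.jpg", 1: float("nan"), 2: "1164.jpg"}

failures = find_failing_rows(sample.items(), fake_img_path)
for label, value, exc in failures:
    print(label, repr(value), type(exc).__name__, exc)
```

In the notebook, the same helper could be given df['image'].items() and a lambda wrapping vector_extraction(resnetmodel, img). Checking only the path-building step first would be much faster than computing every embedding.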

andysingal commented 1 year ago

Thanks for your quick reply. The image files are accessible. I tried running on Kaggle: it works fine on 5000 images but fails on the whole dataset. Here is the link to my code: https://www.kaggle.com/code/alphasingal/fashion-dataset-recommendation

The problem occurs in this part:

%%time
import swifter

# Apply embeddings to the entire dataset rather than a subset
df_embeddings = df

# Loop through the images to get embeddings
map_embeddings = df_embeddings['image'].swifter.apply(lambda img: vector_extraction(resnetmodel, img))

# Convert each embedding to a Series, giving a DataFrame with one column per dimension
df_embs = map_embeddings.apply(pd.Series)
print(df_embs.shape)
df_embs.head()

The code is available on Kaggle; I would really appreciate it if you could look into it. Thanks, Ankush Singal

nitinranjansharma commented 1 year ago

In this version of the data, there is an error with the 6696th data point. If you skip it, the embeddings will be created as you wanted.
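
The final TypeError in the traceback (can only concatenate str (not "float") to str) is consistent with a missing (NaN) value in the image column, although the thread does not confirm exactly what is wrong with that row. Assuming that is the cause, a minimal sketch of skipping such rows is to filter on a small predicate before computing embeddings. is_usable_image_name and its images_dir parameter are hypothetical and not part of the book's notebook.

```python
import os


def is_usable_image_name(value, images_dir=None):
    """Return True if value looks like a usable image file name.

    Hypothetical helper for illustration only.
    """
    # Missing entries in a pandas column usually appear as float NaN or None,
    # neither of which is a string.
    if not isinstance(value, str) or not value.strip():
        return False
    # Optionally confirm that the file actually exists on disk.
    if images_dir is not None:
        return os.path.isfile(os.path.join(images_dir, value))
    return True


# Small demonstration with the kinds of values a broken row might hold.
names = ["1163.jpg", float("nan"), None, "", "1164.jpg"]
usable = [name for name in names if is_usable_image_name(name)]
print(usable)
```

With the notebook's DataFrame, the same predicate could be applied as df_embeddings = df[df['image'].map(is_usable_image_name)] before the swifter.apply call, so that only rows with a string file name are passed to vector_extraction.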