YaleDHLab / pix-plot

A WebGL viewer for UMAP or TSNE-clustered images
MIT License
593 stars 139 forks source link

Improper duplicate name validation in filter_images() #270

Closed CarloOtano closed 1 year ago

CarloOtano commented 1 year ago

The filter_images function is supposed to validate that the image names are unique, but it is currently validating that the image paths are unique.

For example the following folder structure will not raise an error: jpg_parent_folder ---> jpg_sub_folder_1 -------> imageA.jpg -------> imageB.jpg ---> jpg_sub_folder_2 -------> imageA.jpg -------> imageC.jpg

Since the images location can be provided as a glob, more than one sub directory can be specified: --images "jpg_parent_folder/*/.jpg"

The although the names are not unique, the image paths are unique:

One simple solution might be:

for i in stream_images(image_paths=get_image_paths(**kwargs)):
    filename = clean_filename(i.path)
    if filename in image_paths:
      duplicates.add(filename)
    image_paths.add(filename)

https://github.com/YaleDHLab/pix-plot/blob/48c8e6b3732ada3291ee2f6ebffce934fe4faee2/pixplot/pixplot.py#L195-L207

duhaime commented 1 year ago

Yes this has been a challenge in the past.

When we copy the filenames to the output directory, do we use just the filename or the full path?

If the latter, we should also use the full path when deduping just as you suggest. If we copy just the basename, we'll need to use the full path when copying the images to the output (like replacing / with - or similar).

CarloOtano commented 1 year ago

When we copy the filenames to the output directory, do we use just the filename or the full path?

You use just the filename when you copy the files. The filename are used in other areas such as the imagelist (and other places).

If the latter, we should also use the full path when deduping just as you suggest...

I am not suggesting we use the full path. Although the ideal solution might be to modify the code to allow the use of files with the same name, that is a more involved process.

For now I am only suggesting we modify the validation to catch duplicate basenames with different paths.

duhaime commented 1 year ago

Ah roger that--makes sense. If you'd want to send up a diff with the code you posted when you created this issue, we can get it merged!

(P.S. Thank you for your help with the project--the team has scattered and resources are limited right now, but it's great to improve the software when there are easy wins to be had!)

CarloOtano commented 1 year ago

Please review PR #271 and let me know if you need additional changes.

(P.S. Thank you all for taking the time to create and improve this project!)

CarloOtano commented 1 year ago

Issues solved in PR #271