2320sharon opened 11 months ago
`remove_duplicates`, defined in coastsat, removes duplicate shorelines from the shoreline dictionary. It does not remove duplicate tiffs or jpegs.

`handle_duplicate_image_names`, defined in coastsat, renames duplicate tiffs by appending `dup_X` to the file name. These duplicate files are images from the same satellite collection with the same image timestamp. They are temporary tiffs that are later converted into the real tiffs that get saved.

`im_dict_T1["S2"] = filter_S2_collection(im_dict_T1["S2"])`

The `filter_S2_collection` function deletes all the S2 imagery with the same timestamp but different UTM zones; of the images with the same timestamp and the same UTM zone, only the first is kept.

Given that the S2 collection has its duplicates filtered out before the download ever begins, I'm confused how duplicate jpegs for the S2 collection are even being generated. I think it will take some testing to figure out when this happens. It's also possible that each person has a different definition of what a duplicate image is. The coastsat implementation classifies a duplicate as an image from the same satellite collection whose timestamp is less than 24 hours from another image's.
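The first-one-wins filtering described above can be sketched roughly like this (a hypothetical reimplementation for illustration; the real `filter_S2_collection` works on Earth Engine metadata, and the `t`/`utm_zone` field names here are assumptions):

```python
def filter_s2_by_timestamp(images):
    """Keep only the first image seen for each timestamp.

    `images` is a list of dicts; the 't' and 'utm_zone' field names
    are illustrative, not coastsat's actual metadata keys.
    """
    seen = {}  # dicts preserve insertion order in Python 3.7+
    for im in images:
        if im["t"] not in seen:
            seen[im["t"]] = im
    return list(seen.values())

imgs = [
    {"t": "2020-06-01-10-30-21", "utm_zone": "10N"},
    {"t": "2020-06-01-10-30-21", "utm_zone": "11N"},  # same timestamp, different zone -> dropped
    {"t": "2020-06-02-10-30-19", "utm_zone": "10N"},
]
print(len(filter_s2_by_timestamp(imgs)))  # 2
```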
As I write this, I realize the issue might not be the filtering technique but the fact that the collections are in two different tiers. It's possible that there are timestamps that are the same across both tiers for the same satellite, which would produce duplicate imagery even though it should have been filtered out. While the S2 collection does not have two tiers, the other satellites do, so I'm going to do some testing and see if duplicates are arising because of this.
These two jpegs, `2018-12-03-18-39-48_RGB_L8.jpg` and `2018-12-03-18-40-12_RGB_L8.jpg`, were captured on the same day (2018-12-03) at almost the same time (18-39-48 and 18-40-12). Would these images be considered duplicates since they are less than 24 hours apart? @dbuscombe-usgs
We'd want to keep both images. That's super valuable having images on consecutive days!
Duplicates are only when images are identical times
Ah good to know, thanks for helping me double check that.
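For reference, the gap between those two example jpegs can be computed by parsing the timestamp prefix out of the filename (the layout is taken from the examples above):

```python
from datetime import datetime

def parse_capture_time(filename):
    """Extract the capture timestamp from a name like
    '2018-12-03-18-39-48_RGB_L8.jpg'."""
    stamp = filename.split("_")[0]  # '2018-12-03-18-39-48'
    return datetime.strptime(stamp, "%Y-%m-%d-%H-%M-%S")

a = parse_capture_time("2018-12-03-18-39-48_RGB_L8.jpg")
b = parse_capture_time("2018-12-03-18-40-12_RGB_L8.jpg")
print((b - a).total_seconds())  # 24.0 -- distinct captures, not duplicates
```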
When I ran the script below on 700 S2 images I downloaded I didn't find any duplicate imagery. Sometimes the images are only a few minutes apart, but other than that I'm not finding duplicates.
```python
import os
from collections import Counter

file_list = os.listdir('/home/sha23/development/coastseg/CoastSeg/data/ID_kyg1_datetime10-02-23__03_11_52/jpg_files/preprocessed/RGB')
counter = Counter(file_list)
duplicates = {file: count for file, count in counter.items() if count > 1}
# Print the duplicates
for duplicate, count in duplicates.items():
    print(f"Filename: {duplicate} - Count: {count}")
```
I ran this script across all the data I've downloaded and I didn't find any duplicates
```python
import os
from collections import Counter

data_directory = r'C:\development\doodleverse\coastseg\CoastSeg\data'
roi_dirs = os.listdir(data_directory)
for roi_dir in roi_dirs:
    # Join with data_directory so the path is valid regardless of the cwd
    jpeg_directory = os.path.join(data_directory, roi_dir, "jpg_files", "preprocessed", "RGB")
    if os.path.exists(jpeg_directory):
        file_list = os.listdir(jpeg_directory)
        counter = Counter(file_list)
        duplicates = {file: count for file, count in counter.items() if count > 1}
        # Print the duplicates
        for duplicate, count in duplicates.items():
            print(f"Filename: {duplicate} - Count: {count}")
```
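One caveat with the filename `Counter` approach: `os.listdir` never returns the same name twice, so an exact-name check can only catch duplicates across merged or copied directories. A variant that flags images captured within a few seconds of each other (a sketch, assuming filenames start with the `%Y-%m-%d-%H-%M-%S` timestamp seen in the jpeg names above) might look like:

```python
from datetime import datetime

def near_duplicates(filenames, tolerance_seconds=60):
    """Return pairs of filenames whose capture timestamps are within
    `tolerance_seconds` of each other. Assumes names start with a
    '%Y-%m-%d-%H-%M-%S' timestamp, e.g. '2018-12-03-18-39-48_RGB_L8.jpg'.
    """
    stamped = sorted(
        (datetime.strptime(f.split("_")[0], "%Y-%m-%d-%H-%M-%S"), f)
        for f in filenames
    )
    pairs = []
    # Compare each image only to its chronological neighbor
    for (t1, f1), (t2, f2) in zip(stamped, stamped[1:]):
        if (t2 - t1).total_seconds() <= tolerance_seconds:
            pairs.append((f1, f2))
    return pairs

files = [
    "2018-12-03-18-39-48_RGB_L8.jpg",
    "2018-12-03-18-40-12_RGB_L8.jpg",  # 24 s after the previous one
    "2018-12-10-18-39-50_RGB_L8.jpg",
]
print(near_duplicates(files))
```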
@dbuscombe-usgs have you found duplicate imagery in any of the downloads you've performed?
I heard back from Catherine on the duplicate images issue and here is what she said:
> I'm going through the images currently. It appears that the images may actually have unique IDs, but they were so extremely similar, with the exact date, hour, and even minute for some. There are some days with S2 that have 2-3 images for the same day at extremely similar times, hence I thought they were identical. Here is an example that is only 15 mins apart. I haven't seen it with Landsat yet. This happens with nearly 75% of S2 images after 2018.
So it seems identical images aren't being generated, just multiple images that are sometimes only seconds apart. @dbuscombe-usgs do we want to keep these images that are minutes/seconds apart?
During the meeting today we addressed the confusion about "duplicate" images, or, more accurately, images captured within a few minutes of each other or less. We determined that it would be easiest to handle this with a post-processing script that removes images that are less than a few minutes apart. It was suggested that this script live in the scripts directory.
~~Write a script that removes all images within a designated time frame of other imagery. @dbuscombe-usgs maybe images within 5-10 minutes of each other should be removed?~~
I think the point of the script would be for the user to specify what time period they like, and it should go in SDS-tools. It wouldn't filter out images, but shorelines
And no, we don't want to remove any imagery. The SDS-tools script will remove duplicate shorelines. It will look at all the shorelines within X minutes (hours, days, whatever) of another one, and remove them
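A minimal sketch of that post-processing idea, assuming the extracted-shoreline dictionary holds parallel `dates` and `shorelines` lists (the key names follow coastsat's output convention, but treat them as assumptions here); the first detection in any cluster within the threshold is kept:

```python
from datetime import datetime, timedelta

def drop_close_shorelines(output, threshold=timedelta(minutes=10)):
    """Remove shorelines detected within `threshold` of the previously
    kept one. `output` is assumed to hold parallel lists under the
    'dates' (datetime objects) and 'shorelines' keys."""
    kept_dates, kept_lines = [], []
    last = None
    for date, line in sorted(zip(output["dates"], output["shorelines"])):
        if last is None or date - last > threshold:
            kept_dates.append(date)
            kept_lines.append(line)
            last = date  # only kept detections reset the clock
    return {"dates": kept_dates, "shorelines": kept_lines}

out = {
    "dates": [datetime(2018, 12, 3, 18, 39, 48),
              datetime(2018, 12, 3, 18, 40, 12),   # 24 s later -> dropped
              datetime(2018, 12, 10, 18, 39, 50)],
    "shorelines": ["sl_a", "sl_b", "sl_c"],  # placeholder shoreline data
}
print(drop_close_shorelines(out)["shorelines"])  # ['sl_a', 'sl_c']
```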
Description:
Users have found that duplicate images are being downloaded and used to extract shorelines. Downloading duplicate images has always been an issue with coastsat's download workflow, but the question is whether the de-duplication process only applies to the tifs while leaving duplicate jpegs in place. Users are also wondering if it's possible to modify the download workflow so that duplicates are detected before they are downloaded, so that duplicate images do not further slow down the downloads. Issues with duplicates are most prevalent with S2 imagery. This has led to significant delays in download times, impacting user experience and overall workflow efficiency.
Concerns:
Tasks:
Acceptance Criteria: