ESA-PhiLab / Major-TOM

Expandable Datasets for Earth Observation
https://huggingface.co/Major-TOM

Downloaded tif files are black #8

Open blumenstiel opened 2 months ago

blumenstiel commented 2 months ago

I downloaded some data and noticed that some S2 data is completely black, e.g., grid cell 207D_1378R or 438U_1009R. The S1 data looks fine.

I used the filter_download function provided in this repo and tested both with and without by_row. Opening the image directly with Image.open(BytesIO(table[col][0].as_py())).show() gave the same result.

The tif files do not include a FillValue. I assume 0 is used for NaN values?

Is it possible that some data got corrupted during the download or upload to HF?

mikonvergence commented 2 months ago

Hi @blumenstiel - thanks for bringing this up! I had a look too and it does seem like these two cells are indeed corrupted.

We made no changes to the original values, so like in the original Sentinel-2 data, 0 should represent no data (as far as I'm aware).

It is somewhat unlikely that the corruption occurred during the upload, so we will investigate soon. If needed we can update the corresponding parquet file.

Are there more files that are completely black that you found?

blumenstiel commented 2 months ago

Hi @mikonvergence, thanks for looking into it!

I checked another 100 random samples and got 14 corrupted files:

,grid_cell
0,171D_798L
1,160D_805L
2,143D_811L
3,142D_810L
4,142D_803L
5,138D_800L
6,133D_803L
7,128D_793L
8,117D_811L
9,113D_786L
10,110D_813L
11,107D_796L
12,94D_810L
13,451U_259L

So I assume that this potentially affects 10-20% of the grid cells. I did not manually check the samples, but based on my code, each of these grid cells should have only NaN values in either S1 or S2.

Maybe add a quick check after downloading/before uploading to your processing scripts?
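
A minimal sketch of what such a check could look like, assuming (as discussed above) that 0 marks no-data in the S2 tiles. The function names are hypothetical and not part of the Major-TOM repo; the helper works on any flat iterable of pixel values so it does not depend on a particular image library.

```python
from typing import Iterable

# Assumption from this thread: 0 is the no-data value in the S2 tiles.
NODATA_VALUE = 0


def nodata_fraction(pixels: Iterable[int], nodata_value: int = NODATA_VALUE) -> float:
    """Return the share of pixels equal to nodata_value, between 0.0 and 1.0."""
    total = 0
    empty = 0
    for value in pixels:
        total += 1
        if value == nodata_value:
            empty += 1
    if total == 0:
        raise ValueError("no pixels to inspect")
    return empty / total


def is_fully_nodata(pixels: Iterable[int], nodata_value: int = NODATA_VALUE) -> bool:
    """True when every pixel is no-data, i.e. the tile would render black."""
    return nodata_fraction(pixels, nodata_value) == 1.0
```

For a band loaded as a NumPy array, something like is_fully_nodata(band.ravel().tolist()) should work; a tile flagged this way could be skipped or logged before it is written to disk.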

aliFrancis commented 2 months ago

Hi, we're looking into this! Thanks for bringing it to our attention.

Doing some digging, there is a small percentage of S2 tiles (1.3%) which have 100% no-data (==0). I guess you got very unlucky, or something about your search made them more likely? Regardless, not sure why this has happened in the first place and why it got past our checks. Seems that all the IDs you list here have nodata==1.0 in the metadata (except the last grid tile, which I manually verified and it has an image over the sea, albeit a dark one). So, for now, I recommend explicitly filtering out tiles with 100% nodata percentage (the value is a ratio between 0-1, as sometimes we get images that are partially nodata).
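
A rough sketch of the recommended filter, written against plain dictionaries rather than the actual metadata file. The column name "nodata" and the helper name are assumptions for illustration; only the first grid cell comes from this thread, the other two rows are made up.

```python
def drop_fully_nodata(rows, nodata_key="nodata", threshold=1.0):
    """Keep rows whose no-data ratio is strictly below threshold.

    The ratio is assumed to lie between 0 and 1, so the default
    threshold of 1.0 removes only tiles that are 100% no-data.
    """
    return [row for row in rows if row[nodata_key] < threshold]


rows = [
    {"grid_cell": "171D_798L", "nodata": 1.0},   # reported above as fully no-data
    {"grid_cell": "example_a", "nodata": 0.25},  # hypothetical, partially no-data
    {"grid_cell": "example_b", "nodata": 0.0},   # hypothetical, fully valid
]

kept = drop_fully_nodata(rows)
print([row["grid_cell"] for row in kept])
```

Lowering threshold (for example to 0.5) would also drop tiles that are mostly empty, which may be useful since some images are only partially no-data.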

As I say, thanks for bringing this to our attention, we will look into correcting/removing these!!

blumenstiel commented 2 months ago

Thank you @aliFrancis! I forgot to look at the no-data column; this explains a lot.