output formatting - Githubissues

tdivoll commented 1 year ago

Figure out how final outputs need to be saved as and where?

HDF5 is likely the best option!

tdivoll commented 1 year ago

HDF5 per each image with the floe props and array of masks Also should include image metadata, etc.

Tracker could still operate on the info saved in memory.

Add a flag for the user to either keep source images, or wipe source images at the end of the run.

cpaniaguam commented 1 year ago

@ellenbuckley Regarding persisting result objects (segmented images, floe property tables, time deltas,...) in hdf5, would it be reasonable to have a group for each result? I am envisioning something like this:

🗂️ HDF5.File: (read-only) myfile.h5
├─ 📂 floe_props
│  ├─ 🏷️ Description
│  ├─ 🔢 1
│  ├─ 🔢 2
│  ├─ 🔢 3
│  ├─ 🔢 4
│  ├─ 🔢 5
│  ├─ 🔢 6
│  ├─ 🔢 7
│  ├─ 🔢 8
│  └─ 🔢 features
├─ 📂 passtimes
│  ├─ 🏷️ Description
│  └─ 🔢 passtimes
├─ 📂 segmented_floes
│  ├─ 🏷️ Description 
│  ├─ 🔢 1
│  ├─ 🔢 2
│  ├─ 🔢 3
│  ├─ 🔢 4
│  ├─ 🔢 5
│  ├─ 🔢 6
│  ├─ 🔢 7
│  └─ 🔢 8
└─ 📂 timedeltas
   ├─ 🏷️ Description 
   └─ 🔢 timedeltas

Any feedback is greatly appreciated!

ellenbuckley commented 1 year ago

@cpaniaguam yes. i think the floe properties could be a description and a single dataset- a table of the properties for all floes within the image. but i like the other groups. and within the segmented floes group- each number is a floe array with the name of that dataset corresponding to some identifying name in the floe properties table. and then somewhere in the file having the name of the image, the time, the date produced, and other necessary metadata. we can work through this at our next meeting if that would be helpful!

cpaniaguam commented 1 year ago

@cpaniaguam yes. i think the floe properties could be a description and a single dataset- a table of the properties for all floes within the image. but i like the other groups. and within the segmented floes group- each number is a floe array with the name of that dataset corresponding to some identifying name in the floe properties table. and then somewhere in the file having the name of the image, the time, the date produced, and other necessary metadata. we can work through this at our next meeting if that would be helpful!

@ellenbuckley I think discussing this during the next meeting is a great idea!

danielmwatkins commented 1 year ago

@cpaniaguam in your file tree, are 1, 2, 3, 4, ... different source images or are they different floes?

If there is a separate file per image, then I'm not sure what the role of the timedeltas part is. Wouldn't there just be a single time stamp associated with each image?

cpaniaguam commented 1 year ago

@danielmwatkins In that structure the segmented_floes group contains segmented floe images according to the order they were processed as suggested by the satellite pass times (the time stamps), passtimes, which is in its own group. The timedeltas file has the computed time differences elapsed from image_i to image_i+1. I am guessing the time deltas might be useful for computing speeds? Floe drag/displacement estimates are produced by the pipeline's track task.

Here is an example of the contents the passtimes and timedeltas files might have:

julia> passtimes # time stamps fron the SOIT tool
6×3 Matrix{String}:
 "Date"        "Aqua pass time"  "Terra pass time"
 "05-04-2022"  "11:38:49"        "14:28:09"
 "05-05-2022"  "10:43:37"        "15:10:59"
 "05-06-2022"  "11:26:19"        "14:15:50"
 "05-07-2022"  "10:31:07"        "14:58:40"
 "05-08-2022"  "11:13:48"        "15:41:28"

julia> timedeltas # in minutes
9-element Vector{Float64}:
  169.0
 1215.0
  267.0
 1215.0
  170.0
 1215.0
  268.0
 1215.0
  268.0

Note that is not clear in what order the images should be processed (for the tracker). The pipeline sorts the time stamps and associates each to the correct image.

julia> better_passtimes # timedeltas are computed from these
10×2 DataFrame
 Row │ sat     pass_time
     │ String  DateTime
─────┼─────────────────────────────
   1 │ aqua    2022-05-04T11:38:49
   2 │ terra   2022-05-04T14:28:09
   3 │ aqua    2022-05-05T10:43:37
   4 │ terra   2022-05-05T15:10:59
   5 │ aqua    2022-05-06T11:26:19
   6 │ terra   2022-05-06T14:15:50
   7 │ aqua    2022-05-07T10:31:07
   8 │ terra   2022-05-07T14:58:40
   9 │ aqua    2022-05-08T11:13:48
  10 │ terra   2022-05-08T15:41:28

If neither is useful, we can choose not to include those output files. Just let us know what you'd like to have in the h5 file.

danielmwatkins commented 1 year ago

So to be clear, you are talking about a single hd5 file containing the analysis of multiple images, right? I think that we should simply have a "datetime" group with the calculated overpass times, something like

Myfile.hd5

Filename -- 1 filename 1 -- 2 filename 2 -- 3 filename 3 -- ...
Datetime -- 1 overpass time for filename 1 -- 2 overpass time for filename 2 -- etc
Floe properties -- 1 properties matrix for file 1 -- 2 properties matrix for file 2 -- etc
Segmented images -- 1 segmented geotiff file for file 1 -- 2 segmented geotiff file for file 2
etc

Personally I think the overpass time itself is the important thing, rather than the delta - the delta can be computed internal and is of course crucial for calculating velocity. Saving the deltas alone leads to problems because the deltas only have meaning if you know the exact reference time and the order of images. Whereas if you know the time stamp, there is no ambiguity, and it is trivial to recover the time delta from the dates.

On Thu, Jun 22, 2023 at 4:19 PM Carlos Paniagua @.***> wrote:

@danielmwatkins https://github.com/danielmwatkins In that structure the segmented_floes group contains segmented floe images according to the order they were processed as suggested by the satellite pass times (the time stamps), passtimes, which is in its own group. The timedeltas file has the computed time differences elapsed from image_i to image_i+1. I am guessing the time deltas might be useful for computing speeds? Floe drag/displacement estimates are produced by the pipeline's track task.

Here is an example of the contents the passtimes and timedeltas files might have:

julia> passtimes # time stamps fron the SOIT tool6×3 Matrix{String}: "Date" "Aqua pass time" "Terra pass time" "05-04-2022" "11:38:49" "14:28:09" "05-05-2022" "10:43:37" "15:10:59" "05-06-2022" "11:26:19" "14:15:50" "05-07-2022" "10:31:07" "14:58:40" "05-08-2022" "11:13:48" "15:41:28"

julia> timedeltas # in minutes9-element Vector{Float64}: 169.0 1215.0 267.0 1215.0 170.0 1215.0 268.0 1215.0 268.0

Note that is not clear in what order the images should be processed (for the tracker). The pipeline sorts the time stamps and associates each to the correct image.

julia> better_passtimes # timedeltas are computed from these10×2 DataFrame Row │ sat pass_time │ String DateTime ─────┼───────────────────────────── 1 │ aqua 2022-05-04T11:38:49 2 │ terra 2022-05-04T14:28:09 3 │ aqua 2022-05-05T10:43:37 4 │ terra 2022-05-05T15:10:59 5 │ aqua 2022-05-06T11:26:19 6 │ terra 2022-05-06T14:15:50 7 │ aqua 2022-05-07T10:31:07 8 │ terra 2022-05-07T14:58:40 9 │ aqua 2022-05-08T11:13:48 10 │ terra 2022-05-08T15:41:28

If neither is useful, we can choose not to include those output files. Just let us know what you'd like to have in the h5 file.

— Reply to this email directly, view it on GitHub https://github.com/WilhelmusLab/ice-floe-tracker-pipeline/issues/28, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6TAUKE47FC5PNGHWYR6F3XMSSELANCNFSM6AAAAAAYXNOI44 . You are receiving this because you were mentioned.Message ID: @.***>

tdivoll commented 1 year ago

One thought here: for the purposes of pushing out the results in an hdf5, would it be useful to join the timedeltas with the passtimes? I was thinking something like this abbreviated example from above. Would it be intuitive to most users that the timedelta is relevant to the previous row?

Row  │  sat          pass_time           timedelta
      │  String        DateTime
─────────────────────────────────────────────────────
   1  │ aqua    |  2022-05-04T11:38:49   |   0.0
   2  │ terra   |  2022-05-04T14:28:09   |   169.0
   3  │ aqua    |  2022-05-05T10:43:37   |   1215.0
   4  | terra   |  2022-05-05T15:10:59   |   267.0

danielmwatkins commented 1 year ago

Thinking about file sizes and running things in batches, and thinking about how NASA provides their analyzed satellite output - I think that we should have separate hd5 files for each image. That way if the job is terminated, or we want to run another batch of images, we don't have to rewrite the file every time. Also consider what the file size would be if we're running 20 years worth of data. You don't want to have a single hd5 file that's hundreds of gigabytes.

I'm still not convinced that the timedelta is useful as an output on its own, it's really the time stamp itself that is useful. The times are used in figuring out which images to search for possible matches, and for constructing trajectories. The timedelta is used for determining search thresholds for correlation and size difference, and for limiting the size of gaps. Let's say you're looking to find a match for a floe found in image

At first you look in image 6, so you adjust the settings to match delta_t(6) = time(6) - time(5). If you don't find a match in image 6, you move forward to image 7. The settings then need to be adjusted to time(7)-time(5) = delta_t(6) + delta_t(5). If using delta_t, you have to do cumulative sums each time, whereas if using just the time stamps, you only need to use the two time stamps, regardless of the number of images between. All the information you need is already contained in the image time stamps - having a separate delta_t vector is redundant.

What might make most sense is to have an output file with metadata. That could be a single dataframe with columns "datetime", "satellite", "true_color_filename", "false_color_filename", "output_filename", "number_floes_detected", where each row is a different analyzed image, and an output folder that has hd5 files for each analyzed image.

The tracked output would be different. The maximum length of trajectories is equal to the number of images within each summer season, hence there can be a single file per year. For consistency this can also be an hd5 file, though it would also work as a large CSV file. The tracked output would have a shared index with the time stamps of the original images, then each tracked floe would have an array with columns for the floe properties (centroid, axes, major axis orientation, rotation since previous image, original floe id). The original floe id would be some way to look up the image the floe was found in so you could find the floe shape in the floe library.

On Fri, Jun 23, 2023 at 8:34 AM Timothy Divoll @.***> wrote:

One thought here: for the purposes of pushing out the results in an hdf5, would it be useful to join the timedeltas with the passtimes? I was thinking something like this abbreviated example from above. Would it be intuitive to most users that the timedelta is relevant to the previous row?

Row │ sat pass_time timedelta │ String DateTime ────────────────────────────────── 1 │ aqua | 2022-05-04T11:38:49 | 0.0 2 │ terra | 2022-05-04T14:28:09 | 169.0 3 │ aqua | 2022-05-05T10:43:37 | 1215.0 4 | terra | 2022-05-05T15:10:59 | 267.0

— Reply to this email directly, view it on GitHub https://github.com/WilhelmusLab/ice-floe-tracker-pipeline/issues/28, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6TAUJMZKC6GX5IAXSUVB3XMWELRANCNFSM6AAAAAAYXNOI44 . You are receiving this because you were mentioned.Message ID: @.***>

cpaniaguam commented 1 year ago

Trying to come up with a summary of the conversation on Slack on the matter

File structure

[ ] Suggested structure for a data set (output of processing one basic set of input data -- truecolor/reflectance image pair, pass time) :card_index_dividers: HDF5.File: (read-only) myfile.h5 ├─ :label: filename reflectance ├─ :label: filename true color ├─ :label: IFT version ├─ :label: crs ├─ :label: reference: https://doi.org/10.1016/j.rse.2019.111406 ├─ :label: contact: mmwilhelmus@brown.edu ├─ :open_file_folder: index │ ├─ time (unix format) │ ├─ x (1d array) │ ├─ y (1d array) │ ├─ latitude (2d array) │ ├─ longitude (2d array) ├─ :open_file_folder: floe properties │ ├─ :1234: properties (matrix) │ ├─ :label: description of properties- ref the regionprops function in skimage
│ ├─ :1234: “column_names”: list of column names for the properties table- array of strings │ ├─ :1234: “units”: list of units corresponding to column names (array of strings) │ ├─ :1234: labeled image │ ├─ :label: description of labeled image- numbers correspond to floes in prop table
[ ] File name: YYYYmmddHHMMSS.sat.labeled_image.h5 where sat = ["aqua", ''terra", ....]
Floe props table

column names / units

"area" => "sqkm"
"latitude" # centroid
"longitude" # centroid
"convex_area" => "sqkm"
"major_axis" => "km"
"minor_axis" => "km"
"orientation" => "radians"
"perimeter" => "km"

[ ] Add note to perimeter: Perimeter of object which approximates the contour as a line through the centers of border pixels using a 4-connectivity.

cpaniaguam commented 1 year ago

Sample script to generate crs, x, y, latitude, logitude

import numpy as np
from pyproj import Transformer
import rasterio

image_loc = '../data_test/20170501.aqua.reflectance.250m.tiff'
im = rasterio.open(image_loc)
crs = im.crs
print('Coordinate reference system code: ', im.crs)

nrows, ncols = im.shape
rows, cols = np.meshgrid(np.arange(nrows), np.arange(ncols))
xs, ys = rasterio.transform.xy(im.transform, rows, cols)

# X and Y are the 1D index vectors
X = np.array(xs)[:,0] 
Y = np.array(ys)[0,:]

ps_to_ll = Transformer.from_crs(im.crs, 'WGS84', always_xy=True)
lons, lats = ps_to_ll.transform(np.ravel(xs), np.ravel(ys))

# longitude and latitude are 2D index arrays
longitude = np.reshape(lons, (nrows, ncols))
latitude = np.reshape(lats, (nrows, ncols))

WilhelmusLab / ice-floe-tracker-pipeline

output formatting #28

File structure

Floe props table