Closed: tdivoll closed this issue 1 year ago
One HDF5 file per image, containing the floe properties and an array of masks. It should also include image metadata, etc.
The tracker could still operate on the info saved in memory.
Add a flag letting the user either keep the source images or wipe them at the end of the run.
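A minimal sketch of what such a flag could look like, using argparse. The flag name `--delete-source` and the default (keep the images) are assumptions for illustration, not the pipeline's actual interface:

```python
import argparse

# Hypothetical CLI sketch: flag name and default are assumptions,
# not the pipeline's real interface.
parser = argparse.ArgumentParser(description="ice floe pipeline run")
parser.add_argument(
    "--delete-source",
    action="store_true",
    help="wipe the source images at the end of the run (default: keep them)",
)

args = parser.parse_args(["--delete-source"])
print(args.delete_source)  # True when the flag is passed
```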
@ellenbuckley Regarding persisting result objects (segmented images, floe property tables, time deltas, ...) in HDF5, would it be reasonable to have a group for each result? I am envisioning something like this:
🗂️ HDF5.File: (read-only) myfile.h5
├─ 📂 floe_props
│  ├─ 🏷️ Description
│  ├─ 🔢 1
│  ├─ 🔢 2
│  ├─ 🔢 3
│  ├─ 🔢 4
│  ├─ 🔢 5
│  ├─ 🔢 6
│  ├─ 🔢 7
│  ├─ 🔢 8
│  └─ 🔢 features
├─ 📂 passtimes
│  ├─ 🏷️ Description
│  └─ 🔢 passtimes
├─ 📂 segmented_floes
│  ├─ 🏷️ Description
│  ├─ 🔢 1
│  ├─ 🔢 2
│  ├─ 🔢 3
│  ├─ 🔢 4
│  ├─ 🔢 5
│  ├─ 🔢 6
│  ├─ 🔢 7
│  └─ 🔢 8
└─ 📂 timedeltas
   ├─ 🏷️ Description
   └─ 🔢 timedeltas
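For concreteness, here is a sketch of that layout written with h5py (Python shown for illustration; the pipeline itself is in Julia, and all dataset contents below are placeholder values, not real results):

```python
import h5py
import numpy as np

# Sketch only: creates the group layout proposed in the tree above.
# All dataset contents are placeholders.
with h5py.File("myfile.h5", "w") as f:
    for group in ("floe_props", "passtimes", "segmented_floes", "timedeltas"):
        g = f.create_group(group)
        g.attrs["Description"] = f"placeholder description for {group}"

    for i in range(1, 9):  # numbered datasets, as in the tree
        f["floe_props"].create_dataset(str(i), data=np.random.rand(4))
        f["segmented_floes"].create_dataset(
            str(i), data=np.zeros((8, 8), dtype=np.uint8)
        )

    f["floe_props/features"] = np.array([b"area", b"perimeter"])
    f["passtimes/passtimes"] = np.arange(10.0)   # placeholder time stamps
    f["timedeltas/timedeltas"] = np.arange(9.0)  # placeholder deltas
```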
Any feedback is greatly appreciated!
@cpaniaguam yes. i think the floe properties could be a description and a single dataset- a table of the properties for all floes within the image. but i like the other groups. and within the segmented floes group- each number is a floe array with the name of that dataset corresponding to some identifying name in the floe properties table. and then somewhere in the file having the name of the image, the time, the date produced, and other necessary metadata. we can work through this at our next meeting if that would be helpful!
@ellenbuckley I think discussing this during the next meeting is a great idea!
@cpaniaguam in your file tree, are 1, 2, 3, 4, ... different source images or are they different floes?
If there is a separate file per image, then I'm not sure what the role of the timedeltas part is. Wouldn't there just be a single time stamp associated with each image?
@danielmwatkins In that structure, the segmented_floes group contains segmented floe images in the order they were processed, as suggested by the satellite pass times (the time stamps) in passtimes, which is in its own group. The timedeltas file has the computed time differences elapsed from image_i to image_i+1. I am guessing the time deltas might be useful for computing speeds? Floe drift/displacement estimates are produced by the pipeline's track task.
Here is an example of the contents the passtimes and timedeltas files might have:
julia> passtimes # time stamps from the SOIT tool
6×3 Matrix{String}:
 "Date"        "Aqua pass time"  "Terra pass time"
 "05-04-2022"  "11:38:49"        "14:28:09"
 "05-05-2022"  "10:43:37"        "15:10:59"
 "05-06-2022"  "11:26:19"        "14:15:50"
 "05-07-2022"  "10:31:07"        "14:58:40"
 "05-08-2022"  "11:13:48"        "15:41:28"
julia> timedeltas # in minutes
9-element Vector{Float64}:
169.0
1215.0
267.0
1215.0
170.0
1215.0
268.0
1215.0
268.0
Note that it is not clear in what order the images should be processed (for the tracker). The pipeline sorts the time stamps and associates each with the correct image.
julia> better_passtimes # timedeltas are computed from these
10×2 DataFrame
 Row │ sat     pass_time
     │ String  DateTime
─────┼─────────────────────────────
   1 │ aqua    2022-05-04T11:38:49
   2 │ terra   2022-05-04T14:28:09
   3 │ aqua    2022-05-05T10:43:37
   4 │ terra   2022-05-05T15:10:59
   5 │ aqua    2022-05-06T11:26:19
   6 │ terra   2022-05-06T14:15:50
   7 │ aqua    2022-05-07T10:31:07
   8 │ terra   2022-05-07T14:58:40
   9 │ aqua    2022-05-08T11:13:48
  10 │ terra   2022-05-08T15:41:28
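As a cross-check, the timedeltas vector can be recovered from the sorted time stamps with nothing but the standard library (a sketch; rounding to the nearest whole minute reproduces the values shown above):

```python
from datetime import datetime

# Sorted pass times from the better_passtimes table above
passtimes = [
    "2022-05-04T11:38:49", "2022-05-04T14:28:09",
    "2022-05-05T10:43:37", "2022-05-05T15:10:59",
    "2022-05-06T11:26:19", "2022-05-06T14:15:50",
    "2022-05-07T10:31:07", "2022-05-07T14:58:40",
    "2022-05-08T11:13:48", "2022-05-08T15:41:28",
]
stamps = [datetime.fromisoformat(t) for t in passtimes]

# Minutes elapsed between consecutive passes, rounded to whole minutes
timedeltas = [
    round((b - a).total_seconds() / 60) for a, b in zip(stamps, stamps[1:])
]
print(timedeltas)  # [169, 1215, 267, 1215, 170, 1215, 268, 1215, 268]
```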
If neither is useful, we can choose not to include those output files. Just let us know what you'd like to have in the h5 file.
So to be clear, you are talking about a single HDF5 file containing the analysis of multiple images, right? I think that we should simply have a "datetime" group with the calculated overpass times, something like
Myfile.h5
Personally, I think the overpass time itself is the important thing, rather than the delta: the delta can be computed internally and is of course crucial for calculating velocity. Saving the deltas alone leads to problems, because the deltas only have meaning if you know the exact reference time and the order of the images. Whereas if you know the time stamps, there is no ambiguity, and it is trivial to recover the time deltas from the dates.
One thought here: for the purposes of pushing out the results in an HDF5 file, would it be useful to join the timedeltas with the passtimes? I was thinking of something like this abbreviated example from above. Would it be intuitive to most users that the timedelta is relative to the previous row?
 Row │ sat     pass_time            timedelta
     │ String  DateTime             Float64
─────┼────────────────────────────────────────
   1 │ aqua    2022-05-04T11:38:49      0.0
   2 │ terra   2022-05-04T14:28:09    169.0
   3 │ aqua    2022-05-05T10:43:37   1215.0
   4 │ terra   2022-05-05T15:10:59    267.0
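A stdlib sketch of that join, with each delta attached to the later of the two passes (the first row gets 0.0; column names follow the table above):

```python
from datetime import datetime

# (sat, pass_time) rows from the abbreviated example above
passes = [
    ("aqua",  "2022-05-04T11:38:49"),
    ("terra", "2022-05-04T14:28:09"),
    ("aqua",  "2022-05-05T10:43:37"),
    ("terra", "2022-05-05T15:10:59"),
]

rows, prev = [], None
for sat, ts in passes:
    t = datetime.fromisoformat(ts)
    # timedelta in minutes relative to the previous row; 0.0 for the first row
    delta = 0.0 if prev is None else float(round((t - prev).total_seconds() / 60))
    rows.append({"sat": sat, "pass_time": ts, "timedelta": delta})
    prev = t

for r in rows:
    print(r)
```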
Thinking about file sizes and running things in batches, and about how NASA provides their analyzed satellite output, I think that we should have separate HDF5 files for each image. That way, if the job is terminated or we want to run another batch of images, we don't have to rewrite the file every time. Also consider what the file size would be if we're running 20 years' worth of data: you don't want a single HDF5 file that's hundreds of gigabytes.
I'm still not convinced that the timedelta is useful as an output on its own; it's really the time stamp itself that is useful. The times are used to figure out which images to search for possible matches, and to construct trajectories. The timedelta is used for determining search thresholds for correlation and size difference, and for limiting the size of gaps. Let's say you're looking to find a match for a floe found in image
What might make the most sense is to have an output file with metadata. That could be a single dataframe with columns "datetime", "satellite", "true_color_filename", "false_color_filename", "output_filename", "number_floes_detected", where each row is a different analyzed image, plus an output folder that has HDF5 files for each analyzed image.
The tracked output would be different. The maximum length of trajectories is equal to the number of images within each summer season, hence there can be a single file per year. For consistency this can also be an HDF5 file, though it would also work as a large CSV file. The tracked output would have a shared index with the time stamps of the original images, then each tracked floe would have an array with columns for the floe properties (centroid, axes, major axis orientation, rotation since previous image, original floe id). The original floe id would be some way to look up the image the floe was found in, so you could find the floe shape in the floe library.
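The metadata table described above could be written as a CSV with the stdlib alone. A sketch using those column names (all row values below are placeholders, and the file names are hypothetical examples following patterns seen elsewhere in this thread):

```python
import csv
import io

# Columns from the proposal above; each row describes one analyzed image.
fields = ["datetime", "satellite", "true_color_filename",
          "false_color_filename", "output_filename", "number_floes_detected"]

# Placeholder row: names and counts are illustrative, not real results.
rows = [{
    "datetime": "2022-05-04T11:38:49",
    "satellite": "aqua",
    "true_color_filename": "20220504.aqua.truecolor.250m.tiff",
    "false_color_filename": "20220504.aqua.reflectance.250m.tiff",
    "output_filename": "20220504113849.aqua.labeled_image.h5",
    "number_floes_detected": 42,
}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```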
Trying to come up with a summary of the conversation on Slack on the matter:
YYYYmmddHHMMSS.sat.labeled_image.h5
where sat = ["aqua", "terra", ...]
"area" => "sqkm"
"latitude" # centroid
"longitude" # centroid
"convex_area" => "sqkm"
"major_axis" => "km"
"minor_axis" => "km"
"orientation" => "radians"
"perimeter" => "km"
Sample script to generate crs, x, y, latitude, longitude:
import numpy as np
from pyproj import Transformer
import rasterio

image_loc = '../data_test/20170501.aqua.reflectance.250m.tiff'
im = rasterio.open(image_loc)
crs = im.crs
print('Coordinate reference system code: ', im.crs)

nrows, ncols = im.shape
# indexing='ij' keeps the (nrows, ncols) layout so the reshape below is valid
rows, cols = np.meshgrid(np.arange(nrows), np.arange(ncols), indexing='ij')
xs, ys = rasterio.transform.xy(im.transform, rows, cols)
xs, ys = np.array(xs), np.array(ys)  # shape (nrows, ncols)

# X and Y are the 1D coordinate vectors: x varies along columns, y along rows
X = xs[0, :]
Y = ys[:, 0]

ps_to_ll = Transformer.from_crs(im.crs, 'WGS84', always_xy=True)
lons, lats = ps_to_ll.transform(np.ravel(xs), np.ravel(ys))

# longitude and latitude are 2D coordinate arrays
longitude = np.reshape(lons, (nrows, ncols))
latitude = np.reshape(lats, (nrows, ncols))
Figure out how and where the final outputs need to be saved.
HDF5 is likely the best option!