Preprocessing of data - Githubissues

UNSAT3D / unsat

Input/Output tools for the UNSAT project

https://unsat3d.github.io/unsat/

Apache License 2.0


Preprocessing of data #1

Closed APJansen closed 7 months ago

APJansen commented 11 months ago

We will need to cut the data up into cubes (or potentially boxes with a different height but the same dimensions in the horizontal plane). This can be done with numpy alone. Some considerations we'll need to take into account:

We don't want to include the outside region, where there is no sample.
We may either do this on the fly during training, or in advance, saving the data.
In the latter case, we have to consider how much overlap we do or don't want.
The size of the regions: small enough to fit in the working memory, big enough to capture enough information.
We may end up wanting 2D horizontal slices instead.
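
As a rough illustration of the cutting step, here is a minimal numpy sketch. The function name `extract_blocks`, the `(z, y, x)` axis order, and the `size` and `stride` parameters are assumptions made for this example, not part of the project. A stride equal to the block size gives no overlap, a smaller stride gives overlapping blocks, and anything that does not fit at the far edges is simply dropped.

```python
import itertools

import numpy as np


def extract_blocks(volume, size, stride=None):
    """Cut a 3D array (assumed axis order: z, y, x) into blocks of shape `size`.

    `stride` is the step between block origins along each axis.
    stride == size means no overlap; a smaller stride means overlap.
    Partial blocks at the far edges are dropped.
    """
    if stride is None:
        stride = size
    # All valid start indices along each axis.
    starts = [
        range(0, dim - block + 1, step)
        for dim, block, step in zip(volume.shape, size, stride)
    ]
    blocks = [
        volume[z:z + size[0], y:y + size[1], x:x + size[2]]
        for z, y, x in itertools.product(*starts)
    ]
    # np.stack raises ValueError if no block fits in the volume.
    return np.stack(blocks)


# Hypothetical usage on a small synthetic volume.
volume = np.arange(4 * 6 * 6).reshape(4, 6, 6)
cubes = extract_blocks(volume, size=(2, 3, 3))
overlapping = extract_blocks(volume, size=(2, 3, 3), stride=(2, 3, 1))
```

Passing a `size` with a different first entry gives boxes of a different height with the same horizontal dimensions, and a `size` of `(1, h, w)` gives 2D horizontal slices with a leading axis of length one.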

This function may be useful to convert the tif files to numpy arrays.

PabRod commented 11 months ago

Hi @APJansen, just to make sure: what do you mean by "the outside region"? Would a "cube" like the red one in the figure be acceptable? Notice the upper right corner; although it contains data, there is no soil there.

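[Figure: scan image with a candidate "cube" marked in red; its upper right corner contains data but no soil]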

APJansen commented 11 months ago

No, that's what I wanted to avoid.

Although I suppose it is something to consider as well. The downside of including such a box is that we would need an additional class, and the model would have to learn something rather useless. The upsides are (minor) that sampling the cubes will be simpler and (perhaps major?) that it is easier to also cover the edges of the scan.

Anyway, I think it would be good to at least have the option to avoid such regions.

PabRod commented 11 months ago

That was my guess. Anyway, filtering out those regions poses two problems that may be worse than adding an additional data label. Namely:

1. Automatic detection of the region of interest (without filtering out air inside the sample)
2. 3D tessellation of a curved shape with parallelepipeds (probably not that hard if 1 is achieved and we don't care about dropping some close-to-the-border data)

Do you have any idea on how to efficiently tackle those problems?

APJansen commented 11 months ago

For the first point, I would say let's be wasteful to start with: chop off the first and last 200 horizontal slices, so that in the remaining slices the scan area is more or less constant. Then estimate its radius and make sure the corners of each cube are within that radius from the center of the scan. Later we can be more precise, but for now the goal is just to have any pipeline by which we can feed images into a model so we can do our first experiments. So let's not worry about the second point yet either.
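
A rough sketch of that wasteful first pass, under some loud assumptions: a boolean `mask` of shape `(z, y, x)` that is True inside the scanned sample is already available (how to obtain it is not covered here), the scan cross-section is roughly circular, and the names `estimate_circle` and `corners_inside` are made up for this example. The radius is taken from the smallest per-slice area, which is deliberately conservative.

```python
import numpy as np


def estimate_circle(mask, n_skip=200):
    """Estimate center and radius of the scan area from a boolean mask.

    mask: boolean array (z, y, x), True inside the sample (assumed given).
    n_skip: number of horizontal slices dropped at the top and the bottom.
    """
    inner = mask[n_skip:mask.shape[0] - n_skip]
    if inner.shape[0] == 0:
        raise ValueError("n_skip removes every slice")
    # Center: centroid of all pixels that are sample in at least one slice.
    ys, xs = np.nonzero(inner.any(axis=0))
    center = (ys.mean(), xs.mean())
    # Radius: from the smallest slice area, treating the area as a disc.
    areas = inner.sum(axis=(1, 2))
    radius = float(np.sqrt(areas.min() / np.pi))
    return center, radius


def corners_inside(y0, x0, side, center, radius):
    """Check that all four horizontal corners of a square lie in the circle."""
    cy, cx = center
    last = side - 1
    corners = [(y0, x0), (y0, x0 + last), (y0 + last, x0), (y0 + last, x0 + last)]
    return all(np.hypot(y - cy, x - cx) <= radius for y, x in corners)


# Hypothetical usage on a synthetic cylinder of radius 15 pixels.
yy, xx = np.mgrid[:41, :41]
disc = (yy - 20) ** 2 + (xx - 20) ** 2 <= 15 ** 2
mask = np.broadcast_to(disc, (10, 41, 41)).copy()
center, radius = estimate_circle(mask, n_skip=2)
keep_central = corners_inside(15, 15, 11, center, radius)
keep_corner = corners_inside(0, 0, 11, center, radius)
```

Because the check only looks at the horizontal corners, it relies on the scan area staying more or less constant over the remaining slices. Air inside the sample shrinks the per-slice area and therefore the estimated radius, so this errs on the side of discarding data.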