Polygonize improvements

cholmes commented 1 month ago

A few improvements mentioned in #27:

added a min_size parameter to filter out all the little slivers that usually aren't real fields. Default value of 500, which is about 5 pixels of sentinel 2. Requires an area calculation with reprojection which does slow things down a good bit, but seems like it may be worth it as the results seem a good bit better.
Added an 'area' parameter. More of a side effect of the filtering, but will be useful.
Set a default simplification value of '15', as it seems to have pretty reasonable results.

Can see the difference, some places there's lots of these little ones that are just like a pixel big.

slivers-ftw

Would be good to also figure out the places where there's a one or two pixel hole in the polygon, seems like there should be something in shapely for it but I don't know it off hand. It does feel like a more sophisticated polygonization algorithm could do better, but it's pretty cool that we can get some fairly reasonable results just from rasterio.

I haven't done extensive testing on the performance of the area calculation, but when I did first add it in I definitely noticed. But overall it's not egregiously long, so may be worth it. If it was just to add an attribute that not everyone cares about we could make it an option. But filtering out these tiny polygons seems pretty useful and for that I'm pretty sure you need the area calculation w/ reprojection (maybe could do a faster filtering out pass without projecting to 6933? But probably not worth the effort), so I lean towards just doing it for everything.

@calebrob6 - I still have your warning about values greater than 1 for simplification in (but commented out). Curious when that would happen, since all the input will be sentinel 2, in UTM, so seems like for now we could count on the units always being meters? I do think it could be interesting to evolve this to be a more general 'simplification of satellite imagery inference' thing, but that feels not immediate, and seems reasonable to just assume sentinel 2, and then if we start trying other imagery sources we can adjust as needed.

cholmes commented 1 month ago

This should have readme updates before merging... but I may just continue to work on it adding in more improvements from the list in #27

calebrob6 commented 1 month ago

Curious when that would happen, since all the input will be sentinel 2

Yeah it won't unless someone reprojects into EPSG:4326 (maybe downloading from GEE or something). I think it is safe to remove! (the check is an instinct from other projects where sometimes our inputs are EPSG:4326 and sometimes local UTM)

calebrob6 commented 1 month ago

I haven't done extensive testing on the performance of the area calculation, but when I did first add it in I definitely noticed. But overall it's not egregiously long, so may be worth it.

If it gets excessive then I think we can switch to a pyproj based approach-- I think I benchmarked fiona's transform and it was way slower but I can't find that.

Update: I remade it https://gist.github.com/calebrob6/2a2705027764d17fdbeb2b7a4cf4887c -- pyproj.Transformer is atleast ~2x faster.

But also, if we're already assuming that the inputs are going to be in the correct UTM projection, then it is probably fine to not reproject to 6933 (or anything else) to compute areas.

hannah-rae commented 1 month ago

@cholmes these changes LGTM! I agree it would be nice to find a way to fill the holes - though sometimes there are real holes in a field, it would be nice to have the option (just like the area filtering).

cholmes commented 1 month ago

I agree it would be nice to find a way to fill the holes - though sometimes there are real holes in a field, it would be nice to have the option (just like the area filtering).

Oh, didn't see this before my comment. Didn't realize there are valid 'holes in fields' use cases, but actually I guess that does make sense. I think I lean towards the on/off option that does Caleb's operation for now, and then ticket that we'd ideally like control just like in the 'simplify' case.

cholmes commented 1 month ago

Update: I remade it https://gist.github.com/calebrob6/2a2705027764d17fdbeb2b7a4cf4887c -- pyproj.Transformer is atleast ~2x faster.

Oh cool, I'll try to incorporate the code.

But also, if we're already assuming that the inputs are going to be in the correct UTM projection, then it is probably fine to not reproject to 6933 (or anything else) to compute areas.

Ah, good to know - I wasn't sure, and this bit me in the past. I do think eventually we may want to handle other projections, so might be good to keep the code around for our future selves or future others.

cholmes commented 4 weeks ago

You want to review this @m-mohr? I think we should just get it merged and not wait on the holes stuff - I made an issue (#39) for it.

Then you could likely take on #27 and that one.

m-mohr commented 4 weeks ago

Merging for now, this part of the CLI needs further work anyway. O(n³) complexity is not ideal... :-)

fieldsoftheworld / ftw-baselines

Polygonize improvements #31