Consolidate extraction end points

kvantricht commented 2 months ago

I noticed that for point extractions we only split by S2 tile: https://github.com/WorldCereal/worldcereal-classification/blob/16e50a3fe19d9e6923417ccfc48af74e2e74f8b8/scripts/extractions/point_extractions/point_extractions.py#L241

While in the patch extractions we also split by year: https://github.com/WorldCereal/worldcereal-classification/blob/main/scripts/extractions/patch_extractions/extract.py#L98

Furthermore we're computing the mean valid_time to determine extraction window. In case samples span multiple years this is unsafe.

It was already addressed in the patch_extractions\extract.py. After discussing with @GriffinBabe we think it's a good idea to step away from separate patch and point extraction folders and put the point_extractions.py alongside the other extract scripts in patch_extractions, reusing as much as possible joint functionality.

As a final step we can get rid of patch_extractions folder alltogether and keep things clear and simple for the user.

VincentVerelst commented 6 days ago

Proposed folder structure:

Move extract_common.py to src, more specifically to worldcereal.openeo.extract_common

worldcereal/worldcereal-classification/scripts/extractions/ should then look like:

├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_patch.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions └── extract_point.py

kvantricht commented 6 days ago

Some questions:

with extract_common in src, is it still required to have the _common suffix? worldcereal.openeo.extract could be an option too?
for patch extractions, the entry script would be extract_patch.py? With the proposed structure, it's still a bit hard for a user to understand all these extract_ scripts in the patch_extractions folder, no?

VincentVerelst commented 6 days ago

with extract_common in src, is it still required to have the _common suffix? worldcereal.openeo.extract could be an option too?

Agreed, I can change the naming

for patch extractions, the entry script would be extract_patch.py? With the proposed structure, it's still a bit hard for a user to understand all these extract_ scripts in the patch_extractions folder, no?

Agreed again, after some further thinking, the following makes sense (it's I think the original idea :) ): ├── extract.py ├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions └── extract_point.py

extract.py will be the main extraction script that can either call one of the patch extractions (meteo, optical, sar or worldcereal) or it can call the point extractions.

kvantricht commented 6 days ago

Starting to look good. What's maybe still confusing to me is that for patch extractions, the different scripts actually contain data cube creation function for specific versions of the patch extractor, with extract_worldcereal.py the one doing preprocessed patch extractions. On the point side, extract_point.py I think is the point equivalent of this, as it also gets the worldcereal preprocessed extractions. Shouldn't we align these then? Like:

├── extract.py ├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions │ ├── extract_worldcereal.py

Or is it a bad idea to have duplicate names? Just want to have both of these as consistently named as possible.

VincentVerelst commented 6 days ago

Starting to look good. What's maybe still confusing to me is that for patch extractions, the different scripts actually contain data cube creation function for specific versions of the patch extractor, with extract_worldcereal.py the one doing preprocessed patch extractions. On the point side, extract_point.py I think is the point equivalent of this, as it also gets the worldcereal preprocessed extractions. Shouldn't we align these then? Like:

├── extract.py ├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions │ ├── extract_worldcereal.py

Or is it a bad idea to have duplicate names? Just want to have both of these as consistently named as possible.

Hmm, the current naming is based on this class:

class ExtractionCollection(Enum):
    """Collections that can be extracted in the extraction scripts."""

    SENTINEL1 = "SENTINEL1"
    SENTINEL2 = "SENTINEL2"
    METEO = "METEO"
    WORLDCEREAL = "WORLDCEREAL"
    POINTS = "POINTS"

Personally, I'd avoid having two extract_worldcereal scripts, but we can maybe think of a more descriptive name for the worldcereal (and maybe point) script(s).

kvantricht commented 6 days ago

but what if you'd want for example to extract points for meteo?

VincentVerelst commented 6 days ago

Ah, so it should also be possible to extract points for one collection separately?

kvantricht commented 6 days ago

Not necessarily at the moment, I'm thinking of a setup that is future-proof so to say. Does it make sense?

VincentVerelst commented 6 days ago

Ok, makes sense, then I'll adjust the naming!

VincentVerelst commented 2 days ago

Should add a kwarg to worldcereal_preprocessed_inputs that can trigger validate_temporal_context

VincentVerelst commented 1 day ago

mypy doesn't allow for two modules with the same name: _scripts/extractions/point_extractions/extract_worldcereal.py: error: Duplicate module named "extract_worldcereal" (also at "./scripts/extractions/patch_extractions/extractworldcereal.py"

So will change the name to extract_point.py again.

WorldCereal / worldcereal-classification

Consolidate extraction end points #118