Open kvantricht opened 2 months ago
Proposed folder structure:
Move extract_common.py
to src, more specifically to worldcereal.openeo.extract_common
worldcereal/worldcereal-classification/scripts/extractions/
should then look like:
├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_patch.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions └── extract_point.py
Some questions:
extract_common
in src
, is it still required to have the _common
suffix? worldcereal.openeo.extract
could be an option too?extract_patch.py
? With the proposed structure, it's still a bit hard for a user to understand all these extract_
scripts in the patch_extractions
folder, no?
- with
extract_common
insrc
, is it still required to have the_common
suffix?worldcereal.openeo.extract
could be an option too?
Agreed, I can change the naming
- for patch extractions, the entry script would be
extract_patch.py
? With the proposed structure, it's still a bit hard for a user to understand all theseextract_
scripts in thepatch_extractions
folder, no?
Agreed again, after some further thinking, the following makes sense (it's I think the original idea :) ): ├── extract.py ├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions └── extract_point.py
extract.py
will be the main extraction script that can either call one of the patch extractions (meteo, optical, sar or worldcereal) or it can call the point extractions.
Starting to look good. What's maybe still confusing to me is that for patch extractions, the different scripts actually contain data cube creation function for specific versions of the patch extractor, with extract_worldcereal.py
the one doing preprocessed patch extractions. On the point side, extract_point.py
I think is the point equivalent of this, as it also gets the worldcereal preprocessed extractions. Shouldn't we align these then? Like:
├── extract.py ├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions │ ├── extract_worldcereal.py
Or is it a bad idea to have duplicate names? Just want to have both of these as consistently named as possible.
Starting to look good. What's maybe still confusing to me is that for patch extractions, the different scripts actually contain data cube creation function for specific versions of the patch extractor, with
extract_worldcereal.py
the one doing preprocessed patch extractions. On the point side,extract_point.py
I think is the point equivalent of this, as it also gets the worldcereal preprocessed extractions. Shouldn't we align these then? Like:├── extract.py ├── patch_extractions │ ├── extract_meteo.py │ ├── extract_optical.py │ ├── extract_sar.py │ └── extract_worldcereal.py └── point_extractions │ ├── extract_worldcereal.py
Or is it a bad idea to have duplicate names? Just want to have both of these as consistently named as possible.
Hmm, the current naming is based on this class:
class ExtractionCollection(Enum):
"""Collections that can be extracted in the extraction scripts."""
SENTINEL1 = "SENTINEL1"
SENTINEL2 = "SENTINEL2"
METEO = "METEO"
WORLDCEREAL = "WORLDCEREAL"
POINTS = "POINTS"
Personally, I'd avoid having two extract_worldcereal
scripts, but we can maybe think of a more descriptive name for the worldcereal
(and maybe point
) script(s).
but what if you'd want for example to extract points for meteo?
Ah, so it should also be possible to extract points for one collection separately?
Not necessarily at the moment, I'm thinking of a setup that is future-proof so to say. Does it make sense?
Ok, makes sense, then I'll adjust the naming!
Should add a kwarg to worldcereal_preprocessed_inputs
that can trigger validate_temporal_context
mypy doesn't allow for two modules with the same name: _scripts/extractions/point_extractions/extract_worldcereal.py: error: Duplicate module named "extract_worldcereal" (also at "./scripts/extractions/patch_extractions/extractworldcereal.py"
So will change the name to extract_point.py
again.
I noticed that for point extractions we only split by S2 tile: https://github.com/WorldCereal/worldcereal-classification/blob/16e50a3fe19d9e6923417ccfc48af74e2e74f8b8/scripts/extractions/point_extractions/point_extractions.py#L241
While in the patch extractions we also split by year: https://github.com/WorldCereal/worldcereal-classification/blob/main/scripts/extractions/patch_extractions/extract.py#L98
Furthermore we're computing the mean
valid_time
to determine extraction window. In case samples span multiple years this is unsafe.It was already addressed in the
patch_extractions\extract.py
. After discussing with @GriffinBabe we think it's a good idea to step away from separate patch and point extraction folders and put thepoint_extractions.py
alongside the other extract scripts inpatch_extractions
, reusing as much as possible joint functionality.As a final step we can get rid of
patch_extractions
folder alltogether and keep things clear and simple for the user.