Define file and folder structure

JamesSample commented 1 year ago

We need to agree on a standard file and folder structure for SeaBee data. This needs to be simple enough to be human-readable (e.g. via the MinIO web interface), but structured enough for processing to be automated. Some considerations:

We should avoid ID codes or unique identifiers in file paths, as these are not human-readable
The system should allow for sensible groupings of data like, for example, the year => mission => flight hierarchy that is currently used. This will make it easier for people to find what they need
We need to allow for a range of data types and sources (aerial drones, otters, RGB, MS, HS etc.)

The system used by Sindre for seabirds is pretty straightforward and covers everything we need at present (I think). The system originally proposed by NIVA seems unnecessarily complicated.

As a starting point, how about something like this for the data from each flight:

region_area_yyyymmdd/
├─ annotation/
├─ dem/
├─ gcp/
├─ ground_truth/
├─ orthomosaic/
├─ other/
├─ raw_images/
config.json

Where:

region is a coarse geographic descriptor (e.g. fylke) and area is more specific (e.g. an island, town or building name).
annotation (optional). Contains any relevant annotation not already in the database, such as geopackages etc.
dem (optional). Contains elevation datasets generated during orthorectification (DSMs and DTMs etc.).
gcp (optional). Ground control points in a standard text format. Note: The format Sindre uses for ODM is slightly different to that used by e.g. Spectrofly for Pix4D. Assuming both are supported by both applications, it would be good to pick one as standard, as then we can easily use ODM to reprocess data from Pix4D etc.
ground_truth (optional). Ground truth data.
orthomosaic (optional). Georeferenced mosaic images created by ODM or Pix4D. Ideally a single, multi-band GeoTiff
other (optional). Anything not included in the other folders (reports, logs etc.)
raw_images (required). Images from a single flight (i.e. images that can be orthorectified to produce a single mosaic). This could also include raw images from e.g. the otter, as long as they are to be stitched together
config.json (required?). I'm still not certain whether this is useful. However, this file could include settings and metadata to control subsequent processing. For example, it could include the Dronelogbook ID (in which case, additional metadata could be extracted via the API during processing), or it might include publication options (e.g. to control whether a dataset is made publicly available on GeoNode). Thoughts?
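As a rough illustration of the layout described above, an empty flight folder could be scaffolded with a few lines of Python. This is only a sketch: the function name `create_flight_folder` and the `config.json` keys (`dronelogbook_id`, `publish`) are made-up placeholders, since the contents of that file are still undecided.

```python
import json
import tempfile
from pathlib import Path

# Subfolder names from the proposal above; only raw_images is marked "required".
SUBFOLDERS = ["annotation", "dem", "gcp", "ground_truth", "orthomosaic", "other", "raw_images"]


def create_flight_folder(parent: Path, region: str, area: str, date: str) -> Path:
    """Create an empty region_area_yyyymmdd folder with the proposed layout."""
    flight_dir = parent / f"{region}_{area}_{date}"
    for name in SUBFOLDERS:
        (flight_dir / name).mkdir(parents=True, exist_ok=True)
    # Placeholder keys only: what belongs in config.json is still an open question.
    config = {"dronelogbook_id": None, "publish": False}
    (flight_dir / "config.json").write_text(json.dumps(config, indent=2))
    return flight_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = create_flight_folder(Path(tmp), "rogaland", "risavika", "20220627")
        print(sorted(p.name for p in path.iterdir()))
```

Creating the optional folders up front is a choice made for the example; leaving them out until they are needed would fit the proposal just as well.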

The general idea is that researchers could organise their data on MinIO more-or-less as they wish, as long as each flight is organised as illustrated above. For example, on MinIO we might have:

niva/
├─ 2022/
│  ├─ runde/
│  │  ├─ more-og-romsdal_remoy_20220831/
│  │  │  ├─ annotation/
│  │  │  ├─ dem/
│  │  │  ├─ gcp/
│  │  │  ├─ ground_truth/
│  │  │  ├─ orthomosaic/
│  │  │  ├─ other/
│  │  │  ├─ raw_images/
│  │  │  ├─ config.json
nina/
├─ 2022/
│  ├─ rogaland_risavika_20220627/
│  │  ├─ dem/
│  │  ├─ gcp/
│  │  ├─ orthomosaic/
│  │  ├─ other/
│  │  ├─ raw_images/
│  │  ├─ config.json

This is just meant as a starting point for discussion. What do you think, @knl88 @deviirmr @awigeon ?

deviirmr commented 1 year ago

I like the above folder structure and agree that we need to follow a simple, human-readable naming convention. Here are a few additions to the current structure, based on the GUI we discussed in the past:

niva/
├─ 2022/
│  ├─ runde/
│  │  ├─ more-og-romsdal_remoy_20220831/
│  │  │  ├─ annotation/
│  │  │  ├─ dem/
│  │  │  ├─ gcp/
│  │  │  ├─ gtp/
│  │  │  ├─ ground_truth/
│  │  │  ├─ orthomosaic/
│  │  │  ├─ raw_images/
│  │  │  ├─ drone_path/
│  │  │  ├─ other/
│  │  │  ├─ config.json

I have added gtp (Ground Truth Point) and drone_path (maybe GeoJSON or KML), but we can rethink whether we need those inputs or not.

In the config.json file, we can also include predefined information such as

weather parameters
drone and sensor details (such as type of drone and sensor)
a contact point for the uploaded data (pilot/user name etc.)
maybe some configuration details for the machine learning

JamesSample commented 1 year ago

Thanks @deviirmr .

ground_truth is already included in my proposed structure. Or did you intend something more specific for the gtp folder you added?

knl88 commented 1 year ago

I think the below is fine, with no spaces, special characters etc. Following the logic, should it be ground-truth and raw-images, if underscore is used as the separator?

region_area_yyyymmdd/
├─ annotation/
├─ dem/
├─ gcp/
├─ ground_truth/
├─ orthomosaic/
├─ other/
├─ raw_images/
config.json

Hmm, I guess we can see what we need to put into config.json :) It would be cool to have a small JSON schema! This could perhaps be a discussion on its own, but I agree it could hold some processing and/or publication options. For publishing, it would also be nice to have a title and abstract; then I think we could also generate the ISO record fairly quickly. The latter is slightly off topic if we think of config.json purely as configuration for the pipeline, although one of the processing elements would need this as input too.
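To make the "small JSON schema" idea concrete, here is a minimal sketch using only the Python standard library. The field names (`title`, `abstract`, `publish`) are illustrative guesses based on the comment above, not an agreed specification, and the hand-written `check_config` only covers the tiny subset of JSON Schema used here. A full validator, such as the third-party `jsonschema` package, could take its place.

```python
import json

# Hypothetical schema: field names are illustrative, not an agreed SeaBee spec.
CONFIG_SCHEMA = {
    "type": "object",
    "required": ["title", "abstract", "publish"],
    "properties": {
        "title": {"type": "string"},
        "abstract": {"type": "string"},
        "publish": {"type": "boolean"},
    },
}

# Only the JSON types used by the schema above are mapped.
_PY_TYPES = {"string": str, "boolean": bool}


def check_config(config: dict, schema: dict = CONFIG_SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    for key in schema["required"]:
        if key not in config:
            problems.append(f"missing required field: {key}")
    for key, rule in schema["properties"].items():
        if key in config and not isinstance(config[key], _PY_TYPES[rule["type"]]):
            problems.append(f"{key} should be of type {rule['type']}")
    return problems


if __name__ == "__main__":
    text = '{"title": "Remoy 2022", "publish": "yes"}'
    for problem in check_config(json.loads(text)):
        print(problem)
```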

JamesSample commented 1 year ago

Good points!

I agree regarding ground-truth and raw-images.

awigeon commented 1 year ago

I feel a bit stuck in the ODM way of doing it, but I will try to widen my view slightly to include NIVA data better.

We need to add somewhere for 3D models/texturing. I haven't really used them yet, but if someone figures out a nice way to use them in the field/on a touchscreen, it could be extremely useful.
To me it seems like the GCP files for Pix4D and ODM are different, and I don't think they accept each other's files. They are easy to translate, though.
report.pdf is very important for quickly seeing the results. I use stdout a lot as well. These could go in either report/ or other/
other/ feels like a place that could get very messy, but it's outside any pipeline I guess, so that is okay for me
NINA will probably not use annotation or ground-truth
Word changes: orthomosaic or orthophoto? raw-images or images?

region_area_yyyymmdd/
├─ annotation/
├─ dem/
├─ gcp/
│  ├─ gcp_list-ODM.txt
│  ├─ gcp_list-Pix4D.txt
├─ ground-truth/
├─ orthomosaic/
├─ other/
├─ raw-images/
├─ report/
│  ├─ report.pdf
│  ├─ stdout.txt
├─ texturing/
config.json

A native ODM installation or the Docker version uses

├─ images/
gcp_list.txt

as its entire input. Do we want to follow that? I think I have concluded that we don't have to, and keeping gcp in a folder might be cleaner.

deviirmr commented 1 year ago

Thanks @deviirmr .

ground_truth is already included in my proposed structure. Or did you intend something more specific for the gtp folder you added?

Sorry, I missed the entries while reading the structure. You are right, you already included ground_truth (gtp).

deviirmr commented 1 year ago

niva/
├─ 2022/
│  ├─ runde/
│  │  ├─ more-og-romsdal_remoy_20220831/
│  │  │  ├─ annotation/
│  │  │  ├─ dem/
│  │  │  ├─ gcp/
│  │  │  ├─ ground_truth/
│  │  │  ├─ orthomosaic/
│  │  │  ├─ other/
│  │  │  ├─ raw_images/
│  │  │  ├─ config.json

@JamesSample : The year in this structure is the data upload year or its flight mission year?

JamesSample commented 1 year ago

Looks good @awigeon !

(Sorry for the slow reply).

I'm happy with all your suggested changes i.e.

region_area_yyyymmdd/
├─ annotation/
├─ dem/
├─ gcp/
│  ├─ gcp_list-ODM.txt
│  ├─ gcp_list-Pix4D.txt
├─ ground-truth/
├─ orthophoto/
├─ other/
├─ images/
├─ report/
│  ├─ report.pdf
│  ├─ stdout.txt
├─ texturing/
config.json

However, I've just had a quick look at reorganising some of the old NIVA data and it's going to be fiddly. In particular:

There are sometimes several flights at the same location on the same date. These are currently distinguished by time, which implies e.g. region_area_yyyymmdd-HHMM in the top-level folder name.
There are even some flights with the same location, date and time (presumably MS and RGB cameras on a single drone?). So then we need e.g. region_area_spec_yyyymmdd-HHMM.

What do you think? We could give it a go, but it feels like it's getting a bit messy. Or we could just use the structure above and then organise as e.g.

niva/
├─ year/
│  ├─ mission/
│  │  ├─ rgb/
│  │  │  ├─ region_area_yyyymmdd/
│  │  ├─ ms/
│  │  │  ├─ region_area_yyyymmdd/

Then there are cases where data from multiple flights has been combined into a single mosaic... argh!

@deviirmr

The year in this structure is the data upload year or its flight mission year?

It's the flight mission year.

awigeon commented 1 year ago

I also sometimes combine images from multiple flights into the same mosaic (and it also varies a bit how the drone handles battery swaps; sometimes it makes a new flight log). Usually that is done by simply adding all photos from multiple flights into the images folder. But I see the problem when you want to connect to Dronelogbook. This is also partly why I haven't gone down that route.

With regard to multiple missions at the same site at the same time: I have handled that with region_area-1_yyyymmdd and region_area-2_yyyymmdd, where 1 and 2 could be whatever. Sometimes I have also rerun mosaics with new settings or GCPs and just made a new "mission" that way. We could do the same with RGB vs MS.

HegeGundersen commented 1 year ago

Ah, this is so needed! Thank you @JamesSample for the initiative, and everyone for the good suggestions. My only comment at this point is that I think it's very useful to have Project or Mission (e.g. MASSIMAL or KELPMAP) high up in the hierarchy (maybe instead of Region). In many cases the Mission is actually the Region, so then it will be the same.

awigeon commented 1 year ago

@HegeGundersen In secret I do this already, i.e. writing the project or some other grouping instead of an actual region. I would say region could be used more to group the flights in the folder, in whatever way you like. I don't really see a problem with that. Georeferencing comes from the files anyway.

@JamesSample I usually have solved this by simply

niva/
├─ year/
│  ├─  grouping_areaRGB_yyyymmdd/
│  ├─  grouping_areaMS_yyyymmdd/
│  ├─  grouping_areaRGB1_yyyymmdd/
│  ├─  grouping_areaRGB2_yyyymmdd/
│  ├─  grouping_areaMS1_yyyymmdd/
│  ├─  grouping_areaMS2_yyyymmdd/

I think that is a cleaner folder structure, and I think it is flexible enough to cover all cases. It is, however, slightly hard for a computer to read. Maybe adding a _ or - somewhere would help.

HegeGundersen commented 1 year ago

One more thing: NIVA and SpectroFly often join the same campaigns. It would then be useful to have their datasets grouped, and not in completely different institution folders. But this will maybe f... up the whole structure...?

JamesSample commented 1 year ago

@HegeGundersen If we can agree on a grouping/naming convention for the data from each unique flight-camera combination (something like the structure proposed by Sindre, above), I don't think this is a big problem in terms of reading data automatically - users can group the separate flight folders however they want, including e.g. having a single project folder with data from both NIVA and Spectrofly.

The code on the platform can just search for folders containing a file named e.g. seabee-config.json. Any folder containing a file like this would be identified as a "flight folder" and assumed to follow the subfolder structure outlined above, no matter how deeply nested it is in the "parent" folder structure.
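A minimal sketch of that search, assuming the data are reachable as an ordinary directory tree (for example a local copy or a mounted bucket; MinIO's own API is not used here). The function name `find_flight_folders` is hypothetical, and `seabee-config.json` is just the example file name from the paragraph above.

```python
import os
import tempfile
from pathlib import Path

CONFIG_NAME = "seabee-config.json"  # example file name taken from the comment above


def find_flight_folders(root: Path) -> list[Path]:
    """Return every folder under root that directly contains CONFIG_NAME."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if CONFIG_NAME in filenames:
            found.append(Path(dirpath))
            # Assumption: flight folders are never nested inside each other,
            # so there is no need to descend any further.
            dirnames.clear()
    return sorted(found)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        flight = Path(tmp) / "niva" / "2022" / "runde" / "more-og-romsdal_remoy_20220831"
        flight.mkdir(parents=True)
        (flight / CONFIG_NAME).write_text("{}")
        for folder in find_flight_folders(Path(tmp)):
            print(folder.relative_to(tmp))
```

Because only the presence of the config file matters, the parent folders can be arranged by organisation, project or anything else without changing this code.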

Mixing data from different organisations does make it harder in terms of browsing folders manually, though. The original suggestion was to make separate "buckets" for each organisation (niva, nina, ntnu, spectrofly etc.) and then each would do something like what Sindre proposes in his post above. If we want the NIVA and Spectrofly data together, then for manual browsing we might need to add the organisation to the name of the flight folders e.g.

grouping_area_org_spec_num_yyyymmdd/

We can try it like this, but it gets increasingly fiddly and fragile. Ideally, I want to avoid having to extend the folder naming conventions to cover all the edge cases, because then I think we'll end up back with something like the current NIVA structure, which is hard to work with.

awigeon commented 1 year ago

I would vote strongly to keep it simple, and to avoid org as a required field. Does it really matter who it was flown by? What about having only three required fields, separated by _, and then adding all the extra info after -

grouping_area-orgspecnum_yyyymmdd
Runde_Tarevågen-NIVARGB1_20220818

Or actually, if it just scans for folders with a config file, the naming scheme of their parent folders probably doesn't matter at all.

JamesSample commented 1 year ago

I agree about keeping it simple.

In terms of automatic processing, none of this matters - the easiest option would probably be to have a random database ID for the folder name and all the metadata in config.json.

However, I think we also need something that is human-readable, because users still want to browse the data via e.g. the MinIO UI. In this context, it is important that people can identify which datasets are from NIVA and which from Spectrofly etc., without having to open config.json or explore other file metadata. Ideally, it should be obvious from the folder names.

I like your suggestion above. Or maybe just have the first three elements mandatory, then optional info afterwards

grouping_area_yyyymmdd_org-spec-num

?
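For what it's worth, a name with three mandatory elements and an optional trailing block is still easy to parse. The sketch below assumes that grouping and area never contain underscores and that the optional part is hyphen-separated; `parse_flight_name` is a hypothetical helper, not existing platform code.

```python
import re
from datetime import datetime

# grouping_area_yyyymmdd with an optional trailing _org-spec-num block.
# Assumes grouping and area contain no underscores (hyphens are allowed).
PATTERN = re.compile(
    r"^(?P<grouping>[^_]+)_(?P<area>[^_]+)_(?P<date>\d{8})(?:_(?P<extra>[^_]+))?$"
)


def parse_flight_name(name: str) -> dict:
    """Split a flight folder name into its mandatory and optional parts."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid flight folder name: {name!r}")
    parts = match.groupdict()
    # strptime also rejects impossible dates such as 20221340.
    parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    # The optional info is split on hyphens; its meaning is left to the uploader.
    parts["extra"] = parts["extra"].split("-") if parts["extra"] else []
    return parts


if __name__ == "__main__":
    print(parse_flight_name("Runde_Remoy_20220831_NIVA-RGB-1"))
    print(parse_flight_name("Runde_Remoy_20220831"))
```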

medyang commented 1 year ago

Hi everybody,

I'm here with 2 cents :) :) ...

We often have multiple sites and /or multiple flights in one day/area/project. Would be good to have a folder with a specific area of interest that contains the processed/map data (pix4d for example)

James, the latest post here is from April. Would you be able to post the latest structure we currently use in MinIO so I can suggest some edits?

Cheers Medyan

JamesSample commented 1 year ago

Hi @medyang,

The latest structure is the one documented here. Rather than duplicate it in this issue, I'd rather just maintain one version in the documentation, but feel free to suggest changes here (and link to the webpage when necessary).

We often have multiple sites and /or multiple flights in one day/area/project.

Can you explain why this is a problem with the proposed structure, please? I don't immediately see any issue (and Sindre was recently flying up to 40 missions per day using the grouping_area_yyyymmddHHMM scheme).

If you have multiple sites within one region, then use grouping = region and area = site e.g. Runde_Remoy_yyyymmddHHMM. If you have multiple flights in the same area, just distinguish them by time (and number the areas, if you wish) e.g. Runde_Remoy_202208310822 or Runde_Remoy3_202208310822.
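As a small illustration, a folder name in the 12-digit grouping_area_yyyymmddHHMM form can be built from the flight start time. The helper name `flight_folder_name` is hypothetical, and rejecting underscores inside grouping and area is an assumption made so that the three parts stay unambiguous.

```python
from datetime import datetime


def flight_folder_name(grouping: str, area: str, start: datetime) -> str:
    """Build a grouping_area_yyyymmddHHMM folder name from a flight start time."""
    for part in (grouping, area):
        # Assumption: underscores are reserved as the separator between parts.
        if not part or "_" in part or "/" in part:
            raise ValueError(f"invalid name component: {part!r}")
    return f"{grouping}_{area}_{start:%Y%m%d%H%M}"


if __name__ == "__main__":
    # Two flights over the same area on the same day differ only by time.
    print(flight_folder_name("Runde", "Remoy", datetime(2022, 8, 31, 8, 22)))
    print(flight_folder_name("Runde", "Remoy", datetime(2022, 8, 31, 10, 45)))
```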

If you can provide an example of a SeaBee mission/campaign that you think doesn't fit the proposed structure, I'll take a look at the data on Sharepoint/MinIO and see what changes are necessary.

medyang commented 1 year ago

Hi @JamesSample. OK, I see; we can work with that, since it separates flights. So, for example, Pix4D files would just be kept in the orthophoto folder... Would we just dump all the .tif files into that ortho folder, or can they be left in the folders designated by the program?

JamesSample commented 1 year ago

Hi @medyang,

The main aim at the moment is to make the "core" datasets (orthophotos, raw images, GCPs etc.) fairly obvious and easy to find, regardless of whether the processing is done using ODM or Pix4D. The exact files generated by each piece of software will be different, but the fundamental outputs should be the same (I think).

So for a typical Pix4D mission you might upload data something like this:

Runde_Remoy_202208310822
├── dem
├── gcp
│   └── gcp_list-Pix4D.txt
├── ground-truth
├── images
│   ├── img_0001.jpg
│   └── img_0002.jpg
├── orthophoto
│   └── pix4d_orthophoto.original.tif
├── other
├── report
│   └── pix4d-report.pdf
├── texturing
└── config.yaml

This makes all the "core" components (raw images, GCPs, orthophoto and report) pretty obvious. All other outputs relating specifically to this flight (e.g. anything else generated by Pix4D that doesn't clearly belong in one of the other folders) should be placed within other. If you wish, you can dump the entire Pix4D output directory into other too, but we still need the raw images within the images subfolder and the final orthophoto within the orthophoto subfolder.
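A quick check for those core components could look something like the sketch below. It assumes the folder names shown in the tree above and that the orthophoto has a `.tif` extension; `missing_core_components` is a hypothetical helper rather than part of the platform.

```python
import tempfile
from pathlib import Path


def missing_core_components(flight_dir: Path) -> list[str]:
    """List the core items that are absent from a flight folder."""
    missing = []
    images = flight_dir / "images"
    # The raw images must be present, so an empty folder counts as missing.
    if not images.is_dir() or not any(images.iterdir()):
        missing.append("images")
    ortho = flight_dir / "orthophoto"
    # Assumption: the final orthophoto is a .tif file directly in this folder.
    if not ortho.is_dir() or not any(ortho.glob("*.tif")):
        missing.append("orthophoto")
    if not (flight_dir / "config.yaml").is_file():
        missing.append("config.yaml")
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        flight = Path(tmp) / "Runde_Remoy_202208310822"
        (flight / "images").mkdir(parents=True)
        (flight / "images" / "img_0001.jpg").write_bytes(b"")
        print(missing_core_components(flight))
```

Everything else (dem, gcp, report, other and so on) is treated as optional here, which matches the earlier description but may need tightening once the specification settles.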

If the Pix4D folder structure is always the same, we can write a script to reorganise things automatically. So you could just dump the entire Pix4D output (renamed like Runde_Remoy_202208310822), then specify pix4d: true or something in config.yaml, and the platform would handle the rest. This needs further investigation, though, and a good starting point would be to have some Pix4D output organised roughly as shown above so we can do some tests.

Ultimately we'd like to switch from Pix4D to ODM and do the mosaicing on the platform itself, but first we need to do a rigorous quality comparison to ensure results from ODM are comparable.

Does that make sense?

medyang commented 1 year ago

Hi @JamesSample Excuse the delay getting back to you. Everything you mention makes sense, i think we can work with that. I will ask @knl88 for another tutorial :)

Q: Would we be able to include an altitude and sensor type identifier in the main folder label? At a minimum the sensor type, as we often fly with multiple sensors (RGB and MSI), so flight times are identical. For example: Runde_Remoy_202208310822_msi_120, or at least: Runde_Remoy_202208310822_msi

Other options for _postfixes would be: hsi / rgb

Regards Medyan

HegeGundersen commented 1 year ago

I agree with @medyang that sensor type and flight height is very useful (I would say crucial) information. Either in folder or file name.

awigeon commented 1 year ago

Why not Runde_Remoy-msi120_202208310822?

awigeon commented 1 year ago

@HegeGundersen In secret I do this already, i.e. writing the project or some other grouping instead of an actual region. I would say region could be used more to group the flights in the folder, in whatever way you like. I don't really see a problem with that. Georeferencing comes from the files anyway.

@JamesSample I usually have solved this by simply
niva/
├─ year/
│  ├─  grouping_areaRGB_yyyymmdd/
│  ├─  grouping_areaMS_yyyymmdd/
│  ├─  grouping_areaRGB1_yyyymmdd/
│  ├─  grouping_areaRGB2_yyyymmdd/
│  ├─  grouping_areaMS1_yyyymmdd/
│  ├─  grouping_areaMS2_yyyymmdd/
I think that is a cleaner folder structure, and I think it is flexible enough to cover all cases. It is, however, slightly hard for a computer to read. Maybe adding a _ or - somewhere would help.

Or like this ^

HegeGundersen commented 1 year ago

niva/
├─ year/
│  ├─  grouping_areaRGB_yyyymmdd/
│  ├─  grouping_areaMS_yyyymmdd/
│  ├─  grouping_areaRGB1_yyyymmdd/
│  ├─  grouping_areaRGB2_yyyymmdd/
│  ├─  grouping_areaMS1_yyyymmdd/
│  ├─  grouping_areaMS2_yyyymmdd/

I guess this would work fine (e.g. RGB120)

JamesSample commented 1 year ago

@medyang @HegeGundersen @awigeon

I agree that sensor type and altitude are important; the question is where to store them. As Sindre points out, various options were discussed earlier in this thread, but it's hard to find a structure that pleases everyone.

What do you think of the following options?

Option 1: Use the existing spec

The grouping parameter in the existing specification is flexible. For example, for Runde you could do:

niva
└── 2022
    └── Runde
        ├── NIVA-RGB-120m_Remoy_202208310822
        ├── NIVA-MS-120m_Remoy_202208310822
        ├── Spectrofly-RGB-80m_Remoy_202208311045
        └── Spectrofly-MS-80m_Remoy_202208311045

I'd say this is a pretty sensible way to define flight "groupings" for a particular area. Sindre's suggestion above also works, although I think grouping is perhaps more appropriate than area for this kind of information.

This approach is flexible, in the sense that Sindre can use it in the way he wants for his seabirds surveys, and NIVA can also use it to add additional information, if desired.

Option 2: Make `org`, `spec` and `height` mandatory in the file path

For example grouping_area_org_spec_elevation_yyyymmddHHMM. This was discussed originally and would satisfy NIVA, but it would be annoying for NINA (and possibly others too).

Option 3: Make `org`, `spec` and `height` optional in the file path

As discussed previously, we could do grouping_area_yyyymmddHHMM_org-spec-elev, where the first three parameters are mandatory and everything after the last underscore is optional. This is more flexible, but also more error-prone (and I'd like to keep things simple if possible).

Option 4: Include these details in `config.yaml`

config.yaml is a good place for additional metadata, and things like camera info and flight elevation would fit well here. With this approach, the camera type and elevation would not be immediately obvious from the file path, but you could get it fairly easily from config.yaml (and most people will search for data on GeoNode anyway, and then only go to MinIO once they know what they want).

I think option 1 works OK and option 4 also makes sense. Option 2 is going to annoy some users and Option 3 makes the system complicated and is therefore likely to be done wrongly.

I don't think any single system will satisfy everyone, so what compromise do you prefer? (I don't mind - it's the pilots and ecologists that will mostly need to use it, not me ;-) )

medyang commented 1 year ago

@JamesSample @HegeGundersen Option 3 gets my vote. The trouble with making it optional is that it gets forgotten, even by those it could help. But choice is good, at least in the beginning. And @awigeon's suggestion for keeping it simple is quite OK, although I think we can stick to the flight time as an ID rather than needing to number flights. For example:

niva
└── 2022
    └── Runde
        ├── NIVA-RGB120_Remoy_202208310822
        ├── NIVA-MSI120_Remoy_202208310822
        ├── Spectrofly-RGB80_Remoy_202208311045
        └── Spectrofly-MSI80_Remoy_202208311045

Option 4 does not help much from a pilot perspective and data organizing post flight IMHO.

HegeGundersen commented 1 year ago

Thanks for the overview @JamesSample. From my perspective and usage, I would prefer not having to look into config.yaml to see what kind of image it is. I have not yet worked a lot with these files, but at least when I work with the images via WMS in ArcGIS, I need to know at a glance what kind of image it is (sensor and height). But that is maybe another issue (pardon me if I am mixing things up here ;-)).

medyang commented 1 year ago

@JamesSample Would this top-level folder structure be OK? (Flight folders would then sort in flight order.)

niva_202208310822_Runde_Remoy_rgb_100
├── dem
├── gcp
│   └── gcp_list-Pix4D.txt
├── ground-truth
├── images
│   ├── img_0001.jpg
│   └── img_0002.jpg
├── orthophoto
│   └── pix4d_orthophoto.original.tif
├── other
├── report
│   └── pix4d-report.pdf
├── texturing
└── config.yaml

JamesSample commented 1 year ago

Hi @medyang,

Sorry for the delay. The names you propose are fine by me, but they're quite similar to some of the options proposed above that Sindre felt were too complicated.

To avoid going around in circles, how about this:

Pilots/organisations can name and organise the flight folders however they like, as long as the contents are structured as documented in the specification. For example:

flight_folder_with_custom_name_that_works_for_me
├── dem
├── gcp
│   └── gcp_list-Pix4D.txt
├── ground-truth
├── images
│   ├── img_0001.jpg
│   └── img_0002.jpg
├── orthophoto
│   └── pix4d_orthophoto.original.tif
├── other
├── report
│   └── pix4d-report.pdf
├── texturing
└── config.yaml

We add grouping, area and datetime as mandatory parameters in config.yaml, and we also add spectrum_type and flight elevation as optional parameters.
Our code will scan for files named config.yaml and will read all metadata from there, rather than attempting to parse the folder name. Every folder containing a file named config.yaml will be assumed to be a flight folder with data organised as above (but the folder name itself will be ignored and any supporting information will be read from config.yaml).

The advantage of this is that each pilot/organisation/whoever can organise their flight data as they wish, as long as the data from each flight are in a single folder and consistently structured. This gives users complete flexibility regarding how they structure their data, and avoids having to write complicated and potentially fragile code that attempts to extract metadata from file paths (which I think we should keep to an absolute minimum).

The downside is that pilots will need to make sure config.yaml is filled-in correctly, because the code will not attempt to automatically parse folder names (like it does at present).
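Since correctness then rests on config.yaml, a simple validation step could catch most mistakes at upload time. The sketch below assumes the file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), that the keys are spelled `grouping`, `area`, `datetime`, `spectrum_type` and `elevation`, and that `datetime` uses the yyyymmddHHMM form. All of these are assumptions for illustration, not the agreed specification.

```python
from datetime import datetime

MANDATORY = ("grouping", "area", "datetime")
OPTIONAL = ("spectrum_type", "elevation")  # key spellings are guesses, see above


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems found in parsed config.yaml metadata."""
    errors = [f"missing mandatory key: {key}" for key in MANDATORY if key not in meta]
    if "datetime" in meta:
        try:
            # Assumed format: yyyymmddHHMM, e.g. 202208310822.
            datetime.strptime(str(meta["datetime"]), "%Y%m%d%H%M")
        except ValueError:
            errors.append("datetime should look like yyyymmddHHMM")
    if "elevation" in meta and not isinstance(meta["elevation"], (int, float)):
        errors.append("elevation should be a number")
    return errors


if __name__ == "__main__":
    good = {"grouping": "Runde", "area": "Remoy", "datetime": "202208310822",
            "spectrum_type": "rgb", "elevation": 100}
    bad = {"grouping": "Runde", "datetime": "2022-08-31"}
    print(validate_metadata(good))
    print(validate_metadata(bad))
```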

What do you think @medyang, @knl88, @awigeon, @HegeGundersen?

The alternatives seem to be either requiring NINA to include additional information that Sindre feels is unnecessary, or asking NIVA to leave out information that people feel is important. From a platform perspective it doesn't matter at all - I just need the data files to be organised consistently relative to config.yaml.

Thanks!

HegeGundersen commented 1 year ago

I like this pragmatic and flexible approach. I assume then that NIVA and SpectroFly add fly height and sensor type to the flight folder name? Alternatively in the image name?

JamesSample commented 1 year ago

@HegeGundersen Yes, with this approach NIVA could name the flight folders according to Medyan's suggestion above (e.g. niva_202208310822_Runde_Remoy_rgb_100) and NINA could continue to use grouping_area_yyyymmddHHMM, like we have so far. The code on the platform wouldn't attempt to read the folder names at all - it'd just look for the config.yaml files and then search the containing folders for the expected subdirectories (images, gcp etc.).

medyang commented 1 year ago

flight_folder_with_custom_name_that_works_for_me
├── dem
├── gcp
│   └── gcp_list-Pix4D.txt
├── ground-truth
├── images
│   ├── img_0001.jpg
│   └── img_0002.jpg
├── orthophoto
│   └── pix4d_orthophoto.original.tif
├── other
├── report
│   └── pix4d-report.pdf
├── texturing
└── config.yaml

@JamesSample This is good, and I think it will work for us at NIVA.

SeaBee-no / documentation