fiboa / cli

CLI for fiboa (validation, inspection, schema and file creation, etc.)
https://pypi.org/project/fiboa-cli/
Apache License 2.0

A way to deal with additional data processing on download? #36

Closed cholmes closed 4 months ago

cholmes commented 6 months ago

When working on #31 I tried to directly download the data from the eurocrops source https://zenodo.org/records/8229128/files/FR_2018.zip But the file structure in the downloaded zip is:

Archive:  tmp0erqmpte.zip
   creating: FR_2018/
  inflating: FR_2018/FR_2018_EC21.prj  
  inflating: FR_2018/FR_2018_EC21.dbf  
  inflating: FR_2018/FR_2018_EC21.shp  
  inflating: FR_2018/FR_2018_EC21.cpg  
  inflating: FR_2018/FR_2018_EC21.shx  

Geopandas doesn't like that. I ended up pointing the converter at the local, unzipped file, but now there's no way for people to get the data from the source.
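The nested `FR_2018/` folder is what trips up a direct read of the archive; extracting first and then locating the `.shp` works around it. A minimal stdlib sketch of that pre-processing step (the helper name is hypothetical, not part of fiboa-cli):

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_find(zip_path, suffix=".shp"):
    """Extract an archive into a fresh temp dir and return the first member
    matching `suffix`. The caller is responsible for cleaning the dir up.
    """
    out = Path(tempfile.mkdtemp())
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
    # Search recursively, because the members are nested under FR_2018/.
    return next(out.rglob(f"*{suffix}"))


# The extracted path can then go straight to geopandas, e.g.:
#   gdf = geopandas.read_file(extract_and_find("FR_2018.zip"))
```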

I think it's OK for now: I put in the FlatGeobuf from Source Cooperative https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb which I think should work, though I've not yet tested the 6 GB download in one go. But it seems like it'd be better to have some way to handle this. Perhaps another 'block' in the template where you can put some Python code to do 'pre-processing'?

cholmes commented 6 months ago

Ok, and now I've got a much more complicated one, see https://phys-techsciences.datastations.nl/dataset.xhtml?persistentId=doi:10.17026/dans-xy6-ngg6

This has two datasets that will be part of fieldscapes. It looks like there are 100+ individual GeoPackages covering Cambodia and Vietnam areas. So the ideal 'pre-processing' would need to combine them all into one GeoPackage.

Though I could also just download them all, combine them, put them on Source Cooperative, and have the converter use that - then we wouldn't need to build pre-processing logic into the converter.

(I have memories of some other field boundary dataset that was really weird, but I can't find it now; I'm sure there are other examples.)

m-mohr commented 6 months ago

Yeah, that sounds reasonable. There are some existing solutions in the implementations which may solve it for you.

We could certainly also allow a list of URIs and concatenate them, but that only works for simpler cases. Once different projections etc. are involved, it will need custom code.
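The check that decides whether plain concatenation is enough could look like this (hypothetical helper, not in fiboa-cli; the reprojection itself via geopandas `to_crs` is only sketched in the comment):

```python
def needs_reprojection(crs_list):
    """True if the sources disagree on CRS and must be normalized
    before their features can be concatenated."""
    return len(set(crs_list)) > 1


# Hypothetical geopandas usage:
#   frames = [gpd.read_file(uri) for uri in uris]
#   if needs_reprojection([str(f.crs) for f in frames]):
#       frames = [f.to_crs(frames[0].crs) for f in frames]
```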

m-mohr commented 6 months ago

On the other hand, look at the `at` (Austria) dataset implementation. It's relatively simple to do the extraction and then pick a file.

m-mohr commented 4 months ago

Forgot to close; this is solved now.