OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

CSV driver doesn't honor CSVT sidecar in Dataset.GetFileList(), Driver.CreateCopy(), and other I/O operations #8165

Closed gorloffslava closed 1 year ago

gorloffslava commented 1 year ago

Expected behavior and actual behavior.

Given:

Steps to reproduce the problem.

Reproduction case #1:

from osgeo import gdal
ds = gdal.OpenEx("testcsvt.csv")
ds.GetFileList()

Expected output: ['testcsvt.csv', 'testcsvt.csvt']
Actual output: ['testcsvt.csv']

Reproduction case #2:

from osgeo import gdal
ds_src = gdal.OpenEx("testcsvt.csv")
driver = gdal.GetDriverByName("CSV")
ds_dst = driver.CreateCopy("test_csvt_copy", ds_src)
ds_dst.FlushCache()

import os
os.listdir("test_csvt_copy")

Expected output: ['testcsvt.csv', 'testcsvt.csvt']
Actual output: ['testcsvt.csv']

Reproduction case #3: How can we check that the CSVT is really loaded by GDAL?

import os

import geopandas

gdf = geopandas.read_file("testcsvt.csv")
gdf.info()
"""
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   INTCOL      1 non-null      float64       
 1   REALCOL     1 non-null      float64       
 2   STRINGCOL   2 non-null      object        
 3   INTCOL2     1 non-null      float64       
 4   REALCOL2    1 non-null      float64       
 5   STRINGCOL2  2 non-null      object        
 6   DATETIME    1 non-null      datetime64[ns]
 7   DATE        1 non-null      object        
 8   TIME        1 non-null      object        
 9   geometry    0 non-null      geometry      
dtypes: datetime64[ns](1), float64(4), geometry(1), object(4)
memory usage: 288.0+ bytes
"""
# Typings are applied correctly

os.remove("testcsvt.csvt")
gdf = geopandas.read_file("testcsvt.csv")
gdf.info()
"""
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   INTCOL      2 non-null      object  
 1   REALCOL     2 non-null      object  
 2   STRINGCOL   2 non-null      object  
 3   INTCOL2     2 non-null      object  
 4   REALCOL2    2 non-null      object  
 5   STRINGCOL2  2 non-null      object  
 6   DATETIME    2 non-null      object  
 7   DATE        2 non-null      object  
 8   TIME        2 non-null      object  
 9   geometry    0 non-null      geometry
dtypes: geometry(1), object(9)
memory usage: 288.0+ bytes
"""
# Column types are no longer applied. This is expected, since we deleted the `.csvt` sidecar that defined them.

Operating system

Reproducible w/ any of the following:

GDAL version and provenance

Reproducible w/ any of the following:

jratike80 commented 1 year ago

According to the CSV driver documentation (https://gdal.org/drivers/vector/csv.html), the .csvt file is not created by default. A special layer creation option is required.

CREATE_CSVT=[YES/NO]: Defaults to NO. Create the associated .csvt file (see above paragraph) to describe the type of each column of the layer and its optional width and precision.

gorloffslava commented 1 year ago

Thanks for your response! We use that when writing datasets, yes, and it works.

But in the issue above, we copy a dataset rather than create one from scratch, so we expect all sidecars to be copied automatically, as happens, for example, with GeoTIFF or ESRI Shapefile. Also, the option doesn't seem to affect opening datasets that already have this sidecar.

jratike80 commented 1 year ago

I may be wrong, but doesn't driver.CreateCopy make a copy of the internal representation of the data that GDAL has after opening the source dataset? So it does not copy files even if the source and target formats are the same; instead, the data gets rewritten. Have you tried using the layer creation option as I suggested? Unfortunately I am not a programmer and I can't tell you how to test that.

Maybe https://gdal.org/api/python/osgeo.ogr.html#osgeo.ogr.DataSource.CopyLayer does something similar:

Duplicate an existing layer. This function creates a new layer, duplicate the field definitions of the source layer and then duplicate each features of the source layer. The papszOptions argument can be used to control driver specific creation options. These options are normally documented in the format specific documentation. The source layer may come from another dataset.

rouault commented 1 year ago

I'm working on having GetFileList() report the .csvt file, but you indeed shouldn't expect CreateCopy() to create a .csvt file, even if the source dataset is a .csv file with a .csvt. GDAL output drivers know nothing about input drivers: everything goes through a pivot model that forgets the implementation details. You'd better use a plain file copy if you want to do CSV -> CSV without any change. And as there isn't a way of providing layer creation options in the GDALDataset::CopyLayer() call done by GDALDriver::DefaultCreateCopy(), you'd better use GDALVectorTranslate() instead.
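
The two alternatives mentioned above could be sketched in Python roughly as follows. Both helper names are hypothetical. The first uses only the standard library and assumes the sidecar shares the CSV's base name with a `.csvt` extension; the second assumes the GDAL Python bindings are installed and uses `gdal.VectorTranslate()`, the Python counterpart of GDALVectorTranslate().

```python
import shutil
from pathlib import Path


def copy_csv_with_sidecar(src_csv, dst_dir):
    """Plain file copy of a CSV and its .csvt sidecar, if one exists.

    Hypothetical helper; returns the names of the files that were copied.
    """
    src_csv = Path(src_csv)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # Assumption: the sidecar is "<base name>.csvt" next to the CSV.
    for candidate in (src_csv, src_csv.with_suffix(".csvt")):
        if candidate.exists():
            shutil.copy2(candidate, dst_dir / candidate.name)
            copied.append(candidate.name)
    return copied


def translate_csv_with_csvt(src_csv, dst_csv):
    """Rewrite a CSV through GDAL, asking the CSV driver to write a .csvt."""
    # Imported here so the plain-copy helper works without GDAL installed.
    from osgeo import gdal

    gdal.UseExceptions()
    ds = gdal.VectorTranslate(
        str(dst_csv),
        str(src_csv),
        format="CSV",
        layerCreationOptions=["CREATE_CSVT=YES"],
    )
    ds = None  # drop the reference so the output is closed and flushed
```

The plain copy preserves the files byte for byte, whereas the translate path rewrites the data through the pivot model, so formatting details of the original CSV may not survive.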

gorloffslava commented 1 year ago

@rouault big thanks for fixing this! And for your explanation about CreateCopy() behavior.

@jratike80 big thanks for your assist as well!