43 dtype from csv - Githubissues

rogerkuou commented 1 year ago

Fix #43 and #44

Thanks @fnattino for spotting these issues.

Indeed #43 should be caused by the ambiguous dtype. When writing to zarr, the chunk need to be loaded to determine the dtype. I made a fix by loading 10 rows and determine the dtypes, then passing it to daskdataframe.read_csv. For pnt_id, I specified the data type as str. Then there will be no object type anymore.

I did the following test. Both the slow loading/warning, and the inconsistent chunksize should have been fixed.

Could you try if it has been fixed?

%%timeit
data = stmtools.from_csv(p_data, blocksize=100e6)
data.to_zarr('data1.zarr', mode='w')

5.43 s ± 591 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
data = stmtools.from_csv(p_data, blocksize=100e6)
data_p = data.persist()
data_p.to_zarr('data2.zarr', mode='w')

4.87 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data_chunk = stmtools.from_csv(p_data, blocksize=100e6, output_chunksize={"space":10000, "time":-1})
data_chunk.to_zarr('data_chunk.zarr', mode='w')
data_r = xr.open_zarr('data_chunk.zarr')
data_r

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

rogerkuou commented 1 year ago

@fnattino Thanks for the detailed check!

Actually, I found now if I remove the "external reading" part, the function still works as the same. Thus I made another commit.

Highly likely, the failure of dtype recognition was introduced by my previous implementation, where I tried to convert the entire dask dataframe to a dask array, then loop through all attributes. Now I roll back to the original "column wise" operation and perform the conversion per column. See this part. I found this does not influence my dask graph.

Would you like to give another look?

TUDelftGeodesy / stmtools

43 dtype from csv #46