coiled / benchmarks

Data loading and cleaning #726

Open jrbourbeau opened 1 year ago

jrbourbeau commented 1 year ago

Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.

xref https://github.com/coiled/coiled-runtime/issues/725

@douglasdavis and I spoke about this group of workflows and he's up for owning them. As I mentioned offline, I think https://github.com/coiled/coiled-runtime/pull/724 is a good example of what adding a new workflow looks like.

jrbourbeau commented 1 year ago

Earlier today @douglasdavis and I were looking at the GDELT dataset (https://www.gdeltproject.org), which is publicly available on AWS (https://registry.opendata.aws/gdelt), for the read_csv + cleanup + to_parquet workflow. It looks promising -- @douglasdavis is going to explore it a bit more
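
For concreteness, here's a rough sketch of the shape that workflow could take. This is not the benchmark code -- the cleanup step and output bucket are placeholders, and column/dtype handling is discussed further down in this thread:

```python
import dask.dataframe as dd

# Hypothetical skeleton of the read_csv + cleanup + to_parquet workflow:
# read the public GDELT CSVs, apply a placeholder cleanup step, write parquet.
df = dd.read_csv(
    "s3://gdelt-open-data/events/*.csv",
    sep="\t",
    storage_options={"anon": True},
)
df = df.dropna(how="all")  # placeholder cleanup step
df.to_parquet("s3://<your-bucket>/gdelt-cleaned/")  # hypothetical output bucket
```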

jrbourbeau commented 1 year ago

Just checking in, @douglasdavis did you have a chance to explore the GDELT dataset more? Could you give some thoughts on some of its characteristics and if you think it'd be a good fit for this?

douglasdavis commented 1 year ago

Still doing some exploring. Something that I think may be a bit of an obstacle: a naïve `read_csv` call:

```python
import dask.dataframe as dd

columns_v1 = ["GlobalEventID", "Day", "MonthYear", "Year", ...]

df = dd.read_csv(
    "s3://gdelt-open-data/events/*.csv",
    names=columns_v1,
    sep="\t",
    storage_options={"anon": True},
)
```

raises

```
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
```

Of course we can fix this with an explicit dtype definition, but I'm wondering if that's something we'd like to avoid in these workflows. Should we try to make naïve reads just work™? I can also see this being a place to be explicit, with some potential string[pyarrow] use (rough sketch below the full exception). But this is a long list of columns, so spelling out all of them and their dtypes would be a bit boilerplatey. Curious to hear your thoughts!

full exception:

```
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------+---------+----------+
| Column                | Found   | Expected |
+-----------------------+---------+----------+
| Actor1Code            | object  | float64  |
| Actor1CountryCode     | object  | float64  |
| Actor1EthnicCode      | object  | float64  |
| Actor1Geo_ADM1Code    | object  | float64  |
| Actor1Geo_CountryCode | object  | float64  |
| Actor1Geo_Fullname    | object  | float64  |
| Actor1Geo_Lat         | float64 | int64    |
| Actor1Geo_Long        | float64 | int64    |
| Actor1KnownGroupCode  | object  | float64  |
| Actor1Name            | object  | float64  |
| Actor1Religion1Code   | object  | float64  |
| Actor1Religion2Code   | object  | float64  |
| Actor1Type1Code       | object  | float64  |
| Actor1Type2Code       | object  | float64  |
| Actor1Type3Code       | object  | float64  |
| Actor2EthnicCode      | object  | float64  |
| Actor2KnownGroupCode  | object  | float64  |
| Actor2Religion1Code   | object  | float64  |
| Actor2Religion2Code   | object  | float64  |
| Actor2Type2Code       | object  | float64  |
| Actor2Type3Code       | object  | float64  |
+-----------------------+---------+----------+

The following columns also raised exceptions on conversion:

- Actor1Code
  ValueError("could not convert string to float: 'AFR'")
- Actor1CountryCode
  ValueError("could not convert string to float: 'AFR'")
- Actor1EthnicCode
  ValueError("could not convert string to float: 'baq'")
- Actor1Geo_ADM1Code
  ValueError("could not convert string to float: 'NI'")
- Actor1Geo_CountryCode
  ValueError("could not convert string to float: 'NI'")
- Actor1Geo_Fullname
  ValueError("could not convert string to float: 'Nigeria'")
- Actor1KnownGroupCode
  ValueError("could not convert string to float: 'EEC'")
- Actor1Name
  ValueError("could not convert string to float: 'AFRICA'")
- Actor1Religion1Code
  ValueError("could not convert string to float: 'ATH'")
- Actor1Religion2Code
  ValueError("could not convert string to float: 'CTH'")
- Actor1Type1Code
  ValueError("could not convert string to float: 'BUS'")
- Actor1Type2Code
  ValueError("could not convert string to float: 'LAB'")
- Actor1Type3Code
  ValueError("could not convert string to float: 'MIL'")
- Actor2EthnicCode
  ValueError("could not convert string to float: 'per'")
- Actor2KnownGroupCode
  ValueError("could not convert string to float: 'IRC'")
- Actor2Religion1Code
  ValueError("could not convert string to float: 'MOS'")
- Actor2Religion2Code
  ValueError("could not convert string to float: 'CTH'")
- Actor2Type2Code
  ValueError("could not convert string to float: 'MIL'")
- Actor2Type3Code
  ValueError("could not convert string to float: 'MIL'")

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Actor1Code': 'object',
       'Actor1CountryCode': 'object',
       'Actor1EthnicCode': 'object',
       'Actor1Geo_ADM1Code': 'object',
       'Actor1Geo_CountryCode': 'object',
       'Actor1Geo_Fullname': 'object',
       'Actor1Geo_Lat': 'float64',
       'Actor1Geo_Long': 'float64',
       'Actor1KnownGroupCode': 'object',
       'Actor1Name': 'object',
       'Actor1Religion1Code': 'object',
       'Actor1Religion2Code': 'object',
       'Actor1Type1Code': 'object',
       'Actor1Type2Code': 'object',
       'Actor1Type3Code': 'object',
       'Actor2EthnicCode': 'object',
       'Actor2KnownGroupCode': 'object',
       'Actor2Religion1Code': 'object',
       'Actor2Religion2Code': 'object',
       'Actor2Type2Code': 'object',
       'Actor2Type3Code': 'object'}

to the call to `read_csv`/`read_table`.
```
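
One possible shape for the explicit-dtype route -- a minimal sketch, reusing the columns_v1 list from the snippet above and assuming pyarrow is installed for the string[pyarrow] dtype. Only a few of the flagged columns are shown:

```python
import dask.dataframe as dd

# Sketch only: spell out the dtypes that inference gets wrong, using
# string[pyarrow] for the string-heavy code/name columns.
dtypes = {
    "Actor1Code": "string[pyarrow]",
    "Actor1CountryCode": "string[pyarrow]",
    # ... one "string[pyarrow]" entry per code/name column in the traceback ...
    "Actor1Geo_Lat": "float64",
    "Actor1Geo_Long": "float64",
}

df = dd.read_csv(
    "s3://gdelt-open-data/events/*.csv",
    names=columns_v1,  # full GDELT v1 column list, as above
    sep="\t",
    dtype=dtypes,
    storage_options={"anon": True},
)
```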

Sneaking in an edit here -- another thought I forgot to add: one "problem" with CSV as a data format is the need to declare by hand what types things are/should be (as opposed to a binary format that already knows the exact types), so perhaps that need in the workflow is a feature to highlight, not necessarily a problem.
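
For what it's worth, the contrast is easy to demonstrate locally -- a toy example, assuming pandas with pyarrow installed, nothing GDELT-specific:

```python
import pandas as pd

# Parquet carries the schema with the data; CSV forces re-inference.
pdf = pd.DataFrame({"code": ["AFR", "NI"], "lat": [9.08, 8.68]})
pdf = pdf.astype({"code": "string[pyarrow]"})

pdf.to_parquet("events.parquet")
print(pd.read_parquet("events.parquet").dtypes)  # dtypes round-trip intact

pdf.to_csv("events.csv", index=False)
print(pd.read_csv("events.csv").dtypes)  # types re-inferred (object, float64)
```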

jrbourbeau commented 1 year ago

Agreed that's an obstacle, but it's one that users will find familiar, so I think including it in the workflow is reasonable. Users may actually appreciate seeing it, as it reflects real-world use cases where things are messy.

mrocklin commented 1 year ago

Broadly I think that the theme here might be "we shouldn't shy away from real-world pain"