jrbourbeau opened this issue 1 year ago
Earlier today @douglasdavis and I were looking at the GDELT dataset (https://www.gdeltproject.org), which is publicly available on AWS (https://registry.opendata.aws/gdelt), for the `read_csv` + cleanup + `to_parquet` workflow. It looks promising -- @douglasdavis is going to explore it a bit more.
Just checking in, @douglasdavis -- did you have a chance to explore the GDELT dataset more? Could you share some thoughts on its characteristics and whether you think it'd be a good fit for this?
Still doing some exploring. Something that I think may be a bit of an obstacle: a naïve `read_csv` call:
```python
import dask.dataframe as dd

columns_v1 = ["GlobalEventID", "Day", "MonthYear", "Year", ...]

df = dd.read_csv(
    "s3://gdelt-open-data/events/*.csv",
    names=columns_v1,
    sep="\t",
    storage_options={"anon": True},
)
```
raises

```
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
```
Of course we can fix this with an explicit dtype definition, but I'm wondering if that's something we'd like to avoid in these workflows. Should we try to make naive reads just work™? I can also see this being a place to be explicit with some potential `string[pyarrow]` use. But this is a long list of columns, so it'd be a bit boilerplate-y to spell out every column and its dtype. Curious to hear your thoughts!
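For concreteness, here's a minimal sketch of what the explicit-dtype version might look like. The column subset and dtype mapping below are purely illustrative guesses, not the actual GDELT v1 schema:

```python
import dask.dataframe as dd

# Illustrative subset of the GDELT v1 columns -- the real list is much longer.
columns_v1 = ["GlobalEventID", "Day", "MonthYear", "Year"]

# Hypothetical dtype mapping; the real per-column types would need to be
# pulled from the GDELT documentation (with string[pyarrow] for text columns).
dtypes_v1 = {
    "GlobalEventID": "int64",
    "Day": "int64",
    "MonthYear": "int64",
    "Year": "int64",
}

df = dd.read_csv(
    "s3://gdelt-open-data/events/*.csv",
    names=columns_v1,
    sep="\t",
    dtype=dtypes_v1,
    storage_options={"anon": True},
)
```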
Sneaking in an edit here -- another thought I forgot to add: one "problem" with CSV as a data format is the need to declare by hand what types things are/should be (as opposed to a binary format that already knows the exact types), so perhaps that need in the workflow is a feature to highlight, not necessarily a problem.
Agreed that's an obstacle, but it's one that users will find familiar, so I think including it in the workflow is reasonable. Users may actually appreciate seeing it, as it tends to come up in real-world use cases where things are messy.
Broadly, I think the theme here might be "we shouldn't shy away from real-world pain".
Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.
- `read_csv`, clean up, `to_parquet`
- `read_parquet`, clean up, `to_snowflake`
- `from_zarr`, `rechunk`, `to_zarr`
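As a very rough sketch, the array case (`from_zarr`, `rechunk`, `to_zarr`) might look something like the following; the paths and chunk sizes here are placeholders, not a real dataset:

```python
import dask.array as da

# Placeholder store locations -- not a real dataset.
source = "data/input.zarr"
destination = "data/output.zarr"

arr = da.from_zarr(source)

# Rechunk to whatever layout the downstream consumer wants;
# (1000, 1000) is just an illustrative choice for a 2D array.
arr = arr.rechunk((1000, 1000))

arr.to_zarr(destination)
```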
xref https://github.com/coiled/coiled-runtime/issues/725
@douglasdavis and I spoke about this group of workflows and he's up for owning them. As I mentioned offline, I think https://github.com/coiled/coiled-runtime/pull/724 is a good example of what adding a new workflow looks like.