I reckon a recipe on this topic might help clarify how to get data in & out of Fugue. I've always found it a little confusing to understand how to get data into Fugue & how the execution engine impacts this.
So with CSV specifically, it's a bit problematic because you normally need a bunch of keyword arguments that may not be unified across execution engines. That is why you feel like you need a custom creator. If you stick with Pandas you won't notice this, but the problem is compatibility with other execution engines. This is why parquet is the preferred file format.
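To make that concrete, here is a rough sketch (with a hypothetical path): once the data is parquet, the same workflow runs on any backend with no format-specific kwargs:

```python
from fugue import FugueWorkflow

# minimal sketch, assuming a parquet file already exists at this
# hypothetical path; no engine-specific kwargs are needed for parquet
dag = FugueWorkflow()
df = dag.load("/tmp/test.parquet")
df.yield_dataframe_as("result")
dag.run("duckdb")  # or "spark", "dask", or the default pandas-based engine
```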
The other thing I want to bring up is that DuckDB should be way faster than Pandas for loading a bunch of files with a wildcard. We can use DuckDB to preprocess these and write out a parquet file for further processing. The current CSV setup is a bit hard to work with, so I would just make a parquet file like this:
```python
import linecache

from fugue_sql import fsql

# grab the real column names from line 2 of one sample file
headers = [c.strip('"') for c in linecache.getline(str(sample_logger_1), lineno=2).strip().split(",")]

res = fsql("""
LOAD "/var/folders/w2/91_v34nx0xs2npnl3zsl9tmm0000gn/T/campbell_scientific_*.csv" (header=TRUE, skip=3, infer_schema=TRUE)
YIELD DATAFRAME AS result
""").run("duckdb")

# re-attach the real headers and persist as parquet for downstream steps
df = res["result"].as_pandas()
df.columns = headers
df.to_parquet("/tmp/test.parquet")
```
And then use the parquet for downstream stuff. You might be able to get it working on SparkSQL also with a few tweaks for big data. I just didn't try because Spark didn't have access to the tempdir for me. If you use Spark, just write out the parquet with Spark too.
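As an untested sketch (hypothetical paths), "write out the parquet with Spark too" would look something like this, though as noted above it may need tweaks:

```python
from fugue_sql import fsql

# untested sketch: same Fugue SQL, but both the load and the parquet
# write happen on Spark; Spark must be able to reach the CSV paths
fsql("""
LOAD "/path/to/campbell_scientific_*.csv" (header=TRUE, skip=3, infer_schema=TRUE)
SAVE OVERWRITE "/path/to/test.parquet"
""").run("spark")
```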
The Creator you made will work for sure; it's just tied to Pandas (unavoidably) unless you take in an engine and fork the logic of that function, as sketched below. If the purpose is just to read a bunch of small files and collect them, this setup should be the most helpful for preprocessing.
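For reference, one way to "take in an engine" is to make the creator's first parameter an `ExecutionEngine`; Fugue injects the engine that is running the workflow, so the function can fork its logic per backend. A rough sketch:

```python
import linecache

import pandas as pd
from fugue import ExecutionEngine

# rough sketch: because the first parameter is annotated as
# ExecutionEngine, Fugue passes in the engine running the workflow;
# you could inspect it here and dispatch to an engine-native reader
# instead of always falling back to pandas
def read_campbell_scientific_textfile(e: ExecutionEngine, filepath: str) -> pd.DataFrame:
    headers = [c.strip('"') for c in linecache.getline(filepath, lineno=2).strip().split(",")]
    return pd.read_csv(filepath, names=headers, skiprows=4)
```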
Does that answer your question?
@rdmolony I think this dataset is special; it is not in standard CSV format. So special handling like what you did:
```python
import linecache

import pandas as pd

def read_campbell_scientific_textfile(filepath: str) -> pd.DataFrame:
    # the real column names live on line 2; the data starts further down
    headers = [c.strip('"') for c in linecache.getline(filepath, lineno=2).strip().split(",")]
    return pd.read_csv(filepath, names=headers, skiprows=5)
```
it makes perfect sense. (`skiprows` should be 4 in your case.)
Reading CSV is already very challenging for any backend, unifying them is harder, and adding special handling is almost impossible. So using a Creator is a great way to get data into Fugue.
@kvnkho has shown how to get data out of Fugue using yield. You can yield multiple dataframes in one workflow or Fugue SQL. Programmatically, you can do:
```python
from fugue import FugueWorkflow

dag = FugueWorkflow()
df = dag.create(
    read_campbell_scientific_textfile, params={"filepath": str(csv_filepaths[0])}
)
df.yield_dataframe_as("result")
dag.run()["result"].as_pandas()
```
Note that we may deprecate the `with` statement for using `FugueWorkflow` soon; it's a confusing design. Please use `dag = FugueWorkflow()` ... `dag.run()` instead.
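A minimal sketch of the two styles side by side:

```python
from fugue import FugueWorkflow

# old style, likely to be deprecated: the workflow runs implicitly
# when the `with` block exits
with FugueWorkflow() as dag:
    dag.create(read_campbell_scientific_textfile, params={"filepath": str(csv_filepaths[0])})

# preferred style: build the workflow, then run it explicitly
dag = FugueWorkflow()
dag.create(read_campbell_scientific_textfile, params={"filepath": str(csv_filepaths[0])})
dag.run()
```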
Thanks a lot @kvnkho & @goodwanghan for your detailed explanations. You've cleared it up. I really like the DuckDB option Kevin, though I personally prefer the functional API for now!
I found adding a step to infer a Fugue schema from sample data to be helpful in loading data with Fugue, as I can use `skiprows` in `pandas` or `skip` in DuckDB alongside `columns` in the functional API to load this CSV without a `pandas` intermediate step.

Do you think `infer_fugue_schema` or similar is something that Fugue would be interested in supporting? Or is it better in general for users to just use custom creators for non-standard data formats like the above?
```python
import pandas as pd
from fugue import FugueWorkflow
from fugue_sql import fsql

# See below for full solution!
# `infer_duckdb_schema` & `infer_fugue_schema` are my own helpers that
# read a sample file & build a schema from it

# --- fsql ---
inferred_duckdb_schema = infer_duckdb_schema(str(csv_filepaths[0]))
# I couldn't pass a fugue schema to `columns`;
# looks like a conflict between fugue `columns` & DuckDB `COLUMNS`
result = fsql(f"""
LOAD '{csv_glob}' (HEADER=TRUE, SKIP=3, COLUMNS={inferred_duckdb_schema}, infer_schema=TRUE)
YIELD DATAFRAME AS result
SAVE OVERWRITE '{parquet_file}'
""").run("duckdb")

# --- functional api ---
inferred_fugue_schema = infer_fugue_schema(str(csv_filepaths[0]))
load_csvs_to_parquet = FugueWorkflow(engine="duckdb")
raw_sensors = load_csvs_to_parquet.load(
    csv_glob, columns=inferred_fugue_schema, skip=4, header=False
)
raw_sensors.save(parquet_file)
load_csvs_to_parquet.run(engine="duckdb")

# sanity-check the output
sample = pd.read_parquet(parquet_file)
```
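For context, `infer_fugue_schema` can be sketched roughly like this (simplified; it assumes the header is on line 2, the data starts on line 5, and pandas' inferred dtypes are acceptable):

```python
import linecache

import pandas as pd
import pyarrow as pa
from triad import Schema

# simplified sketch of the helper used above: read the column names from
# line 2, let pandas infer dtypes from a small sample of rows, then turn
# the resulting pyarrow schema into a Fugue (triad) Schema
def infer_fugue_schema(filepath: str, nrows: int = 100) -> Schema:
    headers = [c.strip('"') for c in linecache.getline(filepath, lineno=2).strip().split(",")]
    sample = pd.read_csv(filepath, names=headers, skiprows=4, nrows=nrows)
    return Schema(pa.Schema.from_pandas(sample, preserve_index=False))
```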
Hey @rdmolony,

Sorry for the late reply. I understand the intention here, but I am hesitant to include it until I see the use case more in the wild. Our goal is to keep the Fugue interface as minimal as possible, and Fugue's loading already has an `infer_schema=TRUE` kwarg, so this would be kind of confusing. It seems the issue here is that the file being ingested has multiple headers.

I will look more into that COLUMNS thing.
As an immediate answer though, you could consider contributing your example under the Recipes here: https://fugue-tutorials.readthedocs.io/tutorials/applications/recipes/index.html
You could make a page that is something like "how to read multi-header files". What do you think?
Sure thing Kevin, that makes sense to me ;)
Hey @kvnkho

I'm back using `fugue` again. I was wondering what the canonical `fugue` method is for loading multiple CSVs into `fugue`? I can write this up as a recipe afterwards if you like.

I have a directory of CSVs that I want to load, where the header row is on the 2nd line & the data starts on the 5th (see the mock-up at the end of this message).

EDIT: I've experimented with a few different methods; in example 4 I explicitly infer the schema & pass it to `FugueWorkflow().load`...

I want to load multiple text files & persist the initial read as `parquet`, what would you recommend?
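For context, each file looks roughly like this (an illustrative mock-up, not real data):

```
"TOA5","logger","CR1000",...        <- line 1: logger metadata
"TIMESTAMP","RECORD","BattV_Avg"    <- line 2: column names
"TS","RN","Volts"                   <- line 3: units
"","","Avg"                         <- line 4: aggregation
"2020-01-01 00:00:00",0,12.63       <- line 5: first data row
```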