2. Define a Python function for data processing, for example:
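import glob
import os

import pandas as pd

def data_cleaning(folder_path, extension='csv'):
    """
    Combine data from multiple files within the given folder, and restructure them.
    """
    # Collect every file with the given extension inside folder_path
    # (extension defaults to 'csv', assumed from the read_csv call below)
    all_filenames = glob.glob(os.path.join(folder_path, '*.{}'.format(extension)))
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)
    cleaned_data =
    ......
    return cleaned_data

folder_path = '../data/China'
cleaned_data = data_cleaning(folder_path)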
Another example:
from pyspark.sql import DataFrame
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import FloatType

def preprocess_data(df: DataFrame) -> DataFrame:
    """
    Apply data processing.
    1) Rename columns
    2) Cast column types
    3) Convert the Date column to a timestamp
    """
    # 1) Rename column
    df = df.withColumnRenamed("POS Margin on Net Sales", "Margin")
    # 2) Convert the listed `df` columns to `FloatType()`
    #    (`convertColumn` is a helper assumed to be defined elsewhere)
    columns = ['NetSales', 'QtySold', 'Margin', 'StockQty']
    df = convertColumn(df, columns, FloatType())
    # 3) Convert the Date column to a timestamp
    df = df.withColumn("Date", to_timestamp(df.Date, "yyyyMMdd"))
    return df
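The function above relies on the PySpark DataFrame API (`withColumnRenamed`, `withColumn`). If the data is loaded with pandas instead, the same steps could be sketched roughly as follows. The name `preprocess_data_pandas` is hypothetical, and the sketch assumes the `Date` column holds values such as 20200125.

```python
import pandas as pd


def preprocess_data_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """Rename, cast, and parse dates on a pandas DataFrame (hypothetical pandas variant)."""
    # 1) Rename the column; rename() returns a new DataFrame, so the input is left untouched
    df = df.rename(columns={"POS Margin on Net Sales": "Margin"})
    # 2) Cast the numeric columns to float
    columns = ["NetSales", "QtySold", "Margin", "StockQty"]
    df[columns] = df[columns].astype(float)
    # 3) Parse dates written as yyyyMMdd (assumed format, e.g. 20200125)
    df["Date"] = pd.to_datetime(df["Date"].astype(str), format="%Y%m%d")
    return df
```

Keeping each step as one commented line makes it easy to see which columns are touched and in what order.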
3. Please use relative paths instead of absolute paths for the files, so that we can run your code without changing the file paths.
(Optional) You can combine Combined_data_China.ipynb and Combined_data_International.ipynb into one notebook, because they use very similar code (they can call the same function) and differ only in the path.
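As a rough sketch of how one notebook could cover both datasets with relative paths: the function names `combine_folder` and `combine_regions` are hypothetical, and the `International` folder name is an assumption based on the notebook name (only `../data/China` appears in the example above).

```python
from pathlib import Path

import pandas as pd

# Assumed layout: the notebook sits in a folder next to `data/`,
# so `../data/<region>` resolves the same way on any machine.
DATA_DIR = Path("..") / "data"


def combine_folder(folder_path, extension="csv"):
    """Concatenate every `*.<extension>` file in `folder_path` into one DataFrame."""
    files = sorted(Path(folder_path).glob(f"*.{extension}"))
    if not files:
        raise FileNotFoundError(f"No .{extension} files found in {folder_path}")
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)


def combine_regions(data_dir=DATA_DIR, regions=("China", "International")):
    """Run the same combining step for each region folder under `data_dir`."""
    return {region: combine_folder(Path(data_dir) / region) for region in regions}
```

Calling `combine_regions()` from the notebook would then return one DataFrame per region, with only the folder name differing between the two.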
Hi @chrisyifanjin,
Could you please do some further data processing?
1. Could you please add more information, in the same format as
https://github.com/pdtyreus/coronavirus-ds/blob/master/data/snapshot_jan25_12pm.csv (please use English column names only, which are easier for non-Chinese readers).