YiranJing / Coronavirus-Epidemic-COVID-19

👩🏻‍⚕️ COVID-19 estimation and forecasting using a statistical model (Jan 2020)
243 stars 69 forks

dataset format and coding improve #2

Closed YiranJing closed 4 years ago

YiranJing commented 4 years ago

Hi @chrisyifanjin,

Can you please do some further work on the data processing?

1. Can you please add more information, in the same format as:

https://github.com/pdtyreus/coronavirus-ds/blob/master/data/snapshot_jan25_12pm.csv (please use English column names only, which are easier for non-Chinese readers)

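2. Define a Python function for the data processing, for example:

import glob
import os

import pandas as pd

def data_cleaning(folder_path, extension='csv'):
    """
    Combine data from multiple files within the given folder, and restructure them
    """
    # `extension` was undefined in the first draft of this example; 'csv' is assumed here
    all_filenames = glob.glob(os.path.join(folder_path, '*.{}'.format(extension)))
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)
    cleaned_data = combined_csv  # placeholder: the actual cleaning steps go here
    # ......
    return cleaned_data

folder_path = '../data/China'
cleaned_data = data_cleaning(folder_path)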

Another example (this one uses the PySpark DataFrame API rather than pandas):

from pyspark.sql import DataFrame
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import FloatType

def preprocess_data(df: DataFrame) -> DataFrame:
    """
    Apply data processing.
        1)  Rename columns
        2)  Cast column types
    """
    # 1)  Rename column
    df = df.withColumnRenamed("POS Margin on Net Sales", "Margin")

    # 2)  Convert the `df` columns to `FloatType()`
    columns = ['NetSales', 'QtySold', 'Margin', 'StockQty']
    # convertColumn is a user-defined helper that is not shown in this comment
    df = convertColumn(df, columns, FloatType())
    # Convert Date column to timestamp
    df = df.withColumn("Date", to_timestamp(df.Date, "yyyyMMdd"))

    return df

3. Please use relative paths instead of absolute paths for files, so that we can run your code without changing the file paths:

i.e. instead of

combined_data=combined_csv.to_csv('/Users/jinyifan/Desktop/Coronavirus-Epidemic-2019-nCov/Data_processing/Data_pro_China.csv',header=True, index=False)

## relative path
combined_data=combined_csv.to_csv('../Data/Conbined_data/China/Data_pro_China.csv',header=True, index=False)
4. (Optional) You can combine Combined_data_China.ipynb and Combined_data_International.ipynb into a single notebook, because they use very similar code (both can call the same function); just pass a different path!
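
To illustrate points 3 and 4 together, here is a minimal sketch of one shared function that is called once per region with a different relative path. It is only an illustration, not the repository's code: the names combine_csv_folder and combine_all, the "combined" output folder and the per-region file names are all hypothetical, and it uses the standard library's csv module instead of pandas so that it runs without extra dependencies.

```python
import csv
from pathlib import Path


def combine_csv_folder(folder, output_file):
    """Merge every CSV file in `folder` into `output_file`.

    Returns the number of data rows written. Hypothetical helper for illustration.
    """
    folder = Path(folder)
    output_file = Path(output_file)
    rows = []
    fieldnames = []
    # Sort the file names so the output order is reproducible
    for csv_path in sorted(folder.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of all column names, in first-seen order
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", newline="", encoding="utf-8") as handle:
        # restval fills columns that are missing from some input files
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def combine_all(data_root, regions=("China", "International")):
    """Run the same function for each region; only the path differs."""
    data_root = Path(data_root)
    counts = {}
    for region in regions:
        # Assumed layout: <data_root>/<region>/*.csv -> <data_root>/combined/<region>.csv
        output_file = data_root / "combined" / f"{region}.csv"
        counts[region] = combine_csv_folder(data_root / region, output_file)
    return counts
```

From a notebook inside the repository this could be called as combine_all(Path("..") / "data"), so nothing depends on one person's home directory. With pandas, the body of combine_csv_folder could instead use pd.concat and to_csv as in the examples above.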