UCL / TLOmodel

Epidemiology modelling framework for the Thanzi la Onse project
https://www.tlomodel.org/
MIT License

Convert '.xlsx' ResourceFiles to `csv` #1337

Open tbhallett opened 6 months ago

tbhallett commented 6 months ago

We have some .xlsx ResourceFiles in use, but this makes comparison using git very cumbersome, and it is not necessary to use the Excel file format. It should be a straightforward task to convert the few .xlsx files into .csv format.

mnjowe commented 4 months ago

Hi @tbhallett. Do you want all resource files under the resources folder to be in .csv format?

tbhallett commented 4 months ago

> Hi @tbhallett. Do you want all resource files under the resources folder to be in .csv format?

Yes I think that would be really good if possible.

Where a file has multiple sheets and various things that are not used, we can drop the unused material (as we can archive the Excel file in Dropbox) and just replace pd.read_excel with pd.read_csv in the code.

Where multiple sheets in the Excel file are being used, we might need to create a neat little solution: for instance, putting multiple csv files into a folder, and then making a utility function so that reading these in behaves the same as pd.read_excel() did: i.e., if sheet_name=None, return a dict of pd.DataFrames for all the sheets, and otherwise provide a pd.DataFrame of just the target sheet.
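For illustration, the proposed sheet_name semantics could look roughly like this (a stdlib-only sketch; the function name is hypothetical and lists of dicts stand in for pandas DataFrames, so this is not the eventual implementation):

```python
import csv
from pathlib import Path


def read_csv_folder(folder: Path, sheet_name=None):
    """Mimic pd.read_excel's sheet_name behaviour over a folder of CSVs:
    sheet_name=None returns a dict of all "sheets"; otherwise just one."""
    def _read(path: Path):
        with path.open(newline="") as f:
            return list(csv.DictReader(f))

    if sheet_name is None:
        # One entry per CSV file, keyed by file stem (the former sheet name)
        return {p.stem: _read(p) for p in sorted(folder.glob("*.csv"))}
    return _read(folder / f"{sheet_name}.csv")
```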

Let's tag @tamuri and @matt-graham for their thoughts and in case there is some ready-made solution for this.

matt-graham commented 4 months ago

> Where multiple sheets in the Excel file are being used, we might need to create a neat little solution: for instance, putting multiple csv files into a folder, and then making a utility function so that reading these in behaves the same as pd.read_excel() did: i.e., if sheet_name=None, return a dict of pd.DataFrames for all the sheets, and otherwise provide a pd.DataFrame of just the target sheet.

This sounds like a great solution to me. Mapping an Excel file with multiple in-use sheets to a directory of CSV files should keep the current functional grouping of related resource file sheets apparent, while removing the need to have Excel files. And, as @tbhallett suggests, having a helper function which takes a path to a directory and returns a dictionary mapping from CSV file name to a dataframe should mean the changes to code currently using read_excel are minimal.

In terms of automating this, I think using Pandas to deal with the conversion from Excel to CSV would work. Something like (this is untested!):

for excel_file_path in resource_file_path.rglob("*.xlsx"):
    sheet_dataframes = pd.read_excel(excel_file_path, sheet_name=None)
    excel_file_directory = excel_file_path.with_suffix("")
    # Create a container directory for per-sheet CSVs
    if excel_file_directory.exists():
        print(f"Directory {excel_file_directory} already exists")
    else:
        excel_file_directory.mkdir()
    # Write a CSV for each worksheet (without the dataframe index column)
    for sheet_name, dataframe in sheet_dataframes.items():
        dataframe.to_csv(excel_file_directory / f"{sheet_name}.csv", index=False)
    # Remove the no-longer-needed Excel file
    excel_file_path.unlink()

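A side note on the snippet above (the path here is hypothetical, for illustration only): the xlsx-to-directory mapping hinges on `Path.with_suffix("")`, which drops the final suffix and so turns the Excel file path into a sibling directory path of the same name:

```python
from pathlib import Path

excel_path = Path("resources/ResourceFile_Example.xlsx")  # hypothetical path
directory = excel_path.with_suffix("")
# with_suffix("") strips the last suffix, so each sheet can then be
# written out as resources/ResourceFile_Example/<sheet_name>.csv
```
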
mnjowe commented 4 months ago

Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time

tbhallett commented 4 months ago

> Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time

Yes please @mnjowe -- that would be brilliant

mnjowe commented 4 months ago

Great!

mnjowe commented 4 months ago

Hi both. To get started on this issue, I've written the function below and have already tested it in lifestyle. My question is: which module should house this helper function? It will be needed in the read-parameters section of all modules still using Excel files. @matt-graham thanks for the above code (it didn't need many modifications to do its job). All Excel files are now turned into folders with one or more .csv files. Thanks

    from typing import Optional

    def read_csv_files(resource_filepath: Path, file_name: Optional[list[str]] = None) -> dict[str, DataFrame]:
        """Read csv files in a similar way to how pandas reads Excel files (pd.read_excel).

        :param resource_filepath: path to the resource file folder
        :param file_name: preferred csv file name(s); these correspond to sheet names in the Excel file
        :return: a dict mapping file base names to dataframes
        """
        all_data: dict[str, DataFrame] = {}

        if file_name is not None:
            for f_name in file_name:
                all_data[f_name] = pd.read_csv(resource_filepath / f"{f_name}.csv")
        else:
            for csv_path in resource_filepath.rglob("*.csv"):
                # Use the path's stem (base name without extension) so this works on any OS
                all_data[csv_path.stem] = pd.read_csv(csv_path)

        return all_data

tbhallett commented 4 months ago

> Hi both. To get started on this issue, I've written the function below and have already tested it in lifestyle. My question is: which module should house this helper function? It will be needed in the read-parameters section of all modules still using Excel files. @matt-graham thanks for the above code (it didn't need many modifications to do its job). All Excel files are now turned into folders with one or more .csv files. Thanks
>
>     from typing import Optional
>
>     def read_csv_files(resource_filepath: Path, file_name: Optional[list[str]] = None) -> dict[str, DataFrame]:
>         """Read csv files in a similar way to how pandas reads Excel files (pd.read_excel).
>
>         :param resource_filepath: path to the resource file folder
>         :param file_name: preferred csv file name(s); these correspond to sheet names in the Excel file
>         :return: a dict mapping file base names to dataframes
>         """
>         all_data: dict[str, DataFrame] = {}
>
>         if file_name is not None:
>             for f_name in file_name:
>                 all_data[f_name] = pd.read_csv(resource_filepath / f"{f_name}.csv")
>         else:
>             for csv_path in resource_filepath.rglob("*.csv"):
>                 # Use the path's stem (base name without extension) so this works on any OS
>                 all_data[csv_path.stem] = pd.read_csv(csv_path)
>
>         return all_data

How about making this functionality part of the Module class? The Module's read_parameters() method is where this is always used, and it's the same kind of thing as load_parameters_from_dataframe(), which is a member function.

mnjowe commented 4 months ago

That's a good idea. I will add it to the Module class, implement it in the lifestyle read_parameters() section, and open a new PR for further discussion. Thanks

tamuri commented 4 months ago

I'd lean towards making it a utility function in util.py rather than in Module. It feels like a general-purpose util: imagine wanting to load the files outside of a module, in an analysis file for example.

mnjowe commented 3 months ago

Good point Asif, thanks. Indeed, scenarios like these will require us to think carefully about the location of this function. @tbhallett I will defer to you for a final decision on this.

tbhallett commented 3 months ago

Happy to go with Asif's suggestion! (We could always put a shortcut to it in the module, for convenience.)
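One way the "shortcut" could look (all names below are illustrative stand-ins, not the actual TLOmodel API): the helper lives in util.py as a plain function, and Module exposes a thin wrapper that delegates to it, so disease modules can still call it as a member inside read_parameters().

```python
def read_csv_files(resource_filepath, file_names=None):
    """Stand-in for the general-purpose helper that would live in util.py
    (the real version would return a dict of pandas DataFrames)."""
    return {"called_with": (resource_filepath, file_names)}


class Module:
    """Minimal stand-in for the framework's Module base class."""

    @staticmethod
    def read_csv_files(resource_filepath, file_names=None):
        # Thin convenience shortcut: delegate to the utility function so
        # both call sites share a single implementation.
        return read_csv_files(resource_filepath, file_names)
```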

mnjowe commented 3 months ago

Great!

mnjowe commented 3 months ago

Hi all. I have created a draft PR here where we can continue the discussion on this issue. I've started the implementation in lifestyle and simplified birth. Could you please look at this initial stage of the implementation and provide feedback, if any, before I move on to implementing the read-csv-files method in the rest of the disease modules? Thanks