Open tbhallett opened 6 months ago
Hi @tbhallett . Do you want all resource files under resources folder to be of .csv format?
Hi @tbhallett . Do you want all resource files under resources folder to be of .csv format?
Yes I think that would be really good if possible.
Where a sheet has multiple sheets and various things that are not used, we can drop the unused stuff (as we can archive the excel file in Dropbox), and just replace in the code pd.read_excel for pd.read_csv.
Where multiple sheets in the excel file are being used, we might need to create a neat little solution - for instance putting multiple csv files into a folder, and then making a utility function so that reading this in behaves the same as pd.read_excel() did: i.e, if sheet_name=None, return a dict of pd.DateFrames for all the sheets, and otherwise provide a pd.DateFrame of just the target sheet.
Let's tag @tamuri and @matt-graham for their thoughts and in case there is some ready-made solution for this.
Where multiple sheets in the excel file are being used, we might need to create a neat little solution - for instance putting multiple csv files into a folder, and then making a utility function so that reading this in behaves the same as pd.read_excel() did: i.e, if sheet_name=None, return a dict of pd.DateFrames for all the sheets, and otherwise provide a pd.DateFrame of just the target sheet.
This sounds like a great solution to me - mapping an Excel file with multiple (in use) sheets to a directory of CSV files should keep the current functional grouping of related resource file sheets apparent, while removing the need to have Excel files, and as @tbhallett suggests having a helper function which takes a path to a directory and returns a dictionary mapping from CSV file name to a dataframe should mean the changes in code where currently using read_excel
should be minimal.
In terms of automating doing this, I think using Pandas to deal with conversion from Excel to CSV would work - something like (this is untested!)
for excel_file_path in resource_file_path.rglob("*.xslx"):
sheet_dataframes = pd.read_excel(excel_file_path, sheet_name=None)
excel_file_directory = excel_file_path.with_suffix("")
# Create a container directory for per sheet CSVs
if excel_file_directory.exists():
print(f"Directory {excel_file_directory} already exists")
else:
excel_file_directory.mkdir()
# Write a CSV for each worksheet
for sheet_name, dataframe in sheet_dataframes.items():
dataframe.to_csv(excel_file_directory / sheet_name + ".csv")
# Remove no longer needed Excel file
excel_file_path.unlink()
Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time
Thanks Tim and Matt for the clarification on this. @tbhallett do you want me to start working on this issue? I have some time
Yes please @mnjowe -- that would be brilliant
Great!
Hi both. In order to get started with this issue, I've written the below function and have tested it already in lifestyle. My question is which is the right module to house this helper function as it will be needed in the read parameters section of all modules still using excel files. @matt-graham thanks for the above code(it didn't need a lot of modifications to do its job). all excel files are now turned into folders with one or multiple .csv files. Thanks
def read_csv_files(resource_filepath: Path = None, file_name: list[str] = None) -> dict[str, DataFrame]:
"""
A function to read csv files in a similar way pandas read Excel files(pd.read_excel).
:param resource_filepath: path to resource file folder
:param file_name: preferred csv file name(s). This is the same as sheet names in Excel file
"""
all_data = {}
if file_name is not None:
for f_name in file_name:
df = pd.read_csv(f'{resource_filepath}/{f_name}.csv')
all_data[f_name] = df
else:
for f_name in resource_filepath.rglob("*.csv"):
file_base_name = str(f_name).split('/')[-1].split('.')[0] # Extract base name without extension
df = pd.read_csv(f_name)
all_data[file_base_name] = df
return all_data
Hi both. In order to get started with this issue, I've written the below function and have tested it already in lifestyle. My question is which is the right module to house this helper function as it will be needed in the read parameters section of all modules still using excel files. @matt-graham thanks for the above code(it didn't need a lot of modifications to do its job). all excel files are now turned into folders with one or multiple .csv files. Thanks
def read_csv_files(resource_filepath: Path = None, file_name: list[str] = None) -> dict[str, DataFrame]: """ A function to read csv files in a similar way pandas read Excel files(pd.read_excel). :param resource_filepath: path to resource file folder :param file_name: preferred csv file name(s). This is the same as sheet names in Excel file """ all_data = {} if file_name is not None: for f_name in file_name: df = pd.read_csv(f'{resource_filepath}/{f_name}.csv') all_data[f_name] = df else: for f_name in resource_filepath.rglob("*.csv"): file_base_name = str(f_name).split('/')[-1].split('.')[0] # Extract base name without extension df = pd.read_csv(f_name) all_data[file_base_name] = df return all_data
How about making this functionality a part of the Module
class? The Module's read_parameters()
method is where this is always used, and It's the same kind of thing as load_parameters_from_dataframe()
which is a member function.
That's a good idea. I will add it to Module
class, implement it in lifestyle read_parameters( )
section and open a new PR for further discussions. Thanks
I'd lean towards making it a utility function in util.py
rather than Module. It feel likes a general purpose util - imagine wanting to load the files outside of a module, in an analysis file for example.
Good point Asif, thanks. Indeed, scenarios as these will require us to think twice on the location of this function. @tbhallett I will defer to you for a final decision on this.
Happy to go with Asif's suggestion! (We could always put a shortcut to it from the module, for convenience).
Great!
Hi all. I have created a draft PR here where we can continue with our discussion on this issue. I've started implementation in lifestyle and simplified birth. May you please to look at this initial stage of implementation and provide feedback if any before I move on to implementing the read csv files method to the rest of the disease modules? Thanks
We have some .xlsx ResourceFiles in use, but this makes comparison using
git
very cumbersome and it's not not necessary to use the Excel file format. It should be a straight forward task to convert the few.xlsx
into.csv
format.