Open dlersch opened 7 months ago
Created a new branch based on the common CSV parser branch #24. We need to decide on a CSV dataset to use for the example; I asked @dlersch for ideas. In the meantime, I will bring in a scaler module from the exp_hall repo and make sure it has unit tests.
@sgoldenCS and I had a fruitful discussion about a possible data set that is simple enough to analyze but still highlights the functionality of the workflow / DS framework. We came up with an NP-inspired classification problem: the identification of two species, each characterized by three variables. The abundance of the two species is asymmetric, i.e. species 1 is statistically dominant over species 0. A plot of the corresponding distributions is shown below.
The data is 100% synthetic, so we do not need to worry about ownership rights. The classification problem is set up so that it fits nicely into the narrative of HUGS, but it has no direct ties to NP. The data is spread over 4 .csv files so that we can use @sgoldenCS's CSVParser right away. We might come up with a more challenging data set later, but for now we will stick to this one so that we can test and run the full workflow.
The data and the corresponding script for data generation are (for now) available on the ifarm:
/w/data_science-sciwork18/hugs24/example_data_hugs24
Each .csv file is ~18 MB, so we do not want to store them here on GitHub.
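For anyone without ifarm access, here is a minimal sketch of how a data set of this shape could be produced and read back. It is not the generation script mentioned above: the file names, column names (`var1`, `var2`, `var3`, `label`), the Gaussian parameters, the 80/20 class split and the event count are all assumptions chosen for illustration, and it uses only the Python standard library rather than the CSVParser.

```python
import csv
import random
import tempfile
from pathlib import Path

# Everything below is an illustrative assumption, not the real generation setup.
N_FILES = 4              # the data set is spread over 4 .csv files
EVENTS_PER_FILE = 1000   # assumed; the real files are much larger
P_SPECIES_1 = 0.8        # assumed fraction of species 1 (the dominant species)

# Assumed (mean, sigma) of a Gaussian for each of the three variables, per species.
SPECIES_PARAMS = {
    0: [(0.0, 1.0), (1.0, 0.5), (-1.0, 2.0)],
    1: [(1.5, 1.0), (0.0, 0.5), (1.0, 2.0)],
}


def generate_event(rng):
    """Draw one event: three variables followed by the species label."""
    label = 1 if rng.random() < P_SPECIES_1 else 0
    features = [rng.gauss(mu, sigma) for mu, sigma in SPECIES_PARAMS[label]]
    return features + [label]


def write_dataset(out_dir, seed=42):
    """Write N_FILES csv files into out_dir and return their paths."""
    rng = random.Random(seed)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(N_FILES):
        path = out_dir / f"species_data_{i}.csv"  # hypothetical file name
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["var1", "var2", "var3", "label"])
            for _ in range(EVENTS_PER_FILE):
                writer.writerow(generate_event(rng))
        paths.append(path)
    return paths


def read_dataset(paths):
    """Read all csv files and return the rows as a list of dicts (string values)."""
    rows = []
    for path in paths:
        with Path(path).open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rows = read_dataset(write_dataset(tmp))
        n_species_1 = sum(row["label"] == "1" for row in rows)
        print(f"{len(rows)} events, {n_species_1} of species 1")
```

Fixing the seed keeps the files reproducible, which would let the data be regenerated from the script instead of being stored in the repository.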
The model module is done apart from its unit tests. I will be adding an analysis module in the branch linked to this issue, since it is the final step towards completion. I have pulled the changes from the model branch and main, so the branch is fully up to date before I add the analysis module. I will complete the model unit tests once I have an implementation of the analysis module for the GSPDA workshop, since the workshop is tomorrow.
1.) We need a fully functional example workflow for the HUGS tutorial. The workflow needs to have:
2.) For each module there has to be a proof of:
3.) A good practice is to capture code development, updates, or any other work in the issues. For example: "Started to implement module XYZ. Faced a problem with such and such. Going to pause and run a quick literature search". This helps to keep everything transparent. Ideally, an issue tells the entire story of the work that has been done.
4.) The issues for each module should be linked to this one. For example, if we decide to use the CSVToPandasParser, then we should link the corresponding issue here. The same goes for wiki pages.
5.) For the sake of efficiency and time management, we should follow KIS (Keep It Simple) in code development.