
1. Change in simulation_coordinator.csv

Added a subset column that assigns each site to a subset.

2. Change in load_input() and related code

Updated load_input() and related code to load the subset information and pass it to downstream steps.
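As a rough illustration of how the subset column might be read and handed to later steps, here is a minimal sketch. The column names `site` and `subset`, the sample rows, and the helper `group_sites_by_subset` are assumptions made for this example; they are not the actual load_input() code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for simulation_coordinator.csv; the real file has other columns.
SAMPLE_CSV = """site,subset
site_a,core_relationship
site_b,core_relationship
site_c,infection_duration
"""


def group_sites_by_subset(csv_file):
    """Map each lower-cased subset name to the list of sites assigned to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["subset"].strip().lower()].append(row["site"].strip())
    return dict(groups)


subset_to_sites = group_sites_by_subset(io.StringIO(SAMPLE_CSV))
print(subset_to_sites)
```

Lower-casing the subset names at load time is one way to keep later lookups case-insensitive.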

3. Change in snakemake execution

Updated the snakemake rules so that sites can be run by subset. The default value for subset is "All", which runs all available sites. Users can override it at run time:

snakemake --config s='core_relationship, infection_duration' -j

Note that subset names are case-insensitive.
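The override above passes the subsets as one comma-separated string. A minimal sketch of how such a value could be normalized is shown below; `parse_subsets` is a hypothetical helper, and the plain dict stands in for the `config` object that snakemake fills from --config.

```python
def parse_subsets(config, default="All"):
    """Split the comma-separated `s` config value into lower-cased subset names."""
    raw = str(config.get("s", default))
    names = [name.strip().lower() for name in raw.split(",") if name.strip()]
    # Fall back to the default when the value is empty or only separators.
    return names or [default.lower()]


# Hypothetical config dicts, mimicking what --config s='...' would produce.
print(parse_subsets({"s": "core_relationship, Infection_Duration"}))
print(parse_subsets({}))
```

Stripping whitespace and lower-casing each name is what makes 'Core_Relationship' and 'core_relationship' equivalent in this sketch.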

4. Change in plotting steps

The run_generate_validation_comparisons_site.py script now takes an argument, --subset or -s, for one or more subsets. For the core_relationship subset, the following plotting functions are called:

generate_age_incidence_outputs()
generate_age_prevalence_outputs()
generate_parasite_density_outputs()
generate_infectiousness_outputs()

For the infection_duration subset, the following plotting function is called: generate_age_infection_duration_outputs()
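The subset-to-plot mapping described above could be sketched as follows. The plotting functions are replaced by stubs that only return their own names, and the argument handling (space-separated values via `nargs="+"`, a default of "All") is an assumption for illustration; the real script may parse the argument differently.

```python
import argparse


def _stub(name):
    """Build a placeholder that stands in for the real plotting function."""
    def run():
        return name
    return run


# Mapping taken from the description above; the callables are stubs.
PLOTS_BY_SUBSET = {
    "core_relationship": [
        _stub("generate_age_incidence_outputs"),
        _stub("generate_age_prevalence_outputs"),
        _stub("generate_parasite_density_outputs"),
        _stub("generate_infectiousness_outputs"),
    ],
    "infection_duration": [_stub("generate_age_infection_duration_outputs")],
}


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed interface: one or more subset names separated by spaces.
    parser.add_argument("--subset", "-s", nargs="+", default=["All"])
    return parser


def select_plots(subsets):
    """Return the plotting callables for the requested subsets (case-insensitive)."""
    wanted = {s.lower() for s in subsets}
    if "all" in wanted:
        wanted = set(PLOTS_BY_SUBSET)
    return [plot for name, plots in PLOTS_BY_SUBSET.items() if name in wanted for plot in plots]


args = build_parser().parse_args(["-s", "infection_duration"])
print([plot() for plot in select_plots(args.subset)])
```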

5. Change in readme.md

Updated the instructions to mention only the develop installation. Installation as a package is not working right now; see #54 for the existing issue, which depends on work item #55.
Removed the option to run the script directly.
Added instructions for running the workflow for one or multiple subsets.

6. Change in reporting:

In progress: the plan is to add one more level for the subset in the document. Left is the old report structure, right is the new structure:

I also updated how we define the document content. I created a class Section to replace the nested dictionary structure we used to have:

class Section:
    def __init__(self, pdf: PDF, section_title: str, section_number: int = 1, content: dict = None, level: int = 0,
                 subsection: list = None):
        """
        Define a section object for each section in the report
        Args:
            pdf (PDF):                      A PDF object
            section_title (str):            Title of the current section
            section_number (int):           Number of current section, which starts from 1 at the beginning of the
                                            document. If the current section is a sub-section of another section,
                                            the section number starts from 1 at the sub-section level.
            content (dict):                 A dictionary whose keys are the subtitles and whose values are lists
                                            of the form:
                                            [section_text: str, image_list: list, table_name: str]
            level (int):                    Level of the current section in the document outline. Ranges from 0 to 2,
                                            where 0 is the top level.
            subsection (list[Section]):     A list of Section objects, if this section contains lower-level section(s).
        """

I think the new structure is easier to read than the old one, since we now have one more level (the subset level).

I also updated the code to work with this new Section class.
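To make the nesting concrete, here is a simplified sketch of how sections might be composed with a subset level on top. `SectionSketch` is a hypothetical stand-in that keeps only the fields from the docstring above; it omits the PDF object and all rendering, and the section titles and the `outline` method are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SectionSketch:
    """Simplified stand-in for Section (no PDF object, no rendering)."""
    section_title: str
    section_number: int = 1
    content: Dict[str, list] = field(default_factory=dict)
    level: int = 0
    subsection: List["SectionSketch"] = field(default_factory=list)

    def outline(self, prefix=""):
        """Return indented outline lines for this section and its sub-sections."""
        number = f"{prefix}{self.section_number}"
        lines = [f"{'  ' * self.level}{number} {self.section_title}"]
        for child in self.subsection:
            lines.extend(child.outline(prefix=f"{number}."))
        return lines


# Subset at the top level, with hypothetical per-plot sections beneath it.
report = SectionSketch("core_relationship", 1, subsection=[
    SectionSketch("Age-incidence", 1, level=1),
    SectionSketch("Age-prevalence", 2, level=1),
])
print("\n".join(report.outline()))
```

Compared with a nested dictionary, each level here carries its own title, number and children, so adding the subset level is just one more layer of objects.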

7. Change in output folders for analyzer result files?:

Should we clean up the output folder when we rerun the download steps?

8. Other changes?:

Right now, if a dataframe is empty, the code raises an attribute error saying that "Infection" (for example) is not found, which is not an informative error. In this PR I updated the code to skip plotting when a site's result is not available. Should we raise a more informative error in such cases?
We need to make a similar decision and change in the reporting code for when a table or image file is missing. In this PR, I updated the code to ignore the table/image if it is not found and print a warning message. Do we want to throw an exception and let the workflow fail in such cases?
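One way to keep both options open is a `strict` switch that chooses between warning and failing. The sketch below is only an illustration of that idea: `MissingResultError`, `check_site_result`, `existing_files` and the `strict` flag are hypothetical names, not code from this PR. `len(rows)` also works on a pandas DataFrame, where it gives the number of rows.

```python
import warnings
from pathlib import Path


class MissingResultError(RuntimeError):
    """Hypothetical error type for a site that has no analyzer results."""


def check_site_result(site, rows, strict=False):
    """Return True if `rows` has data; otherwise warn, or raise when strict."""
    if len(rows) > 0:
        return True
    message = f"No analyzer results found for site '{site}'; skipping its plots."
    if strict:
        raise MissingResultError(message)
    warnings.warn(message)
    return False


def existing_files(paths, strict=False):
    """Keep only the table/image paths that exist; warn or raise for the rest."""
    found = []
    for path in map(Path, paths):
        if path.is_file():
            found.append(path)
        elif strict:
            raise FileNotFoundError(f"Report input not found: {path}")
        else:
            warnings.warn(f"Report input not found, ignoring: {path}")
    return found


print(check_site_result("site_a", [{"Infection": 1}]))
```

With `strict=False` the workflow continues and reports what was skipped; with `strict=True` it stops at the first missing result with a message that names the site or file.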

InstituteforDiseaseModeling / malaria-model_validation

Modular structure #53
