Currently, the DNAm QC pipeline is highly interdependent and non-orthogonal in design. Many of the bugs identified over the past year or so trace back to this problem.
Many parts of the pipeline depend on the successful completion of other parts. As a result, a seemingly harmless change to one section of code often causes problems in a different, seemingly unrelated part of the pipeline.
Therefore, it would be beneficial to properly lay out what each part of the pipeline requires and what it outputs/contributes. Doing this will allow us to disentangle and compartmentalise each section of the pipeline, making it easier to understand and therefore easier to contribute to in the future (without fear of breaking everything).
Steps to completion
1) An agreement on what the sections of the DNAm QC pipeline actually are. The existing blocks and headings provide a solid start (and may be all we need), but it would be useful to have this written down. See #259.
2) A set of 'contracts' (see design by contract) detailing the requirements and expectations of each section of the pipeline. Completing this step resolves issue #259.
a) If any contract feels exceptionally long/complex, this implies that some abstraction or further compartmentalisation is required.
3) Using these contracts, create a UML diagram (or similar) to help determine which processes/steps of the pipeline each section actually depends on.
a) Add this UML diagram (or similar) to the documentation to help users understand what the pipeline actually does (Mermaid may be useful for this). This step is given explicitly in #258 and should be completed separately from the other steps listed.
4) Reflect the diagram from 3) in the actual code. Ensure that each section receives the information it requires and outputs the information it is obligated to (ideally with no side effects), i.e. each section adheres to its proposed contract.
a) It may be useful to move some processing to separate files, either R Markdown child documents or new R scripts. This can help with navigation around the code base (though this might be better suited to a follow-up issue).
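
To make steps 2) to 4) more concrete, here is a minimal sketch of how contracts could be written down, checked, and turned into a Mermaid diagram. It is written in Python purely for illustration (the pipeline itself is R), and every name in it is an assumption: the `Contract` class, the helper functions, and the section and object names (`load_data`, `sample_qc`, `raw_betas`, etc.) are hypothetical and do not come from the actual pipeline.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Contract:
    """What one pipeline section needs and what it promises to produce."""
    name: str
    requires: frozenset = field(default_factory=frozenset)
    provides: frozenset = field(default_factory=frozenset)


def unmet_requirements(contracts):
    """Walk the sections in order; report inputs that no earlier section provides."""
    available = set()
    problems = {}
    for c in contracts:
        missing = c.requires - available
        if missing:
            problems[c.name] = sorted(missing)
        available |= c.provides
    return problems


def to_mermaid(contracts):
    """Emit Mermaid flowchart text with one labelled edge per producer/consumer link."""
    # If two sections provide the same object, the later one wins here.
    producer = {obj: c.name for c in contracts for obj in c.provides}
    lines = ["flowchart TD"]
    for c in contracts:
        for obj in sorted(c.requires):
            if obj in producer:
                lines.append(f"    {producer[obj]} -->|{obj}| {c.name}")
    return "\n".join(lines)


# Hypothetical sections and object names, for illustration only.
pipeline = [
    Contract("load_data", provides=frozenset({"raw_betas", "sample_sheet"})),
    Contract("sample_qc",
             requires=frozenset({"raw_betas", "sample_sheet"}),
             provides=frozenset({"qc_metrics"})),
    Contract("normalisation",
             requires=frozenset({"raw_betas", "qc_metrics"}),
             provides=frozenset({"normalised_betas"})),
]

print(unmet_requirements(pipeline))  # empty dict when every requirement is met
print(to_mermaid(pipeline))
```

The point of the sketch is that once contracts exist as data, the dependency diagram from step 3) can be generated rather than drawn by hand, and a check like `unmet_requirements` can flag a section whose inputs are not produced by anything upstream. A contract with a very long `requires` or `provides` set would be the signal described in 2a) that the section needs splitting.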
Path to affected file
Type of refactor
Code understandability improvement, Code readability improvement