ScottyB opened this issue 2 years ago
**Does commented-out code impact readability?**
- Very precise research question; gather evidence to support or reject it
- Conflict of code and different parameters, etc.
- Look at the evolution of projects; focus on ML and its evolution
- Find what it depends on
- What do engineers think of this? Are they interested in the outcome?
**Code smells as a subject**
- ML smells need to be defined (data or model); a smell could be an actual defect to be dealt with
- Map code smells to concrete definitions
- Empirical study to measure readability, which can affect deployment and production
- Do ML smells have an impact on the real world?
- Something could only be a smell because of the user (a smell for an ML engineer but not for a statistician; this could be an education problem)
- RQ: how prevalent is this issue in projects?
Smells are an indication and not concrete; they are subjective and romantic. Tie smells to productivity.
**Do code smells impact review time?**
- Needs a definition of code smells, and a catalogue
- Assumptions:
  - Developers know what code smells are
  - Developers care about them at review time
  - Developers disagree on their severity/importance
**Evolution of code smells**
- Which ones can be ignored? Identify code smells first and then evaluate them.
- Tie code smells to:
  - Reproducibility
  - Performance
  - Other *ilities (their definition depends on stakeholders and is hard to make concrete)
Commented-out code is indicative of versioning on top of version control (not just a data-science habit). It could be addressed through education, since it stems from people's workstyle of experimentation. When reading code that contains commented-out code, the core question is: does commented-out code impact the readability of the source code? (A detection sketch follows.)
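One way to study prevalence would be to flag commented-out code automatically. Below is a minimal heuristic sketch, assuming a comment counts as commented-out code if its text parses as Python; the function name and filtering rule are my own choices, not from the discussion.

```python
import ast
import io
import tokenize

def commented_out_code(source: str) -> list[tuple[int, str]]:
    """Flag comments whose text parses as Python: a rough proxy for commented-out code."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.COMMENT:
            continue
        text = tok.string.lstrip("#").strip()
        if not text:
            continue
        try:
            tree = ast.parse(text)
        except SyntaxError:
            continue  # ordinary prose comment
        # Bare names like "# TODO" also parse, so require something statement-like.
        if not all(isinstance(n, ast.Expr) and isinstance(n.value, ast.Name)
                   for n in tree.body):
            hits.append((tok.start[0], tok.string))
    return hits

print(commented_out_code("x = 1\n# y = x * 2\n# needs review\n"))
# -> [(2, '# y = x * 2')]
```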
Code smells are the personal preference of the people who coined these phrases; they are mere guidelines.
Small companies don’t care about code smells (experiential)
Code smells should be avoided as a research focus (low impact and unreliable).
Data versioning tool: for large datasets and experiments, approached from a storage perspective.
**Motivation:** data storage is a problem for cloud service providers. There is redundancy between versions of the data, and you shouldn't be storing all the features. Instead:
- Store code transformations rather than datasets (see the sketch below)
- Use efficient caching
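The storage idea above could be prototyped as lineage: a version is recorded as a parent plus a transformation, materialised on demand and cached rather than stored as a full copy. A minimal sketch under that assumption (class and method names are hypothetical):

```python
import pandas as pd

class TransformStore:
    """Version datasets as (parent, transformation) pairs instead of full copies."""

    def __init__(self, base: pd.DataFrame):
        self.base = base
        self.lineage = {}  # version -> (parent_version, transform_fn)
        self.cache = {}    # version -> materialised DataFrame

    def commit(self, version, parent, transform):
        self.lineage[version] = (parent, transform)

    def materialise(self, version) -> pd.DataFrame:
        if version in self.cache:
            return self.cache[version]
        parent, transform = self.lineage[version]
        source = self.base if parent is None else self.materialise(parent)
        result = transform(source)
        self.cache[version] = result  # efficient caching: repeated checkouts stay cheap
        return result

raw = pd.DataFrame({"age": [21, 35, 58], "income": [30_000, 52_000, 71_000]})
store = TransformStore(raw)
store.commit("v1", None, lambda df: df[df["age"] >= 30])                      # filter rows
store.commit("v2", "v1", lambda df: df.assign(income_k=df["income"] / 1000))  # derive feature
print(store.materialise("v2"))  # only the code for v1/v2 is stored, not data copies
```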
Think about the scale of data storage. Is the transformed data the real challenge in storing versions of the data?
Lint data with diffs applied at the smell level: in the context of ML, data smells and code smells are both impactful. (A diff-linting sketch follows.)
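A minimal sketch of what diff-level data linting might look like, checking two illustrative smells between dataset versions (schema drift and rising missingness); the choice of smells and the function name are mine, not from the notes.

```python
import pandas as pd

def lint_data_diff(old: pd.DataFrame, new: pd.DataFrame) -> list[str]:
    """Report data smells introduced between two versions of a dataset."""
    warnings = []
    # Smell 1: schema drift -- columns added or dropped between versions.
    added = set(new.columns) - set(old.columns)
    dropped = set(old.columns) - set(new.columns)
    if added:
        warnings.append(f"schema drift: new columns {sorted(added)}")
    if dropped:
        warnings.append(f"schema drift: dropped columns {sorted(dropped)}")
    # Smell 2: missingness regression -- a shared column gained null values.
    for col in sorted(set(old.columns) & set(new.columns)):
        before, after = old[col].isna().mean(), new[col].isna().mean()
        if after > before:
            warnings.append(f"missingness up in '{col}': {before:.0%} -> {after:.0%}")
    return warnings

v1 = pd.DataFrame({"age": [21, 35], "city": ["A", "B"]})
v2 = pd.DataFrame({"age": [21, None], "zip": ["1", "2"]})
print(lint_data_diff(v1, v2))
# -> schema drift for 'zip'/'city', plus "missingness up in 'age': 0% -> 50%"
```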
Identify smells that change the behaviour of ML systems. There is a need to define what an ML smell is; this is different from a code smell. Thus, ML smells are (a) technical debt, (b) actual defects, and (c) concrete. (An example of a concrete ML smell follows.)
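To make "actual defect" concrete, here is a minimal example of one well-known ML smell, preprocessing leakage, where a scaler is fitted on the full dataset before the train/test split. The example is illustrative, not taken from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# ML smell (and defect): the scaler sees test rows, so test-set statistics
# leak into training and evaluation becomes optimistically biased.
X_leaky = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, random_state=0)

# Fix: split first, then fit the scaler on the training split only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```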
Guidelines for dealing with technical debt ignore commercial reality and focus on the ideal. This ties in with the idea of context.
-> Case study of ML in small clients/companies (does related work exist for this?). Efficiency is vital; tool support is key to realising these solutions in organisations.
Is static analysis sufficient for ML smells? An interactive environment would need to cover both the development phase and the deployment phase. (A static-check sketch follows.)
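A sketch of what a static ML-smell check could look like, using one smell that is easy to detect from the AST alone: calls to train_test_split without a fixed random_state, which hurts reproducibility. The choice of smell is illustrative; many ML smells are data-dependent and would escape static analysis, which is exactly the open question above.

```python
import ast

def find_unseeded_splits(source: str) -> list[int]:
    """Return line numbers of train_test_split calls without random_state."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        fn = node.func
        name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", None)
        if name != "train_test_split":
            continue
        if not any(kw.arg == "random_state" for kw in node.keywords):
            smells.append(node.lineno)
    return smells

code = ("from sklearn.model_selection import train_test_split\n"
        "X_train, X_test = train_test_split(X)\n")
print(find_unseeded_splits(code))  # -> [2]
```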
CI/CD for ML (cf. Thoughtworks' Continuous Delivery for Machine Learning, CD4ML)
Start with the upstream process at the data level rather than with modelling.
**Conclusions:**
- Code smells should be avoided as a research focus (low impact and unreliable)
- Focus on the messy upstream process at data collection rather than on modelling
- There is a need to define what an ML smell is
- Does commented-out code impact the readability of the source code?
- Can diff algorithms be used for data versioning?
- Data quality issues should be easy to validate/verify, portable as data in their own right (see the sketch below)
- The barrier between data and code is still valid
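On the "portable as data" point: a minimal sketch where quality rules are themselves plain data (e.g., shippable as JSON alongside the dataset); the rule schema is hypothetical.

```python
import pandas as pd

# Rules are plain data, so they can travel with the dataset.
RULES = [
    {"column": "age", "check": "min", "value": 0},
    {"column": "age", "check": "max", "value": 120},
    {"column": "income", "check": "not_null"},
]

def validate(df: pd.DataFrame, rules: list[dict]) -> list[str]:
    """Apply portable, data-encoded quality rules to a DataFrame."""
    failures = []
    for r in rules:
        col = df[r["column"]]
        if r["check"] == "min" and (col < r["value"]).any():
            failures.append(f"{r['column']} below {r['value']}")
        elif r["check"] == "max" and (col > r["value"]).any():
            failures.append(f"{r['column']} above {r['value']}")
        elif r["check"] == "not_null" and col.isna().any():
            failures.append(f"{r['column']} contains nulls")
    return failures

df = pd.DataFrame({"age": [25, 130], "income": [40_000, None]})
print(validate(df, RULES))  # -> ['age above 120', 'income contains nulls']
```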