Implementation Guidelines
Writing Python Modules
[x] Do not use relative imports (except in __init__.py; we will discuss this in the Python package module).
[x] Don’t use from module import * (except in __init__.py; we will discuss this in the Python package module).
[x] Place executable scripts at the top level of your directory structure.
[x] Place all imports at the top of the module in the following order: native Python libraries, third-party packages, local modules.
[x] Add paths to the PYTHONPATH env variable (not via sys.path.append()).
[x] Use argparse to define command-line options and arguments.
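A minimal sketch of the import ordering and argparse items above. The option names (`--input`, `--verbose`) and the commented-out local module are hypothetical and only illustrate the layout.

```python
"""Hypothetical top-level script showing import order and argparse."""
# 1. Native Python (standard library) imports
import argparse

# 2. Third-party packages would go here (none are needed for this sketch)

# 3. Local modules would go here, imported absolutely, for example:
#    from src.clean import clean_data   (hypothetical module)


def build_parser() -> argparse.ArgumentParser:
    """Define the command-line options and arguments for the script."""
    parser = argparse.ArgumentParser(description="Run one pipeline step.")
    parser.add_argument("--input", required=True, help="Path to the input file.")
    parser.add_argument("--verbose", action="store_true", help="Enable DEBUG logging.")
    return parser


def main(argv=None) -> argparse.Namespace:
    """Parse arguments; argv=None makes argparse read sys.argv."""
    return build_parser().parse_args(argv)
```

Passing `argv` explicitly, as in `main(["--input", "data.csv"])`, keeps the parser easy to test without touching the real command line.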
Logging
[x] Use at least three levels of logging in your code.
[x] Use logging level DEBUG for things you only want to log while developing.
[x] Use logging level INFO for things you want to monitor while in production.
[x] Use logging level WARNING for things that would need attention in production.
[x] Use logging level ERROR for when an exception occurs.
[x] Configure your logging from logging configuration files to allow for easy switching between
development and production.
[x] Only configure your logger from the executing script.
[x] In modules, set up your logger as logger = logging.getLogger(__name__).
[x] Don’t commit .log files to git; add *.log to your .gitignore.
[x] Include timestamps in your logs.
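A minimal sketch of the logging items above. It assumes the configuration dictionary would normally be loaded from a configuration file (one for development, one for production); it is inlined here only so the example is self-contained. The `divide` function and the formatter name are hypothetical.

```python
import logging
import logging.config

# Assumption: in a real project this dictionary is loaded from a config file.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "timestamped": {
            # %(asctime)s adds the timestamp to every record
            "format": "%(asctime)s %(name)s %(levelname)s %(message)s",
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "timestamped",
        },
    },
    "root": {"level": "DEBUG", "handlers": ["console"]},
}

# In a module: create the logger, but do not configure it here.
logger = logging.getLogger(__name__)


def divide(numerator: float, denominator: float):
    """Return numerator / denominator, or None if the division fails."""
    logger.debug("Dividing %s by %s", numerator, denominator)
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        logger.error("Division by zero for numerator %s", numerator)
        return None
    logger.info("Division succeeded")
    return result


def configure_logging() -> None:
    """Call once, from the executing script only."""
    logging.config.dictConfig(LOGGING_CONFIG)
```

Switching between development and production then only requires pointing the executing script at a different configuration, for example one whose root level is INFO.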
Software Testing
[x] Do not write unit tests for any function that makes a network call (e.g. calls an API, queries a database).
[x] Tests do not have to be written for plotting functions.
[x] Place tests in tests/ folder.
[x] Structure the tests/ folder the same as the code you’re testing (e.g. same structure as src/).
[x] For a module, module_x.py, name the test file, test_module_x.py.
[x] For all tests related to a function(), name the test function test_function_<descriptor>, where <descriptor>
describes what is being tested.
[x] Create at least one “happy path” and one “unhappy path” test for each testable function for all class assignments
(this may not be sufficient in the real world).
[x] Use pytest for all class assignments.
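A minimal sketch of one "happy path" and one "unhappy path" test with pytest. The `normalize` function is hypothetical; in a real project it would live in a module such as `src/module_x.py` and the tests in `tests/test_module_x.py`.

```python
import pytest


def normalize(values):
    """Scale a list of numbers so they sum to 1 (hypothetical function under test)."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [value / total for value in values]


def test_normalize_sums_to_one():
    """Happy path: valid input is scaled as expected."""
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_normalize_zero_total_raises():
    """Unhappy path: input that cannot be normalized raises ValueError."""
    with pytest.raises(ValueError):
        normalize([0, 0])
```

Running `pytest` from the project root would collect both functions because their names start with `test_`.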
Exception Handling
[x] Exception handling should always be used when calling APIs, databases, and loading files.
[x] Exception handling should also be used to deal with unexpected input.
[x] Attempt to be as specific as possible for each possible exception type. Use the catch-all except: as a last resort for errors that cannot be anticipated.
[x] Thoroughly read the documentation of the libraries you use to understand what exceptions are raised.
[x] Avoid raise where possible in any calling code that will be deployed, since an unhandled exception will end the application.
[x] Use logging when possible instead of standard output like print statements.
[x] Limit the try clause to specific operations. For example, if you have to open a file and open a database connection, then you should have two try-except blocks, one for the file opening and one for the database connection.
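A minimal sketch of the file-and-database example above, with one narrow try-except block per operation, specific exception types, and logging instead of print. It assumes a SQLite database as a stand-in for whatever database the project uses; the function names are hypothetical.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def read_text(path: str):
    """Return the file contents, or None if the file cannot be read."""
    try:
        # Only the file operation sits inside this try clause.
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        logger.error("File not found: %s", path)
    except PermissionError:
        logger.error("No permission to read: %s", path)
    return None


def connect(database: str):
    """Return a sqlite3 connection, or None if connecting fails."""
    try:
        # A separate try clause for the database connection.
        return sqlite3.connect(database)
    except sqlite3.OperationalError as err:
        logger.error("Could not connect to %s: %s", database, err)
    return None
```

Because each function returns None on failure rather than re-raising, the calling code can decide how to continue without the application ending.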
Program Design and QA
[x] There should be NO hard-coded values in any code.
[x] Credentials and usernames should be provided as environment variables and never committed to version control.
[x] Variable names should be meaningful.
[x] Code should be structured for testability and debugging with modular functions and modules.
[x] Functions should do only one thing and be appropriately sized (unless they are orchestration functions, tying together
multiple functions to do a larger task).
[x] Existing libraries should be leveraged where possible (rather than implementing functionality yourself).
[x] Code should be PEP8-compliant.
[x] Docstrings should be PEP257-compliant.
QA should explicitly address the following about the code being reviewed:
[x] Functionality
[x] Readability
[x] Design
[x] Testing
[x] Documentation
Creating and distributing Python packages
[x] Provide instructions for how to build your package in its README
[x] When publishing to an artifact repository like PyPI:
[x] Try to provide a built distribution
[x] Do provide a source distribution
Architectural Considerations
[x] “Offline” inference means you will make predictions for a pre-determined set of possible input combinations and store them in your database (RDS) for later serving.
[x] Your web app will require low latency, so as you develop your model, the speed at which it can make an inference will determine which architecture you need.
[x] If inference takes a long time, you should implement a system architecture with “offline” inference. Collaborative filtering will most likely require a system architecture with “offline” inference.
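A minimal sketch of "offline" inference as described above: score every pre-determined input combination ahead of time, store the results, and serve them with a lookup. It assumes an in-memory SQLite database as a stand-in for RDS; `slow_model`, the table name, and the input fields are hypothetical.

```python
import itertools
import sqlite3


def slow_model(genre: str, age_group: str) -> float:
    """Stand-in for an expensive model; returns a made-up score."""
    return round(len(genre) * 0.1 + len(age_group) * 0.01, 2)


def precompute(conn, genres, age_groups) -> None:
    """Score every input combination ahead of time and store the results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "genre TEXT, age_group TEXT, score REAL, "
        "PRIMARY KEY (genre, age_group))"
    )
    rows = [
        (genre, age_group, slow_model(genre, age_group))
        for genre, age_group in itertools.product(genres, age_groups)
    ]
    conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)", rows)
    conn.commit()


def serve(conn, genre: str, age_group: str):
    """Low-latency lookup at request time; the model is not called."""
    row = conn.execute(
        "SELECT score FROM predictions WHERE genre = ? AND age_group = ?",
        (genre, age_group),
    ).fetchone()
    return row[0] if row else None
```

The trade-off is that `serve` can only answer for combinations that were precomputed; any other input returns None.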
Configurations
[x] All functions should have arguments (with limited exceptions).
[x] Standard convention is to capitalize all environment variable names.
[x] Centralize reading of environment variables in one configuration script.
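A minimal sketch of centralizing environment variables in one configuration script. The variable names (`DB_CONNECTION_STRING`, `LOG_LEVEL`) are hypothetical; the optional `environ` parameter exists only so the function can be tested without changing the real environment.

```python
import os


def load_config(environ=None) -> dict:
    """Read every environment variable the application needs in one place."""
    env = os.environ if environ is None else environ
    return {
        # Upper-case names follow the usual environment variable convention.
        "db_connection_string": env.get("DB_CONNECTION_STRING"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }


def connection_string(config: dict) -> str:
    """Take the configuration as an argument instead of reading globals."""
    value = config["db_connection_string"]
    if value is None:
        raise ValueError("DB_CONNECTION_STRING is not set")
    return value
```

Other modules then receive values from `load_config()` as arguments and never call `os.environ` themselves.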
Data Architectures
[x] Use row-based storage when you need to access entire records, such as when exposing a user’s information in an app.
[x] Use columnar-based storage when doing analytical queries, such as aggregations.
[x] Use object stores, like S3, for storing raw data (prior to any processing).
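A toy illustration of the row-based versus columnar distinction above, using plain Python structures rather than a real storage engine. The field names and values are made up.

```python
# Row-based layout: all fields of one record are stored together.
rows = [
    {"user_id": 1, "name": "Ada", "spend": 10.0},
    {"user_id": 2, "name": "Lin", "spend": 30.0},
]

# Columnar layout: all values of one field are stored together.
columns = {
    "user_id": [1, 2],
    "name": ["Ada", "Lin"],
    "spend": [10.0, 30.0],
}


def get_user(user_id: int):
    """Whole-record access: the row layout returns the record in one piece."""
    return next((row for row in rows if row["user_id"] == user_id), None)


def total_spend() -> float:
    """Aggregation: the columnar layout reads only the one column it needs."""
    return sum(columns["spend"])
```

Fetching a full user from `columns` would require touching every column, while summing `spend` from `rows` would require reading every full record.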
Data Ingestion
[ ] Use maps (dictionaries) to dedupe values
[ ] Use maps (dictionaries) to retain lookup data when processing
[x] When enriching a dataset with reference info
[x] For small enough reference sets
[ ] Use sets to establish uniqueness
[ ] Use arrays (lists) to iterate through records
[x] processing each field/record
[x] validating each field/record
[ ] searching for a value in a small dataset
[x] Use a file format like Avro that handles schema evolution, for datasets that have changing schemas
[x] Partitioning large datasets with a partition key like date can improve query speeds and allow you to process subsets of a dataset in batches and/or concurrently
[ ] Log any data quality/validation issues you encounter
[x] Use appropriate data types
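A minimal sketch combining several of the items above: a dictionary to dedupe and to hold a small reference set, a list to iterate through records, a set to establish uniqueness, and logging for validation issues. The record fields and the `COUNTRY_NAMES` reference data are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

# Small reference set kept in a dictionary for fast lookups (made-up data).
COUNTRY_NAMES = {"US": "United States", "CA": "Canada"}


def ingest(records):
    """Dedupe, validate, and enrich a list of record dictionaries."""
    deduped = {}  # keyed by id, so a later duplicate overwrites an earlier one
    for record in records:  # list iteration: validate each record in order
        if not isinstance(record.get("id"), int):
            logger.warning("Skipping record with invalid id: %r", record)
            continue
        deduped[record["id"]] = record

    enriched = []
    for record in deduped.values():
        country = COUNTRY_NAMES.get(record.get("country_code"))
        if country is None:
            logger.warning("Unknown country code in record %s", record["id"])
        enriched.append({**record, "country": country})

    # A set establishes the unique country codes seen after deduplication.
    unique_codes = {r["country_code"] for r in enriched if r.get("country_code")}
    return enriched, unique_codes
```

In this sketch the last record for a given id wins; keeping the first instead would only require checking `if record["id"] not in deduped` before the assignment.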
Effective Queries
[x] Put reserved keywords (e.g. SELECT, FROM) in all upper case.
[x] Make column names all lower case.
[x] Only select the columns you need. SELECT * degrades query performance.
[x] Also, if the table you query changes schema, the result of a SELECT * query will change as well.
[x] Place each field in the SELECT, GROUP BY, ORDER BY statements on a separate line.
[x] Place table names in FROM, JOIN, etc. on their own lines to make it easy to locate what tables are being used.
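A minimal sketch of a query laid out according to the items above, run against an in-memory SQLite database so it is self-contained. The `orders` table and its columns are hypothetical.

```python
import sqlite3

# Upper-case keywords, lower-case column names, one field per line,
# and the table name on its own line. Only the needed columns are selected.
QUERY = """
SELECT
    customer_id,
    SUM(amount) AS total_amount
FROM
    orders
GROUP BY
    customer_id
ORDER BY
    customer_id
"""


def run_example():
    """Create a small table, run the query, and return the result rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, note TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 5.0, "a"), (2, 7.5, "b"), (1, 2.5, "c")],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

Because the query names its columns, adding or removing the unused `note` column would not change the shape of the result.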
Model Reproducibility
[x] Come up with a plan for versioning your code, configurations, and artifacts.
[x] In this class, use YAML to version your machine learning pipeline parameters and configurations.
[x] Store raw data and artifacts in S3.
[x] Version code and YAML files using git.
[x] Document your machine learning pipeline workflow with a Makefile or bash script.
[x] Always set, document, and version any random seeds used.
[x] Use Docker and a requirements.txt file to control your environment.
[x] Always run your model pipeline twice and compare artifacts to ensure reproducibility.
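A minimal sketch of setting a random seed and comparing artifacts from two runs. It assumes the seed would normally come from the versioned YAML configuration; here it is passed as an argument. `run_pipeline` is a hypothetical stand-in for a real model pipeline.

```python
import hashlib
import json
import random


def run_pipeline(seed: int) -> bytes:
    """Stand-in pipeline: produce a serialized artifact from seeded randomness."""
    rng = random.Random(seed)  # a local generator avoids hidden global state
    weights = [rng.random() for _ in range(5)]
    artifact = {"seed": seed, "weights": weights}
    # sort_keys gives a stable byte representation to compare across runs
    return json.dumps(artifact, sort_keys=True).encode("utf-8")


def artifact_digest(artifact: bytes) -> str:
    """Return a SHA-256 digest that identifies the artifact's exact contents."""
    return hashlib.sha256(artifact).hexdigest()


def is_reproducible(seed: int) -> bool:
    """Run the pipeline twice with the same seed and compare the digests."""
    first = artifact_digest(run_pipeline(seed))
    second = artifact_digest(run_pipeline(seed))
    return first == second
```

Comparing digests rather than whole files keeps the check cheap when artifacts are large, and the digest itself can be recorded alongside the versioned configuration.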
EC2 Cloud Service
[x] Use long-term persistence (e.g. S3 or RDS) when working with EC2 instances. Don’t just keep your data on the server.
[x] If you do need a lot of fast read/writes of files, use EBS attached storage.
[x] By default, instances are not backed by EBS. Non-EBS-backed machines cannot be “stopped”. They are either running or “terminated”. Any data that is not in EBS or long-term persistence like S3 or a database will be lost when you terminate the instance.
[x] For most workloads S3 makes more sense than EBS (slower, but cheaper).
Feedback
[ ] Please have final project Docker commands use a connection string, not the MYSQL environment variables.
[x] README should have a standalone section enumerating/describing any env vars used in the project.