Work with external customers and partners to help make them successful
Respond to, troubleshoot, and drive root cause analysis (RCA) of complex live production incidents and cross-platform issues involving OS, networking, and databases in cloud-based SaaS/IaaS environments, following and implementing SRE best practices
Continuously monitor, analyze, and measure availability, latency, and overall system health using tools like Prometheus, Stackdriver, Elasticsearch, Grafana, and SolarWinds, and develop steps to improve system and application performance, availability, and reliability
Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available
Keep up to date with security and proactively identify, diagnose, and solve complex security issues
Maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure
Apply automation to any tasks or parts of the system that would benefit from it or are performed manually
Utilize Atlassian Jira to track issues to resolution based on their priority
What You Need to Be Successful (Skills)
Must Have
Advanced knowledge of incident management processes and the ability to resolve issues within the organization's agreed SLAs/SLOs
Advanced knowledge of Linux operating systems (Ubuntu, CentOS, etc.)
Advanced knowledge of container-based architecture (Kubernetes)
Advanced knowledge of tools and languages such as Ansible, Python, Bash, Go, PowerShell, and other scripting languages
Intermediate knowledge of algorithms, data structures, and databases (SQL/NoSQL)
Intermediate knowledge of networking concepts
Intermediate understanding of cloud environments such as GCP or AWS
Intermediate knowledge of site reliability engineering principles
Education
BS in computer science or equivalent, or 10+ years of professional experience.
Nice to Have
About Our Company
We are looking for a CRE with a deep understanding of complex distributed system platforms and cloud technologies, and the ability to explain them simply to customers and to SREs within a customer organization.
You will have the opportunity to work with your teammates and our customers to support many new, leading-edge technologies that solve real challenges. You will work to provide robust feedback and guidance to our Product and Engineering teams while being a voice for our customers. You want to make our customers successful while strengthening their relationship with NetApp. You can make a huge impact and have real ownership for the work you do.
What You'll Get
Salary Expectation
USD 166,500 - 203,500 per year
Location
Remote (United States of America)
How to apply
https://kube.careers/customer-reliability-engineer-netapp-jo96