Hello, I am Aditya. I'm currently a pre-final year Computer Science undergrad at BITS Pilani. I'm specializing in Machine Learning.
I have worked on similar problems and datasets before wherein I used a combination of dimensionality reduction and clustering algorithms to analyze log files. I also have some experience with anomaly detection.
Do you have any representative dataset (machine logs) that I can do some preliminary analysis on? Also, could you explain a bit more about the kind of analysis expected?
Thanks!
Many thanks for your interest and sorry for the delayed response.
I am afraid that I cannot provide you with a sample of the dataset at this stage. However, Matthew may be able to provide some input here.
In general, we have a lot of data for many different diagnostics. With this challenge, we want to start exploring this space with machine learning methods. In particular, we are interested in identifying spikes in requests before they happen, for example a large number of data retrievals. This could be done with methods for 1-D timeseries analysis. However, it could also be done with a more sophisticated analysis of the various properties available in the machine logs, for example via a sensitivity analysis in a multi-dimensional space or a timeseries analysis that takes many parameters into account. We would like to start simple and increase complexity during the project. However, the applicant can have significant influence on the future direction of the project.
The detection of outliers is the main motivation for the project. However, there is much more that could be done, for example a detailed sensitivity analysis of the various diagnostics that are available within the machine logs.
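As a rough illustration of the "start simple" 1-D timeseries option mentioned above, here is a minimal sketch of a trailing-window z-score spike detector. The function name, the window size, the threshold and the sample counts are all assumptions made for illustration; they are not taken from ECMWF logs or from any existing tooling. Note that this only flags a spike once it has occurred; predicting spikes ahead of time would need a forecasting step on top.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_spikes(counts, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations above the mean of the preceding `window` values."""
    history = deque(maxlen=window)  # trailing window of past values
    spikes = []
    for i, value in enumerate(counts):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip perfectly flat windows to avoid dividing by zero.
            if sigma > 0 and (value - mu) / sigma > threshold:
                spikes.append(i)
        history.append(value)
    return spikes


# Hypothetical per-minute retrieval counts (made-up numbers).
requests_per_minute = [10, 12, 11, 9, 10, 11, 10, 60, 12, 11]
print(rolling_zscore_spikes(requests_per_minute))  # prints [7]
```

One limitation of this design is that a detected spike enters the trailing window and inflates the spread for the next few points, so closely spaced spikes may be missed.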
Only 4 days left to apply to be part of ECMWF Summer of Weather Code 2020. Application deadline: Wednesday, 22 April 2020 at 23:59 (BST). Submit your proposal here.
Hi, I am Adithya Niranjan, a classmate of Aditya Ahuja's. I found this project really interesting and plan to work with him on this task. I've previously worked on projects applying deep meta-learning to time-series forecasting and online classification models to EEG-based time-series data, among others.
Just had a few questions to understand the problem better:
1) Would these logs be univariate or multivariate? Asking because this would help us decide which algorithms would be suitable. In general, multivariate time-series have temporal correlations which make the task more complex.
2) Could you give an idea of how the models will be integrated and deployed with the current system? This would help us plan the timeline in our project proposal, i.e. decide how much time to spend on building the models and how much on tying them together with the current system.
3) Could you give us an estimate of how frequent the spikes/anomalies are? One possible approach I was considering was to use a forecasting model to predict load and then compare the prediction with the actual occurrence, treating a large error as a sign of an anomaly. This could possibly be used along with another anomaly detection model as well. But I have a feeling this would depend on how frequent the anomalies are.
Thanks!
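The forecast-and-compare idea in question 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: simple exponential smoothing stands in for whatever forecasting model would actually be used, and the function name, `alpha`, `k`, `warmup` and the sample load values are all hypothetical.

```python
def residual_anomalies(series, alpha=0.3, k=4.0, warmup=3):
    """Flag indices where the one-step-ahead forecast error is more than
    `k` times the mean absolute error observed so far."""
    forecast = series[0]  # initial forecast: the first observation
    abs_errors = []       # absolute forecast errors seen so far
    flagged = []
    for i in range(1, len(series)):
        error = abs(series[i] - forecast)
        # Only judge a point once a few errors have been collected.
        if len(abs_errors) >= warmup:
            typical = sum(abs_errors) / len(abs_errors)
            if typical > 0 and error > k * typical:
                flagged.append(i)
        abs_errors.append(error)
        # Exponential smoothing update for the next one-step forecast.
        forecast = alpha * series[i] + (1 - alpha) * forecast
    return flagged


# Hypothetical load values (made-up numbers).
load = [100, 102, 98, 101, 99, 100, 180, 101, 100]
print(residual_anomalies(load))  # prints [6]
```

As the question anticipates, this kind of check depends on anomalies being rare: each flagged point still feeds into both the forecast and the error baseline, so frequent anomalies would raise the threshold and mask later ones.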
Hi Adithya, It is great that you are interested in this challenge.
Here are my responses. Matthew can disagree if he thinks of this differently:
I hope this helps. Happy writing!
Hi Adithya,
I hope this helps.
Matthew
@dueben and @Matthew-Manoussakis, thank you for the info.
We'll cover both univariate and multivariate approaches in our proposal then. From your answers to Q2, I gather that the primary focus is on getting a good working ML/DL model on the data, so we will focus on that. For now, we have shortlisted several promising approaches, open implementations and papers on the problem, which we'll summarise in our proposal.
Challenge #22 - Applying AI capabilities to address Operations challenges in ECMWF Products Team
Goal
To apply AI capabilities to analyse log data in real-time to be able to predict issues before they occur.
Mentors and skills
Challenge description
Due to the explosion of data in recent years, known as the data avalanche, many companies can no longer cope with the rapid growth in data volumes and the variety of logs produced by their IT environments. On the other hand, ensuring the services' availability and performance is more critical than ever for most businesses. Leading companies are turning to artificial intelligence (AI) for IT operations (AIOps*) to analyze data in real time and predict issues before they occur. This enables them to continuously track and assess the status of their services to improve monitoring and troubleshooting.
Our services in brief
The ECMWF Meteorological Archival and Retrieval System (MARS) enables users to retrieve meteorological data in GRIB/NetCDF via:
In the Products team, we manage the services above and provide tailored data to Member State users, commercial users and public users. These services produce massive amounts of multi-structured log files every day, spread across several disparate systems, which contain underused or hidden valuable information.
Project description
Naturally, the scale and complexity of our services and infrastructures make monitoring and troubleshooting an increasing challenge.
The suggested project is exploratory research that investigates how AI/ML techniques can be used to improve Operations in the Products team. This would enable our team to proactively understand the behaviour of our services, to take preventative actions manually or ideally through automation, to reduce MTTR and to improve user experience. If successful, the developed tools could be extended to improve the operational fidelity of other ECMWF services.
Possible datasets available:
Machine logs produced by Web-API and MARS (stored in Splunk)
Expected Outcomes
Working Python software
Additional information