CoreDNS Projects for Summer of Code 2020

yongtang commented 4 years ago

Please Note: This is a tracking issue for Summer of Code. Anyone interested in this implementation should check link there.

Please Note: This is reserved for Summer of Code, we don't accept PRs outside of Summer of Code, or without coordination with maintainers.

Below is the list of candidate CoreDNS projects for Summer of Code 2020:

External health check and orchestration of CoreDNS in Kubernetes clusters

Description: CoreDNS is the cluster DNS server for Kubernetes and is critical to the overall health of a Kubernetes cluster. It is important to monitor the health of CoreDNS itself and to restart or repair any CoreDNS pods that are not behaving correctly. While CoreDNS exposes a health check itself, the health check is not UDP (DNS) based. The existing health check also runs locally, so its result can differ from what other pods see when they reach CoreDNS remotely. This project intends to build an application that checks CoreDNS health externally over UDP (DNS) and restarts CoreDNS pods by interacting with the Kubernetes API from Go. This is an important project that will improve the overall health of Kubernetes clusters through automation.
Recommended Skills: Go, DNS, Kubernetes
Mentor(s): Yong Tang (@yongtang), Paul Greenberg (@greenpau)
Implementation: The deliverable of this project is a Go program that can be deployed independently in a Kubernetes cluster, monitors the CoreDNS pods in the same cluster, and interacts with the Kubernetes API (server) to restart CoreDNS pods as needed.
Reference: Please see coredns/coredns#3617 for some related discussions.
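
As a rough illustration of what an external UDP (DNS) based check could look like, here is a minimal sketch. It is written in Python only to keep it short; the project itself calls for Go. The function names (`build_query`, `is_healthy_response`, `probe`) and the choice to treat "an A query answered with NOERROR" as healthy are assumptions made for illustration, not part of the project definition.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def is_healthy_response(data: bytes, query_id: int = 0x1234) -> bool:
    """Accept a reply whose ID matches, has QR=1, and whose RCODE is NOERROR."""
    if len(data) < 12:
        return False
    resp_id, flags = struct.unpack("!HH", data[:4])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return resp_id == query_id and is_response and rcode == 0


def probe(server: str, name: str, port: int = 53, timeout: float = 2.0) -> bool:
    """Send one UDP query and report whether a healthy reply came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, port))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable servers
            return False
        return is_healthy_response(data)
```

A checker could call something like `probe("10.96.0.10", "kubernetes.default.svc.cluster.local")` against each CoreDNS pod IP (the address here is a placeholder) and only ask the Kubernetes API to restart a pod after several consecutive failures.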

Anomaly detection of CoreDNS server through machine learning

Description: As a DNS server, CoreDNS is critical to the overall devops infrastructure. Any anomaly related to the CoreDNS server should be taken seriously. While alerting rules (combined with monitoring tools such as Prometheus) help in discovering issues, those rules are often crafted manually and require human expertise. It would help a lot if machine learning could be used to further automate monitoring and alerting in case of an anomaly. This project intends to build and train a model that can be used for anomaly detection of a CoreDNS server through metrics collected from Prometheus. Since the metrics pipeline to Prometheus is already available in CoreDNS, the project’s focus is mostly on model building. It is suggested to use tf.keras to build the model. A successful model should at least be able to detect one scenario that warrants an alert and requires further devops or security intervention.
Recommended Skills: DNS/CoreDNS, Prometheus, Keras/TensorFlow, Python
Mentor(s): Yong Tang (@yongtang), Paul Greenberg (@greenpau)
Implementation: The deliverable of this project is a Keras model. The model should be trained and validated through the data collected from Prometheus server (where CoreDNS' metrics are exported).
Reference: Please see coredns/coredns#3541 for some related discussions.

For anyone who is interested in GSoC, please follow the RFC process set forth in this repo.

greenpau commented 4 years ago

@WJayesh , @omkarprabhu-98 , also, you could create a few diagrams describing your specific implementation. It is a great way to communicate your ideas.

wjayesh commented 4 years ago

also, you could create a few diagrams describing your specific implementation. It is a great way to communicate your ideas.

Great suggestion! I'll add a few diagrams to my proposal to complement the writing

Ideally we want to have most of the discussions public

Sounds good, I'll post the proposal here once I'm done with the illustrations

wjayesh commented 4 years ago

@yongtang @greenpau Here is the link to my proposal! Awaiting community feedback :heart_decoration:

pratikmishra356 commented 4 years ago

@yongtang I am Pratik Mishra, a machine learning enthusiast. I found the project "Anomaly detection of CoreDNS server through machine learning" relatable and interesting. I would like to get an idea of the sample data. Could you please help me with this?

yongtang commented 4 years ago

@pratikmishra356 the best way to get sample data is to set up a coredns server and a prometheus server, and export coredns metrics into prometheus. Then you can easily import data from prometheus through tensorflow-io to be used in tf.keras.

Below is a complete tutorial on setting up coredns and prometheus and getting the data into tf.keras: https://github.com/tensorflow/io/blob/master/docs/tutorials/prometheus.ipynb

(We are also working with the tf-docs team to publish the tutorial on tensorflow.org/io, though for now at least you can see it on GitHub)

You can click Run in Google Colab at the top of the tutorial and follow the steps. Google Colab gives you a CPU/GPU/TPU machine to use for free.

The tutorial also has a simple LSTM model. Keep in mind this LSTM model is more of a "getting started" stub and may not be useful in a real devops environment.

And for this GSoC project, the deliverable is a concrete model that could be used in a real devops environment with coredns metrics.
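
For a quick look at the raw data without going through the tutorial, the samples can also be pulled from Prometheus' HTTP API directly. The sketch below is only one possible way to do that, using the standard library; the helper names (`range_query_url`, `parse_matrix`, `fetch_series`) are made up for illustration, and the metric name in the usage note is an assumption that depends on the CoreDNS version.

```python
import json
import urllib.parse
import urllib.request


def range_query_url(base_url, query, start, end, step):
    """Build a URL for Prometheus' /api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Flatten a range-query result into {label tuple: [(timestamp, value), ...]}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        key = tuple(sorted(item["metric"].items()))
        # Prometheus returns each sample as [unix_timestamp, "value-as-string"].
        series[key] = [(float(ts), float(value)) for ts, value in item["values"]]
    return series


def fetch_series(base_url, query, start, end, step="15s"):
    """Run one range query against a Prometheus server and parse the reply."""
    url = range_query_url(base_url, query, start, end, step)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_matrix(json.load(resp))
```

With a local setup like the one in the tutorial, a call such as `fetch_series("http://localhost:9090", "coredns_dns_requests_total", start, end)` would return one list of samples per label combination (the metric name may be different in your CoreDNS version).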

greenpau commented 4 years ago

Here is the link to my proposal!

@WJayesh , nice job putting the proposal together. Although you added Google Docs, it will be better if you would create a pull request with your proposal. This way we could have GH-enabled review of your work.

Here is one way to include diagrams in Markdown: https://github.com/TLmaK0/gravizo Another is https://www.draw.io/?mode=github

wjayesh commented 4 years ago

it will be better if you would create a pull request with your proposal

@greenpau What repo should I use for this PR?

greenpau commented 4 years ago

What repo should I use for this PR?

@WJayesh , this repo for now. Please prepend "WIP: " to the title of your PR. This way CI does not run. I would suggest creating the following file assets/gsoc2020/wjayesh/README.md.

The PR does not have to be merged, but you get your own personal discussion thread 😄

miekg commented 4 years ago

What repo should I use for this PR?

@WJayesh , this repo for now. Please prepend "WIP: " to the title of your PR. This way CI does not run. I would suggest creating the following file assets/gsoc2020/wjayesh/README.md.

The PR does not have to be merged, but you get your own personal discussion thread 😄

I would argue against that: none of the other maintainers may have seen this thread and everyone will get pinged because there is no owner for where this code is going to land.

If these proposals require tracking from our side we should just create a new repo, possibly with .dreck.yaml and CODEOWNERS and deal with it there. (Then we can also merge these requests properly)

greenpau commented 4 years ago

If these proposals require tracking from our side we should just create a new repo,

@miekg , name for the tool?

miekg commented 4 years ago

If these proposals require tracking from our side we should just create a new repo,

@miekg , name for the tool?

this is gsoc proposals no? coredns/proposal or coredns/rfc ?

greenpau commented 4 years ago

this is gsoc proposals no? coredns/proposal or coredns/rfc ?

@miekg , that works! Could you please create the repo? I like coredns/rfc 😄

miekg commented 4 years ago

this is gsoc proposals no? coredns/proposal or coredns/rfc ?

@miekg , that works! Could you please create the repo? I like coredns/rfc 😄

https://github.com/coredns/rfc


yongtang commented 4 years ago

@miekg @greenpau I opened a PR to add README.md and RFC Template in https://github.com/coredns/rfc

Also, GitHub has a "Transfer Issue" option (on the right side of the web page, below the participants) that we could use to move this issue to https://github.com/coredns/rfc, as it fits the scope there.

Can you move this issue to the rfc repo? Or you can give me write access to the rfc repo, and then I can move the issue myself.

yongtang commented 4 years ago

@miekg @greenpau Update: the PR is https://github.com/coredns/rfc/pull/1

yongtang commented 4 years ago

For anyone who is interested in GSoC, please follow the RFC process set forth in this repo.

yongtang commented 4 years ago

For anyone who is interested in playing with prometheus + tf.keras, the tutorial is now pushed to

https://www.tensorflow.org/io/tutorials/prometheus

You can run the tutorial directly from Google Colab.

cekbote commented 4 years ago

Hi.

I am interested in working on the project titled 'Anomaly detection of CoreDNS server through machine learning'.

I have been working in the field of artificial intelligence for over two years now, and during the summer of 2019, I was the recipient of the Indian Academy of Sciences Summer Research Fellowship, which enabled me to intern at the Computer Science and Automation Department at the Indian Institute of Science (IISc) Bengaluru, under the guidance of Dr Shalabh Bhatnagar. Our group worked on a project titled ‘A unified reinforcement learning framework for demand and supply-side management amongst microgrids’ and published a paper on it.

Due to a collaboration between IISc and AlphaIC (a startup that focuses on creating edge computing devices), I also got an opportunity to work on a project titled ‘Experiments on Low Shot Learning’ where our main goal was to test and create autoencoder based deep learning models that generated decorrelated embeddings which would, in turn, enhance low shot learning accuracies.

Moreover, I worked on a project titled 'Investigating Auction Mechanisms for Smart Grids' in which I used the DDPG algorithm so that smart grids could intelligently bid for energy at the right prices. My project guide and I are currently in the testing phase, and hopefully, this will turn into another paper. (Fingers crossed.)

The project that has been floated is something that I would love to work on. I have used Keras/TensorFlow as well as PyTorch extensively and would have no issues working with them.

I went through the discussion, however I am still not very clear about what 'anomalies' do you want to detect? Moreover, is there some project task that I should complete before I submit the proposal?

Please find my resume here.

greenpau commented 4 years ago

I went through the discussion, however I am still not very clear about what 'anomalies' do you want to detect? Moreover, is there some project task that I should complete before I submit the proposal?

@Chanakya-Ekbote , one of the key elements of the project is defining what an "anomaly" is. For example, how do you detect when a DNS client exfiltrates data from a company via DNS? Reference: https://blogs.akamai.com/2017/09/introduction-to-dns-data-exfiltration.html

yongtang commented 4 years ago

Thanks @Chanakya-Ekbote for the introduction. The CoreDNS GSoC project is trying to apply machine learning to the devops and DNS fields, with the hope that this will lead to further automation in operations (and reduce the workload and time spent by devops engineers).

With respect to the term anomalies, there are several types of anomalies.

Security anomaly: this typically means a security breach or risk where immediate action has to be taken to prevent further loss of data privacy or money. The example provided by @greenpau is a good one.
Operational anomaly: this typically means some service (such as CoreDNS) is not functioning correctly. As DNS is a critical service, this could result in a wider service interruption (e.g., web servers and database servers that need DNS), leaving your users unable to access any services. In case of an operational anomaly, devops have to be alerted so they can quickly identify the issue and remedy it as soon as possible.

To give an example: with CoreDNS, you normally expect memory usage to roughly track the number of concurrent incoming queries. In an anomaly, however, you might see high memory usage for no apparent reason. This could indicate that CoreDNS is not functioning correctly, and you want to alert devops immediately (or, even better, restart the CoreDNS server automatically without devops intervention).

One challenge with anomaly detection is that not all the information is explicitly available. For example, how do you define the incoming queries? Keep in mind that the number coredns reports is the number of queries it was able to count, not the number that reached it (processing an incoming query and bumping the counter takes resources as well). So the challenge is to collect more information (even information that is not obviously related) and see if this leads to better findings.

Another example is increased usage. In one event, Zalando experienced a spike in usage: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md This resulted in a total outage. Early alerting would have helped a lot. As you can see from the graph in the link, this is more or less a time series problem: identify the anomaly that does not match the past pattern. There are quite a few existing ML algorithms that may already fit.
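
To make the "does not match the past pattern" idea concrete, here is a minimal baseline sketch: a rolling z-score over a single metric series. This is a deliberately simple stand-in, not the model the project is after; the function name `rolling_zscore_alerts` and the default window and threshold are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_alerts(values, window=30, threshold=3.0):
    """Return the indexes whose value deviates from the trailing window
    by more than `threshold` standard deviations."""
    history = deque(maxlen=window)  # the most recent `window` samples
    alerts = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                alerts.append(i)
        history.append(value)
    return alerts
```

The series passed in could be, for example, memory usage divided by query rate, so that a jump in memory without a matching jump in queries stands out. A learned model would be expected to do better than this on patterns such as daily seasonality.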

cekbote commented 4 years ago

@greenpau @yongtang Thank you for clearing up some of the queries that I had. From the materials that you have shared, what I gather is this: the anomalies to be detected do not correspond to well-annotated data. What I mean by that is that the anomaly data is a) not well-defined and b) not well labelled. Moreover, new anomalies may arise that are not part of the 'error data distribution' on which we have trained the model. Please correct me if I'm wrong, as I would like to understand the problem in greater depth.

If my understanding is correct then this represents an unsupervised learning problem. However, there could be clever ways to work around the 'unsupervised' nature of the problem by using Autoencoders (or Variational Autoencoders) as well as Moment Matching Networks to name a few. This comparison between anomalies can be done by looking at the latent vector representation of the time series data and using a distance metric for the same, to understand which cluster it belongs to. (Each time-series data could be made into an image by combining sequences of past and present inputs.) The issue with this approach is that some amount of labelled data would be required. Hence, if you could explain some properties about the data you intend to work with, that would help elucidate the problem to a great extent. If no data is available at all, then using unsupervised methods may be the only way forward. Moreover, if you could also share some idea about what other approaches you feel might help in anomaly detection, that would be perfect.

yongtang commented 4 years ago

@Chanakya-Ekbote The GSoC CoreDNS project itself is not a pure machine learning task. A pure machine learning task could have well-defined input/output, and participants would only need to play with different model layers to find a better fit. We do expect the participants to have some basic understanding of DNS and related metrics. Otherwise it will be very difficult to come up with good models. I would suggest you take a look at coredns and prometheus themselves (ideally set up your own local server). Both are pretty easy to set up to get started.

cekbote commented 4 years ago

@yongtang Alright. I'll set up my local server and get back to you in a while. Thank you.

cekbote commented 4 years ago

@yongtang @greenpau I set up the local server as we had discussed and also experimented with pulling the metrics in order to understand what the data is. Moreover, I also researched some anomalies that commonly occur, such as DNS tunneling, DNS exfiltration, excess memory usage, etc.

I also went through the alerting rules of Prometheus where each engineer has to manually write 'rules' to check whether an anomaly has occurred or not. Naturally, this process requires a lot of testing as well as experience in the field and moreover it may so happen that the engineer may not include some edge cases that may give false positives or false negatives. Moreover, these anomaly detection rules are only for specific anomalies and hence may not generalise to other anomalies. Hence, it would make sense to model this as a machine learning problem where the models could learn the relations between the metrics and the anomaly.

In addition to this, I went through the Prometheus + tf.keras tutorial. It was very easy to follow and the LSTM model at the end of it, made me understand the goal of the project.

Coming to the detection part which we now have to handle through a model (because the existing methodology of defining 'rules' is cumbersome and does not generalize well), I have a few queries:

The main query I have is about the data itself and its relation to anomalies. Since anomalies are of different types and because of different causes, we would need a lot of data to make sure we collect all the samples that correspond to normal data and samples that correspond to anomalies. Now if we train a model on this, the model will more or less be pretty robust, but the issue lies in getting this data. Hence, would we be generating this data through certain CoreDNS experiments, or is there some existing repository where this data would be available?
Do you have any preference whether the model should be supervised or unsupervised?

greenpau commented 4 years ago

Hence, would we be generating this data through certain CoreDNS experiments, or is there some existing repository where this data would be available?

@Chanakya-Ekbote , I am not aware of such a repository. You will likely rely on the data that you "export". I would say there is a high likelihood that you will need new metrics to be exported. I would suggest watching https://www.youtube.com/watch?v=Zk09Mbu0YQk

yongtang commented 4 years ago

@Chanakya-Ekbote @greenpau we could also see if there is any interest from some organizations in participating and helping by donating some metrics data from a coredns production server. This could be kind of tricky: even though metrics data does not contain PII, security teams in many organizations may not be happy to see data shared with outsiders. Another option is a Packet server from CNCF, where we could also get some data from simulated tests.

/cc @chrisohaver @rajansandeep do you know if some metrics data could be available?

cekbote commented 4 years ago

@greenpau @yongtang I saw the video in the link and am totally open to creating new metrics if required, but the basic question still remains.

If no repository exists, then we have to create the data ourselves. Now would we go about labelling this data ourselves depending on various experiments? (What I mean by experiments is deploying various different computers having different capabilities on a local server and labelling whether an anomaly occurred or not, ourselves. We could also vary the number of computers.) This method would work very well for a machine learning problem, however, the issue is that labelling this data is very cumbersome.
The second solution is that we ignore the anomaly data. We focus only on generating normal data. This would again need the 'experiments' as described above. Then by using a machine learning model, we could infer that any data distribution that is close to the non-anomaly data distribution (using KL divergence, maximum mean discrepancy or even looking at the distance in latent space) would not be considered as an anomaly. (Here close would be defined based on a particular threshold, and if the distance or divergence metric is less than a particular threshold value, it will be close.). If the data distribution is far from this non-anomaly data, then it will be considered an anomaly. The issue with this is it may generate some significant false positives and false negatives. To combat these false positives and false negatives, we could send this report to the onsite engineer, who can label them as anomalies or not, and then the model can be retrained.
If a repository does exist, then it's perfect, we can then brainstorm about what models to use.
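
To illustrate the second option (model only the normal data and flag anything too far from it), here is a small sketch that compares a window of samples against a baseline using KL divergence over histograms. It is only a toy version of the idea; the function names, the bin edges and the threshold are all assumptions, and a real setup would probably compare learned representations rather than raw histograms.

```python
import math
from bisect import bisect_right


def histogram(samples, edges):
    """Normalized histogram with add-one smoothing so no bin has zero mass.
    Values outside the edges are clamped into the first or last bin."""
    bins = len(edges) - 1
    counts = [1.0] * bins
    for x in samples:
        idx = min(max(bisect_right(edges, x) - 1, 0), bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def is_anomalous(window, baseline, edges, threshold):
    """Flag the window if it diverges from the normal baseline by more than threshold."""
    p = histogram(window, edges)
    q = histogram(baseline, edges)
    return kl_divergence(p, q) > threshold
```

The windows that get flagged are the ones that could be sent to the onsite engineer for labelling, and the threshold (or the model) can then be adjusted based on that feedback.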

Note : While trying to find some data, I found this: https://securitytrails.com/dns-trails. It seems to have past DNS data, however, it's not strictly open source. Would this be of some use in our application? (I think it will be of some use if we do get this data, but I wanted to run it by you once as you have more experience as well as expertise in this field than I do). The only issue is, I don't think the Prometheus data is included in this.

yongtang commented 4 years ago

@Chanakya-Ekbote this is an ongoing effort. As you can see, our intention is to replace the tedious, human-driven alerting rules (whose quality varies with expertise level) with automatic detection through ML. For GSoC we don't need a complete solution that solves ALL problems; we only need a solution that solves at least ONE problem with good quality.

So, to move forward, there are several options:

Set up a server and simulate normal traffic in Kubernetes clusters. We could use Packet servers, or apply to CNCF for some credits if the project is formally accepted in GSoC. We simulate several normal scenarios (these could be based on real-world scenarios like the Zalando case). The data we collect will be used for training and validation purposes.
Get access to a production server and log metrics for a certain period of time. We assume all traffic is normal and build a model that captures the collected data. The validation could be done by injecting non-normal data into the picture.
Get access to a production server and log metrics for a certain period of time. We use human-crafted alerting rules to identify (and label) normal data. The labeled data will be used for training purposes. Note that in this approach the quality of the labels depends on the effectiveness of the alerting rules. However, we want to start from "somewhere".
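
As a sketch of the third option, labels can be bootstrapped by running hand-written rules over the logged samples. Everything specific here is hypothetical: the field names (`memory_bytes`, `qps`, `servfail_ratio`), the rule names and the thresholds are placeholders, not real alerting rules from any deployment.

```python
# Hypothetical hand-crafted rules; each one returns True when it would fire.
RULES = {
    "memory_per_query_high": lambda s: s["memory_bytes"] / max(s["qps"], 1) > 1_000_000,
    "servfail_ratio_high": lambda s: s["servfail_ratio"] > 0.05,
}


def label_with_rules(samples, rules):
    """Attach a 'normal' or 'anomaly' label to each sample based on which rules fire."""
    labeled = []
    for sample in samples:
        fired = [name for name, rule in rules.items() if rule(sample)]
        label = "anomaly" if fired else "normal"
        labeled.append({**sample, "label": label, "rules": fired})
    return labeled
```

Keeping the list of rules that fired next to each label makes it easier to see later which labels came from a weak rule, since the model can only be as good as the rules that produced its training data.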

For https://securitytrails.com/dns-trails I will take a look. However, I assume the data is more security related. One thing about security-related DNS anomalies is that they tend to be "per-source-ip" related. The generic "all-ip grouped" metrics or events will not help that much.

In many cases, the first step in detecting a DNS security attack is to build a hash table that separates queries by source IP. This is not possible with the metrics in CoreDNS today, so I am not sure this is a direction we want to tackle without significantly updating the metrics plugin in coredns.

greenpau commented 4 years ago

One thing about security-related DNS anomalies is that they tend to be "per-source-ip" related. The generic "all-ip grouped" metrics or events will not help that much.

@yongtang , very good point 👍 I would say network ranges carry importance, e.g. guest networks, user networks, server networks, etc. It may be a good idea to track requests on per-range (per-network) basis in prometheus plugin.
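
A small sketch of the two groupings mentioned above, per source IP and per network range. It only shows the bookkeeping; the function names and the example ranges are assumptions, and nothing like this exists in the CoreDNS metrics plugin as discussed in this thread.

```python
import ipaddress
from collections import Counter


def count_by_source(source_ips):
    """Hash table of query counts keyed by source IP."""
    return Counter(source_ips)


def count_by_network(source_ips, networks):
    """Query counts per named network range; unmatched sources go to 'other'."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    counts = Counter()
    for ip in source_ips:
        addr = ipaddress.ip_address(ip)
        name = next((n for n, net in parsed.items() if addr in net), "other")
        counts[name] += 1
    return counts
```

Grouping per range keeps the number of distinct keys bounded, which matters if the counts are exported as metric labels, while per-IP counts can grow with the number of clients.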

cekbote commented 4 years ago

@yongtang Thank you for the clarifications. I totally agree with all of those approaches, and I believe that trying all of these methods would be the best way forward. What could be done for the GSOC is that all three methods could be tried initially, and then depending on the results that we get, we could then decide what approach works best for us? Moreover, by implementing all the three approaches, we could also garner some intuition about what works best for the Prometheus data and then we could combine the best features of all the three approaches to get a better model.

With regards to the machine learning model we could use:

I was thinking of using a CNN-based autoencoder for converting time series data into a latent vector. The time-series nature of the data can be addressed by combining several rows of time series data into an 'image'; at each time step this 'image' is updated by removing the oldest row and adding the newest one, something like a FIFO queue. The distance between the latent vectors of normal and abnormal data can then be checked to see how different the two are. If this distance is greater than a particular threshold, the data would be classified as abnormal; otherwise it would be classified as normal.
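
A minimal sketch of the windowing and thresholding parts of this idea, leaving the autoencoder itself out. The helper names are made up for illustration, and `latent` and `normal_center` stand for vectors that would come from the trained encoder.

```python
import math
from collections import deque


def sliding_windows(rows, height):
    """Yield 'images' of `height` consecutive metric rows.
    The deque drops the oldest row whenever a new one is appended."""
    window = deque(maxlen=height)
    for row in rows:
        window.append(list(row))
        if len(window) == height:
            yield [list(r) for r in window]


def is_abnormal(latent, normal_center, threshold):
    """Classify by Euclidean distance between a latent vector and the normal centre."""
    return math.dist(latent, normal_center) > threshold
```

Each yielded window shares all but one row with the previous one, so consecutive 'images' overlap heavily; that is worth keeping in mind when splitting the data into training and validation sets.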

Please let me know your thoughts.

With regards to the production server:

Since this can be a server where we run our experiments, is there a cap on how many users would be present on the server? I think we could vary both the number and the type of machines to get unbiased data. Moreover, could we access the server logs of some open-source organisations as well? This would help with the third point that you discussed, where we could initially use human rules.

Apologies for the late reply, I was traveling yesterday, hence the delay. Apologies for any inconvenience caused.

cekbote commented 4 years ago

@yongtang @greenpau

greenpau commented 4 years ago

@Chanakya-Ekbote , please create your PR with proposal just like @WJayesh did.

cekbote commented 4 years ago

@greenpau Sure. I will submit my proposal in the next 3 to 4 days. Thank you.

cekbote commented 4 years ago

@greenpau @yongtang I was asked to vacate my university because of the onset of COVID-19, and I was unable to finish my GSoC proposal due to that. I will send a proposal within the next two to three days. Apologies for the delay. I hope all is well wherever you are. Please take care.

yongtang commented 4 years ago

Thanks @Chanakya-Ekbote for the update, and sorry for late replies as you might imagine given the current COVID status. Hope all is well for everyone 👍

cekbote commented 4 years ago

@yongtang @greenpau Should I submit my proposal as a commentable doc file or as a PR?

yongtang commented 4 years ago

@Chanakya-Ekbote Please submit a PR with template https://github.com/coredns/rfc/blob/master/yyyymmdd-rfc-template.md, you can check README.md of this repo for more info.

cekbote commented 4 years ago

@yongtang @greenpau I have submitted a PR for the GSoC proposal. Apologies for the delay. Please let me know your feedback.

Sylfrena commented 4 years ago

@yongtang

Hi,

I stumbled on this project a few days ago while searching for open source anomaly detection in ITops and would love to submit a proposal for the soc even though it is rather last moment.

I went through this thread and have a couple of questions regarding the project. I came across some algorithms that might turn out to be useful but some of them are not implemented in the Keras library. Does the model have to be strictly Keras based or are you open to exploring different machine learning/deep learning libraries that are also used for similar use cases ?

yongtang commented 3 years ago

The project has been finished and is available:

https://github.com/cekbote/coredns_ml_plugin

https://mlbridge.github.io
