HigashikataZhangsuke / IsoFaaS

IsoFaaS: Resource Isolation for Function as a service

First all pass version got. #6

Closed HigashikataZhangsuke closed 1 month ago

HigashikataZhangsuke commented 3 months ago

Now trying to add more profiling data, as well as the part data. We may get results by Monday!

HigashikataZhangsuke commented 3 months ago

Tomorrow, working on three things:

  1. PPFaaS slides -> OK, slightly tuned the version. The testing results may not need many slides, so put them at the end as extra parts.
  2. Small bug fix, get the profiling data, and get the result of the last MBA experiment. -> Waiting for the profiling data to be collected, and doing a "second-time test" for MBA. -> Redid MBA, nothing changed. Now working on getting the profiling data.
  3. Try to run some results for our system, and finish the MXFaaS code modification. -> Modification finished. Found that they add the resource usage to the log in their nodecontroller.py. Now modify our code so that multiple functions can run at the same time. Also try to get some MXFaaS results to figure out how their log is used. Last but not least, align their test script with ours. I think it is better for us to run a test of at least 2/3 sec. -> Already know what to do: change the nc part to add a configuration record, then add the trace execution record to the test script. -> OK, nearly everything is corrected and ready for getting the final results.

Then, the day after tomorrow: Get all the results we want.

HigashikataZhangsuke commented 3 months ago

Trace selection: redo, since we may also have M-M functions.

2-function co-run:

  1. ['che', 'mls']
  2. ['omp', 'mls']
  3. ['rot', 'res']
  4. ['mlt', 'omp']
  5. ['pyae', 'res']
  6. ['rot', 'omp']
  7. ['alu', 'che']
  8. ['pyae', 'omp']
  9. ['mlt', 'vid']
  10. ['web', 'mls']

4-function co-run:

  1. ['alu', 'mlt', 'mls', 'che']
  2. ['alu', 'pyae', 'web', 'mls']
  3. ['web', 'omp', 'che', 'res']
  4. ['alu', 'pyae', 'mlt', 'vid']
  5. ['alu', 'web', 'omp', 'mls']

HigashikataZhangsuke commented 3 months ago

For profiling data, record these:

  1. Latest standalone latency.
  2. Peak throughput-CPU curve when running inside our system.
  3. Intel MBA memory profiler's MemBW usage.
  4. Docker stats for memory usage. -> Svc is not like this, maybe this is the only way... Yes you can, but name resolution is a problem. May leave it for later; it does not matter currently.
  5. Per-func CPU usage is one, which is fixed: at least you need one CPU.
  6. Cache usage: also one. The minimum cache way allocation is 1.
  7. Other resource-peakTP curve, for global co-placement.

The bold metrics are currently not important for single-node tests and can be left until the multi-node tests.

HigashikataZhangsuke commented 3 months ago

The Dockerfile needs to change from gunicorn to a single thread. I guess the bad results were caused by multiple threads running together. Try to find out why this bug happens, since ultra-high load may need multiple threads?

Could not find out which part caused this. Skip it for now and come back later if we do find that something went wrong, or if a single thread is not enough to route all requests.

HigashikataZhangsuke commented 3 months ago

Note that the imgres and web code need to be modified. Too long.

PF:

  - CAlu: 0.0111907 + 4.2MB/s
  - MChe: 5.42909 + Max 2600, Avg 1000
  - MImgRes: 0.8005 + Avg 900
  - CImgRot: 0.7840 + Avg 500
  - MMLS: 0.437876 + Avg 3500, Peak 4000
  - CMLT: 7.0979 + Avg 5.2
  - Momp: 4.141488 + 6000MB/s, Max around 6200
  - Cpyae: 0.28206 + 11MB/s
  - Mvid: 1.348898 + 2000MB/s (Peak, Avg ~1500)
  - Cweb: 0.30621425 + 30MB/s

HigashikataZhangsuke commented 3 months ago

Please tune your function co-placement test trace selection based on the new profiling data. -> I think maybe just use 25, 50, 100, 200, 400, 800 as the 6 baselines to figure out the results here. For the single-node test, the pktp should not exceed the maximum workload the node can handle. Also, should requests be sent with a proportional method rather than round robin? Double-check this; could think about it over dinner. Here, use the MBA BW monitor to get the results, so they should be accurate.

HigashikataZhangsuke commented 2 months ago

Need to do these things:

  1. First, finish the PKTP profiling of the last 6 functions.
  2. Think about whether MXFaaS's method of recording CPU usage is correct. I think our log can show both "What amount was allocated to you?" and "What amount did you really use?", since the exclusive part is always ~100. MXFaaS, however, only records the size of the CPU mask, which is the first of the two. I don't think real CPU usage is reflected by that metric. Consider this thoroughly.
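
To make the "allocated vs. really used" distinction concrete, here is a minimal sketch. It is not the IsoFaaS or MXFaaS logging code; `allocated_cpus`, `measure_cpu_usage`, and `busy_loop` are hypothetical names. It assumes "allocated" means the size of the process's CPU affinity mask and "used" means process CPU time divided by wall-clock time.

```python
import os
import time


def allocated_cpus() -> int:
    """Number of CPUs in this process's affinity mask (what was allocated)."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def measure_cpu_usage(work, *args):
    """Run work(*args) and return (result, cores_used).

    cores_used = process CPU time / wall-clock time, i.e. the average number
    of cores the process actually kept busy while the work ran.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = work(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return result, (cpu_used / wall_used if wall_used > 0 else 0.0)


def busy_loop(n: int) -> int:
    """CPU-bound stand-in for a function invocation (hypothetical)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    _, used = measure_cpu_usage(busy_loop, 2_000_000)
    print(f"allocated={allocated_cpus()} cores, used={used:.2f} cores")
```

A process can have a large mask and still use far less than one core, which is why the mask size alone says little about real usage.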

HigashikataZhangsuke commented 2 months ago

Figure out which 8 functions should use MBA and CAT (top 8) -> OMP, MLS, VID, Che, Res
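
One way to arrive at a list like this is to rank the functions by their average memory bandwidth from the PF notes earlier in this thread. That ranking rule is an assumption, since the thread does not say how the selection was made. The numbers are copied from the PF notes, assuming the unlabeled ones are also MB/s; `top_bandwidth_functions` is a hypothetical helper.

```python
# Average memory bandwidth per function, copied from the PF notes in this
# thread. Assumption: values without a unit are also in MB/s.
AVG_MEM_BW_MBPS = {
    "alu": 4.2,
    "che": 1000,
    "res": 900,
    "rot": 500,
    "mls": 3500,
    "mlt": 5.2,
    "omp": 6000,
    "pyae": 11,
    "vid": 1500,
    "web": 30,
}


def top_bandwidth_functions(profile, k):
    """Return the k function names with the highest average bandwidth."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    return ranked[:k]


print(top_bandwidth_functions(AVG_MEM_BW_MBPS, 5))
# -> ['omp', 'mls', 'vid', 'che', 'res']
```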

HigashikataZhangsuke commented 2 months ago

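Based on the profiling data, most of the functions turn out to be C functions, so we may need to add some M functions. Their own bandwidth consumption is not high enough; interference only shows up when they are co-placed with omp. That said, maybe they cannot be ruled out as M functions either. No, they should be C functions: their demand for C is greater than for M, so they hit the C limit first. Think about how to fill this gap.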

HigashikataZhangsuke commented 2 months ago

The code for Res and MLS may need to be reworked to turn them into M-type functions.

HigashikataZhangsuke commented 2 months ago

Got all the profiling data. Update our Invoker, and after dinner, work on the E-E experiment.

HigashikataZhangsuke commented 2 months ago

Some points to figure out:

  1. MXFaaS's resource usage definition -> Nothing found in the paper. Try to redo some tests to figure it out. -> The results show that we can use this method: no matter how a trace is set, MXFaaS seems to always try to use all the CPUs assigned to it.
  2. Enforcewindow usage for us? MXFaaS uses this to limit the length of the test interval. -> Need this filter; add the function to our curl pod.
  3. Small bug fixes in other container-related code.
  4. Trace setting: round robin or not? -> No round robin. The thing is, in your tests, for example co-placing alu and omp, omp will have a pktp of only ~1 req/s at best. Also, some functions need a longer execution time. So if you use the traditional check, "when all requests are served", the final overall TP is not correct. Still need to group all the other requests.

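
A rough sketch of the window-based check: count only the requests that complete inside a fixed measurement window, per function, instead of waiting until all requests are served. This is not the MXFaaS Enforcewindow code; the function names and timestamps below are hypothetical.

```python
def windowed_throughput(completion_times, window_start, window_end):
    """Completed requests per second inside [window_start, window_end)."""
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    done = sum(1 for t in completion_times if window_start <= t < window_end)
    return done / (window_end - window_start)


def per_function_throughput(completions, window_start, window_end):
    """completions maps function name -> list of completion timestamps (s)."""
    return {
        name: windowed_throughput(times, window_start, window_end)
        for name, times in completions.items()
    }


# Hypothetical completion timestamps: alu finishes quickly, while omp is slow
# and has stragglers that complete long after the window closes.
completions = {
    "alu": [i / 10 for i in range(1, 101)],
    "omp": [1.0, 2.0, 3.0, 4.0, 12.0, 20.0],
}
print(per_function_throughput(completions, 0.0, 10.0))
```

With the "all requests served" check, the omp stragglers would stretch the measured interval to 20 s and drag down the overall TP; the window keeps the interval fixed for every function.
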
HigashikataZhangsuke commented 2 months ago

One example: for Che+MLS testing, we should allocate proportionally: overall rate -> 16:1 ratio. And the selected rate should not exceed the limit of the system. Maybe try round robin first.
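
A small sketch of the proportional split, under stated assumptions: the thread gives only the 16:1 ratio, so which function gets the larger share, the overall rate of 34 req/s, and the peak values are all hypothetical, as are the helper names `split_rate` and `within_limits`.

```python
def split_rate(total_rps, weights):
    """Split an overall request rate across functions in proportion to weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: total_rps * w / total_weight for name, w in weights.items()}


def within_limits(rates, peak_rps):
    """True if every per-function rate is at or below its profiled peak."""
    return all(rates[name] <= peak_rps[name] for name in rates)


# Assumption: che:mls = 16:1 in the order written above; 34 req/s overall is
# an arbitrary example rate.
rates = split_rate(34.0, {"che": 16, "mls": 1})
print(rates)  # -> {'che': 32.0, 'mls': 2.0}

# Hypothetical peak throughputs, used only to show the limit check.
print(within_limits(rates, {"che": 40.0, "mls": 4.0}))  # -> True
```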

HigashikataZhangsuke commented 2 months ago

OK, should be fine. Try running the first test of ours, and then run MXFaaS's test.

HigashikataZhangsuke commented 2 months ago

Original problems:

  1. Privileged mode for our MXFaaS testing, since we limit the CPU (may not be used, but the mem NUMA could be one possible variable).
  2. The async testing method is not correct. Should fix this bug. Learn to use the author's original way of multithreading.

HigashikataZhangsuke commented 2 months ago

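The testing approach is wrong. I think we should test multiple concurrent requests, rather than having multiple threads in a single process send them sequentially. For now, handle one case with the single-process approach, then change to concurrent requests and test again. -> That should be fine as well. The current approach is basically round robin, and having multiple threads or processes each send their own requests would probably give about the same result. The advantage of the latter is that you only need to fix the rate for each sender; every sender is independent, so it is much simpler.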
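
A minimal sketch of the "independent senders" idea: one thread per function, each pacing itself at its own fixed rate. This is not the project's curl pod code; `run_sender`, `run_concurrent`, and `fake_send` are hypothetical, and `fake_send` stands in for the real HTTP call.

```python
import threading
import time


def run_sender(name, rate_rps, duration_s, send, log, lock):
    """Send requests for one function at a fixed rate, independent of others."""
    interval = 1.0 / rate_rps
    start = time.perf_counter()
    next_send = start
    while next_send - start < duration_s:
        delay = next_send - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        send(name)
        with lock:
            log.append((name, time.perf_counter() - start))
        next_send += interval


def run_concurrent(rates, duration_s, send):
    """Start one sender thread per function and wait for all of them."""
    log, lock = [], threading.Lock()
    threads = [
        threading.Thread(
            target=run_sender, args=(name, rate, duration_s, send, log, lock)
        )
        for name, rate in rates.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def fake_send(name):
    """Stand-in for the real request (hypothetical)."""
    time.sleep(0.001)


if __name__ == "__main__":
    log = run_concurrent({"che": 20, "mls": 5}, duration_s=1.0, send=fake_send)
    print(len(log), "requests sent")
```

Note that `send` blocks its own sender thread here, so a request that takes longer than the send interval makes that sender fall behind; a real load generator would hand each request to a worker pool or use async I/O.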

HigashikataZhangsuke commented 2 months ago

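Got the results. Will analyze them after getting up; if nothing is wrong, continue. One observation: sometimes MXFaaS does not use all of its cores. Something is probably wrong somewhere; check whether it was given too little initialization time? -> Not clear. This is very strange; it feels like it partly comes down to luck, perhaps because the test starts right after launch? This needs more careful study, but there is probably no time to dig into it now, so process the data we have first.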

HigashikataZhangsuke commented 2 months ago

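Also think about this: in your results, the shared part currently receives almost no requests. One possible reason is our approach of checking which side the request count leans toward: for example, if there are too many requests for the current number of CPUs, one extra core is allocated. That leaves no place where the shared part gets used. Another reason is that requests are handled by random sampling, so there is always some randomness. And if the ratio is close to 1, the chance of a request going to sh is not high anyway.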

HigashikataZhangsuke commented 2 months ago

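A quick idea: I suspect that even though the trace generation uses a single seed, some code still causes the results to drift. Check that now. If that is not it, look at the measured data and write an analysis script; if that is also OK, we can pick the rates and test the other functions. -> Result: the trace generation is consistent and the seed is the same. So stop hunting for the code bug for now; quickly check whether the data is sound and then produce the results.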
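
For reference, a sketch of the kind of consistency check described here: generate the trace twice with the same seed and compare. The thread does not say how traces are generated, so the Poisson arrivals and the `generate_trace` helper are assumptions for illustration only.

```python
import random


def generate_trace(rate_rps, duration_s, seed):
    """Return Poisson arrival times (seconds) for one function.

    A private random.Random(seed) is used so that other code touching the
    global random module cannot shift the sequence.
    """
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(rate_rps)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


first = generate_trace(rate_rps=50, duration_s=2.0, seed=42)
random.random()  # unrelated use of the global generator in between
second = generate_trace(rate_rps=50, duration_s=2.0, seed=42)
print(first == second)  # -> True
```

If the generator instead relied on `random.seed(...)` and the global functions, any other call into `random` between runs could change the trace, which is one common source of this kind of drift.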

HigashikataZhangsuke commented 2 months ago

Why did the high-request results disappear?? -> Check for bugs in the IsoFaaS code and in curl, and figure out why.

HigashikataZhangsuke commented 2 months ago

One guess: it was shut down before processing finished?

HigashikataZhangsuke commented 2 months ago

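Most likely because basicdata was not provided here. That happens because, when files are managed with git, large files and the like sometimes do not get uploaded.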

HigashikataZhangsuke commented 2 months ago

MLS was not receiving requests; checked curl. The trigger was not given. All results are OK now. Next, look at the SH side.

HigashikataZhangsuke commented 2 months ago

SH exit mechanism. Also, SH needs an added step that sets OsEnviron; otherwise the same hang as before will occur.

HigashikataZhangsuke commented 2 months ago

I do think there is something wrong with MXFaaS's nodecontroller. First, it says it schedules every 5 sec, but the actual interval is 10 sec. Second, the authors test the system for only one second, which means their test is already done before the first scheduling round finishes???? This is kind of meaningless.....
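
The arithmetic behind this concern, as a tiny sketch. It assumes the first scheduling round fires one full interval after the test starts; `completed_scheduling_rounds` is a hypothetical helper, not part of nodecontroller.py.

```python
def completed_scheduling_rounds(test_duration_s, schedule_interval_s):
    """Scheduling rounds that fire during a test of the given length.

    Assumes the first round fires one full interval after the test starts.
    """
    if schedule_interval_s <= 0:
        raise ValueError("schedule_interval_s must be positive")
    return int(test_duration_s // schedule_interval_s)


for interval_s in (5, 10):
    print(interval_s, completed_scheduling_rounds(1, interval_s))
# -> 5 0
# -> 10 0
```

Under that assumption, a 1 sec test sees zero scheduling rounds at either interval, while a 30 sec test would see 6 rounds at 5 sec and 3 rounds at 10 sec.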

HigashikataZhangsuke commented 2 months ago

Working on profiling data now.... For the MXFaaS bug, waiting for the author's reply.