HigashikataZhangsuke / IsoFaaS


0805 #3

Closed HigashikataZhangsuke closed 1 month ago

HigashikataZhangsuke commented 1 month ago

1. Finish debugging all functions and make the virtualized versions work. Then collect all the profiling data needed. -> expected time: 5+ h

HigashikataZhangsuke commented 1 month ago

For 1: need more time. Not only do I need to roll back the code version, but I also need to rework other parts, especially the perf-monitor parts.

HigashikataZhangsuke commented 1 month ago

For the logger we always keep containers warm, so there is no need to worry about the overhead of saving logs locally. We can download them after the whole trace has finished testing.

HigashikataZhangsuke commented 1 month ago

Add functionality to record CPU and other resource usage. Think about how to monitor throughput -> you need to record st, et, dt. For TP, compute (total successfully processed requests) / (total time), or check how many requests are successfully processed in a given time interval. For the peak TP of the trace, use the differential method.
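The TP bookkeeping above can be sketched as follows. This is a minimal sketch assuming each request is logged as a dict with `st`/`et` timestamps and an `ok` success flag; the record schema, function names, and the sliding-window approximation of the differential method are illustrative assumptions, not the repo's actual code:

```python
# Hedged sketch: throughput (TP) from per-request records {"st", "et", "ok"}.

def overall_tp(records, total_time):
    """TP = (total successfully processed requests) / total time."""
    done = [r for r in records if r["ok"]]
    return len(done) / total_time

def windowed_tp(records, t0, t1):
    """Successful requests whose end time falls in [t0, t1), over window length."""
    done = sum(1 for r in records if r["ok"] and t0 <= r["et"] < t1)
    return done / (t1 - t0)

def peak_tp(records, trace_end, window=1.0):
    """Approximate 'differential' peak TP: max windowed TP over sliding windows."""
    best, t = 0.0, 0.0
    while t < trace_end:
        best = max(best, windowed_tp(records, t, t + window))
        t += window / 2  # half-window stride for a finer sweep
    return best
```

Recording `st`/`et` at the request receiver keeps all three metrics derivable from one log, so no extra counters are needed in the hot path.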

Use a unified naming scheme for the Redis clients you need. Rewrite all your code and scripts.

Flask will maintain two queues; the split ratio is decided from profiling data plus the mask. So the structure now looks like this: (image attached)
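The two-queue idea above can be sketched as below, assuming the iso/share split ratio has already been derived from profiling data and the mask. The `TwoQueueRouter` name and the error-accumulator split are illustrative assumptions, not the repo's actual implementation:

```python
from collections import deque

class TwoQueueRouter:
    """Hedged sketch: split incoming requests between an iso queue and a
    share queue according to a precomputed ratio (profiling data + mask)."""

    def __init__(self, iso_ratio):
        self.iso_ratio = iso_ratio        # fraction of requests sent to iso
        self.iso, self.share = deque(), deque()
        self._acc = 0.0                   # error accumulator: smooth, burst-free split

    def enqueue(self, req):
        self._acc += self.iso_ratio
        if self._acc >= 1.0:              # iso's "credit" reached a whole request
            self._acc -= 1.0
            self.iso.append(req)
        else:
            self.share.append(req)
```

The accumulator interleaves the two queues (e.g. ratio 0.75 gives share, iso, iso, iso, repeating) instead of sending long runs to one side, which keeps both containers fed at the intended rate.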

Conflict between the continuously-listening mask and the Flask app... -> not a problem now.

HigashikataZhangsuke commented 1 month ago

1. Finish debugging all functions and make the virtualized versions work. Then collect all the profiling data needed. -> expected time: 5+ h

Currently finished the first function's EX-part modification. After dinner do two things: 1. Does all the code work? 2. Does the performance match the gap in the 0618 slides? -> Note: use functions with profiling data, since you need it for the Flask app to route requests.

If this works, first check whether the MLserve code works. If it does, you can simply port all functions to this pattern and try to get the profiling data.

And lastly, work on the sh part. The sh part will be more complicated, since you need to handle all the "leftover" requests.

With these finished it will already be around 11 pm. When back home, read papers and think carefully about your experiment design. You are the owner of this project; you want higher quality.

Remember to push your code to git. No local-only versions.

HigashikataZhangsuke commented 1 month ago

Carefully manage the Redis clients. Remember that when testing PPFaaS, you found that multiple functions sharing the same Redis client caused overload. For each function, prepare one data channel and one CPU-mask update channel for the iso-container. At the node level, prepare one update channel and one Flask channel per function to update the rate, plus one channel for your shared container. If the channels do not involve too many push operations, you can use one Redis client. But the request receivers need to run in separate VMs.
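A unified naming scheme for the channels listed above could look like the sketch below. All channel-name patterns (and the example function name in the usage) are assumptions for illustration, not the repo's actual scheme; actual pub/sub wiring and the separate-VM request receivers are out of scope here:

```python
# Hedged sketch: one place that generates every channel name, so the
# iso-container, shared container, and node-level code never hand-roll strings.

def iso_channels(func):
    """Per-function channels for the iso-container: data + CPU-mask updates."""
    return {
        "data": f"iso:{func}:data",
        "cpu_mask": f"iso:{func}:cpumask",
    }

def node_channels(funcs):
    """Node-level channels: one update channel, one Flask rate channel per
    function, and one channel for the shared container."""
    chans = {"update": "node:update", "share": "node:share"}
    for f in funcs:
        chans[f"flask_rate:{f}"] = f"node:flaskrate:{f}"
    return chans
```

With a single generator like this, a low-push-volume deployment can safely subscribe all channels from one Redis client, and switching to per-function clients later is a change in one module rather than a rewrite.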

HigashikataZhangsuke commented 1 month ago

Eventually found the reason. Next time, when writing code that runs torch on a single CPU, you need to set three things: the OMP and MKL environment variables and the torch thread count. If they are not 1, torch's own overhead makes the process and execution so slow that it looks stuck. Leaving this issue open since the sh part's code is unfinished. Read some papers today.
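The three settings above can be sketched as a small helper, assuming it runs before torch builds its thread pools (ideally before the first `import torch`); the helper name is an illustrative assumption:

```python
import os

def pin_single_cpu():
    """Pin OMP / MKL / torch to one thread for a single-CPU container.
    The env vars must be set before the OpenMP/MKL runtimes initialize,
    so call this before importing torch anywhere in the process."""
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    try:
        import torch
        torch.set_num_threads(1)  # third knob: torch's intra-op thread pool
    except ImportError:
        pass  # torch not installed; the env vars still cover OMP/MKL users

pin_single_cpu()
```

Without all three set to 1, the thread pools spin up more workers than the pinned CPU can run, and the contention makes execution look stuck.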

HigashikataZhangsuke commented 1 month ago

No need for Flask tuning. Same Redis, same Flask is OK. Think it through and you will know why.