Closed: aravindhank11 closed this issue 5 months ago
Hello! Thanks for your interest in Orion!
Can you please mention what is your PyTorch and CUDA version? PyTorch/CUDA version affects the number of kernels you have, and their profiles.
Could you also please try the '/root/orion/benchmarking/model_kernels/mobilenetv2_4_fwd' example? You can use something like:
{
  "arch": "mobilenet_v2",
  "kernel_file": "/root/orion/benchmarking/model_kernels/mobilenetv2_4_fwd",
  "num_kernels": 152,
  "num_iters": 12000,
  "args": {
    "model_name": "mobilenet_v2",
    "batchsize": 4,
    "rps": 40,
    "uniform": false,
    "dummy_data": true,
    "train": false
  }
}
Thank you for the quick turnaround, @fotstrt
I am using the provided docker container. So my versions are:
>>> import torch
>>> print(torch.__version__)
1.12.0a0+git67ece03
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
Further, I can confirm that using the following config works:
[
  {
    "arch": "mobilenet_v2",
    "kernel_file": "/root/orion/benchmarking/model_kernels/mobilenetv2_4_fwd",
    "num_kernels": 152,
    "num_iters": 100,
    "args": {
      "model_name": "mobilenet_v2",
      "batchsize": 4,
      "rps": 30,
      "uniform": true,
      "dummy_data": true,
      "train": false
    }
  }
]
Is there any reason why /root/orion/benchmarking/model_kernels/mobilenetv2_32_fwd did not work?
Hi! Thanks for checking!
Yes, unfortunately, the specific configuration file might have been misplaced since we don't use it anywhere (and forgot to remove it during cleanup). I will try to do a cleanup and remove/replace files accordingly asap. Also, if you want to profile your own models, please find instructions here: https://github.com/eth-easl/orion/blob/main/PROFILE.md.
Thank you for bringing this to my attention, and apologies for the inconvenience!
Thank you @fotstrt. I did try the steps to profile my model, and I suspect the torch and CUDA versions I profiled with were different from the ones in the docker container. So I am repeating the steps now.
But in the process, I am observing that the PyTorch build in the container was not compiled with NumPy support:
>>> preprocess(Image.open(image_path))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/vision/torchvision/transforms/transforms.py", line 94, in __call__
img = t(img)
File "/vision/torchvision/transforms/transforms.py", line 134, in __call__
return F.to_tensor(pic)
File "/vision/torchvision/transforms/functional.py", line 164, in to_tensor
img = torch.from_numpy(np.array(pic, mode_to_nptype.get(pic.mode, np.uint8), copy=True))
RuntimeError: PyTorch was compiled without NumPy support
I was using a preprocess function to create batches for the inference workload. Since NumPy was not available, I examined the Orion codebase, which seems to perform inference not in batches but one image after the other, as in: https://github.com/eth-easl/orion/blob/main/benchmarking/benchmark_suite/train_imagenet.py#L180-L192.
Is this intentional or am I understanding it wrong?
In our experimental setup, we are trying to simplify things to see what is really happening in the GPU, and examine all policies under tight cases where there is not a lot of preprocessing. However, we still use batches, but we prepare the torch tensors like here: https://github.com/eth-easl/orion/blob/main/benchmarking/benchmark_suite/train_imagenet.py#L36 (note that batch size is the first dimension of the tensor).
So we do it in batches, but the tensors are pre-made to avoid preprocessing times that might influence the performance and our conclusions. Does it make sense?
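The pre-made-batch convention described above can be sketched as follows. This is a minimal, dependency-free illustration (the helper name is made up): a plain shape tuple stands in for the `torch.rand` tensor that train_imagenet.py actually builds, with the batch size as the first dimension.

```python
# Minimal sketch of the pre-made batch convention: the batch size is the
# FIRST dimension of the input tensor. In Orion's benchmark this would be
# something like torch.rand([batchsize, 3, 224, 224]); here a plain tuple
# stands in for the tensor shape so the snippet has no dependencies.

def dummy_batch_shape(batchsize, channels=3, height=224, width=224):
    # Hypothetical helper, not part of Orion.
    return (batchsize, channels, height, width)

print(dummy_batch_shape(4))  # a batch of 4 RGB 224x224 images
```

Because the tensor is allocated once up front, no per-image preprocessing happens inside the measured loop.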
Cool, this is perfect :) Thank you for the patient responses!
If I were to write my own inference function with code as simple as the one at https://pytorch.org/hub/pytorch_vision_mobilenet_v2/, would I have to consider anything else to integrate with Orion (apart from profiling and creating a config file)?
As I see at https://github.com/eth-easl/orion/blob/main/benchmarking/benchmark_suite/train_imagenet.py#L79-L83, there seem to be other parameters, such as local_rank, barriers, client_barrier, tid, and input_file,
which do not come from the config file args as in https://github.com/eth-easl/orion/blob/main/artifact_evaluation/example/config.json.
Is there any guide on how to configure and use these variables?
I recommend having a look at this file: https://github.com/eth-easl/orion/blob/main/benchmarking/launch_jobs.py. You can use it as a test with e.g. 1 model, or with more to test interference. It will basically spawn a thread for each model/script. You can find how the arguments are passed here: https://github.com/eth-easl/orion/blob/main/benchmarking/launch_jobs.py#L81
I hope that helps!
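The thread-per-model launch described above can be sketched as follows. This is a hypothetical illustration (all names are made up, not Orion's real API): one worker thread per config entry, synchronized on a shared barrier so all clients start submitting work at the same time.

```python
import threading

# Hypothetical sketch of a launcher in the spirit of launch_jobs.py:
# one worker thread per model config, all waiting on a shared barrier
# so every client starts at the same time.

def run_model(tid, config, start_barrier, results):
    # In Orion this would load the kernel file and submit the workload;
    # here we just record that the client started.
    start_barrier.wait()  # block until every client is ready
    results[tid] = f"{config['arch']} (client {tid}) started"

def launch(config_list):
    barrier = threading.Barrier(len(config_list))
    results = {}
    threads = [
        threading.Thread(target=run_model, args=(tid, cfg, barrier, results))
        for tid, cfg in enumerate(config_list)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

configs = [{"arch": "mobilenet_v2"}, {"arch": "bert"}]
print(launch(configs))
```

Per-client parameters like tid and the barriers are supplied by the launcher at spawn time, which is why they do not appear in the config file's args.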
Thank you for all the help! I got a toy example up and running :) Much appreciated and amazing work on this!! I enjoyed reading the paper and trying it out!
If I may, I have another question: Is the first job in the config_list always regarded as high priority, with the others assumed to be best-effort tasks?
Hey, sorry for the late reply!
No, actually the last job in the config_list is high-priority, and all others are best-effort. For example, here: https://github.com/eth-easl/orion/blob/main/artifact_evaluation/fig7/config_files/bert_mnet.json the MobileNet inference job is the high-priority one. Also, our current version in this repo works with 2 clients.
(we have implemented and tested Orion with more clients but haven't merged yet. We hope to do it soon!)
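The priority convention above can be made concrete with a small sketch (the function name and labels are made up for illustration): the last entry of config_list is treated as high-priority and every other entry as best-effort.

```python
# Hypothetical illustration of the priority convention: the LAST entry
# of config_list is high-priority, all others are best-effort.

def label_priorities(config_list):
    labels = ["best-effort"] * len(config_list)
    if config_list:
        labels[-1] = "high-priority"
    return list(zip([c["arch"] for c in config_list], labels))

configs = [{"arch": "bert"}, {"arch": "mobilenet_v2"}]
print(label_priorities(configs))
# mobilenet_v2, the last entry, is the high-priority job
```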
Thank you, that makes sense. I have been using Orion since yesterday with 3 clients. It seems to be working, though I am unsure if it is working as expected. Can you please let me know if there are changes to the shared library?
I wrote my own wrapper, like PyScheduler, to suit my needs.
Yes, I will update you! I would recommend testing Orion with 2 clients first, and checking the scripts we have under https://github.com/eth-easl/orion/tree/main/artifact_evaluation (e.g. run https://github.com/eth-easl/orion/blob/main/artifact_evaluation/fig7/run_orion.py, which collocates a high-priority inference job with a best-effort training job) to see the expected behavior of the system.
Thank you! I shall close the issue. Thank you for all the help though!
Do you mind if I create a new issue to track using Orion with 2+ clients? That way you can mark it closed once done, and I can start using it.
Of course! Also please let me know if there are any problems with the 2-client setups!
Thanks again for your interest in Orion!
Last question, and sorry for asking so many: from the configs you pointed out, there seems to be a usage of additional_kernel_file. Could you please let me know how to build this and how it is used?
The specific file is used when there is a training job: we observed that the kernels in the 1st iteration are different from those in the rest, so we needed to profile and generate 1 extra file. All files are included under config_files. If you are not interested in training, you can have a look at the files here: https://github.com/eth-easl/orion/tree/main/artifact_evaluation/fig10
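For illustration only, a training-job entry might then carry both files, something like the sketch below. The paths and kernel count here are entirely made up; the real values come from profiling and the repo's config_files.

```json
{
  "arch": "mobilenet_v2",
  "kernel_file": "/path/to/my_model_train_kernels",
  "additional_kernel_file": "/path/to/my_model_train_iter1_kernels",
  "num_kernels": 152,
  "args": {
    "model_name": "mobilenet_v2",
    "batchsize": 4,
    "train": true
  }
}
```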
Superb. Yes, I am interested in inference workloads. Closing this issue. Thank you again!
I am trying out Orion with various configurations based on the given example. The example given at
https://github.com/eth-easl/orion/tree/main/artifact_evaluation/example
works well. However, the same for mobilenet_v2 does not seem to work.

Environment:
Config used:
Error state:
It does not make any progress past this point. Am I configuring things wrong? I have made no changes to the kernel info file at
/root/orion/benchmarking/model_kernels/mobilenetv2_32_fwd
Any help or pointers would be greatly appreciated :)