aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

LARGEST INSTRUCTION COUNTS for large size input channel group conv2d #780

Closed DaeyangCho closed 1 week ago

DaeyangCho commented 1 year ago

Hi, I would like to share a problem that occurred while compiling my model with Neuron. For the torch.nn.functional.conv2d function, when the groups value is 2 or more and the input channel count is 256 or more, the compiler emits the LARGEST INSTRUCTION COUNTS log and takes too long to compile successfully. In addition, an arithmetic operation must be applied to the conv2d output tensor for the phenomenon to occur.

If the group size is 1, or the input channel count is small (even if the output channel count is large), the phenomenon does not occur.

Here is a sample to reproduce the issue:

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x, w):
        groups = x.size(1) // w.size(1)
        x = torch.nn.functional.conv2d(x, w, groups=groups)
        x = x + 1  # If I delete this line, the compile succeeds.
        return x

in_channels = 1024
out_channels = 1024
groups = 4
x = torch.ones([1, in_channels, 12, 12])
w = torch.ones([out_channels, in_channels // groups, 3, 3])
model = MyModel()
model.eval()

torch.neuron.analyze_model(model, example_inputs=[x, w])
model_neuron = torch.neuron.trace(model, example_inputs=[x, w])
model_neuron.save('neuron.pt')
print("Neuron compile success")

The reason I am raising this phenomenon as an issue is that I confirmed compilation is much faster (and the LARGEST INSTRUCTION COUNTS log disappears!) when the group size is set to 1 and the input and weight tensors are processed in chunks. I hope this report helps improve the Neuron compiler.
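For reference, the chunking workaround can be sketched in pure PyTorch like this. This is only an illustrative helper of my own (`chunked_group_conv2d` is not part of any Neuron or PyTorch API): it replaces one grouped convolution with `groups` independent `groups=1` convolutions, which is mathematically equivalent.

```python
import torch
import torch.nn.functional as F

def chunked_group_conv2d(x, w, groups):
    """Emulate conv2d(x, w, groups=groups) using only groups=1 convolutions.

    Illustrative helper (an assumption, not a Neuron API): split the input
    channels and the weight's output-channel filters into per-group chunks,
    convolve each pair with an ordinary conv2d, then concatenate.
    """
    x_chunks = x.chunk(groups, dim=1)  # split input channels into groups
    w_chunks = w.chunk(groups, dim=0)  # split output-channel filters likewise
    outs = [F.conv2d(xc, wc) for xc, wc in zip(x_chunks, w_chunks)]
    return torch.cat(outs, dim=1)

# Sanity check against the native grouped convolution.
groups = 4
x = torch.randn(1, 64, 12, 12)
w = torch.randn(64, 64 // groups, 3, 3)
reference = F.conv2d(x, w, groups=groups)
chunked = chunked_group_conv2d(x, w, groups)
print(torch.allclose(reference, chunked, atol=1e-4))
```

Tracing a model that calls the helper avoids ever handing the compiler a large-channel grouped conv2d, at the cost of launching `groups` separate convolutions.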

jluntamazon commented 1 year ago

Hi @DaeyangCho,

We were able to reproduce the problem and will see if we can implement a fix.

The compilation succeeds when the x = x + 1 line is removed because Neuron falls back to CPU, since the graph is considered too small. This is controlled by the minimum_segment_size parameter, which by default requires 2 operations in order to compile for Neuron. For example, the same failing behavior is observed when you set this parameter to 1:

model_neuron = torch.neuron.trace(model, example_inputs=[x, w], minimum_segment_size=1)

For more info on how operations are partitioned to CPU or Neuron, see the trace API docs: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuron/api-compilation-python-api.html

In particular, some of the arguments that control partitioning between CPU and Neuron devices are: fallback, minimum_segment_size, and single_fusion_ratio_threshold.

DaeyangCho commented 1 year ago

Thanks for the kind reply! I'll check the trace API docs.

delongmeng-aws commented 1 week ago

Hi @DaeyangCho, it looks like your issue has been addressed. Do you have any further questions? Otherwise, we will close this issue.

DaeyangCho commented 1 week ago

Thanks for the reminder. I will close the issue.