HazyResearch / butterfly

Butterfly matrix multiplication in PyTorch
Apache License 2.0

ModuleNotFoundError: No module named '__main__.complex_utils'; '__main__' is not a package #14

Open Dhrubauk opened 4 years ago

Dhrubauk commented 4 years ago

Hello, I am getting this error when I ran the "butterfly_multiply" module:

    ModuleNotFoundError                       Traceback (most recent call last)
    in
          4 from torch import nn
          5
    ----> 6 from .complex_utils import complex_mul
          7
          8 use_extension = True

    ModuleNotFoundError: No module named '__main__.complex_utils'; '__main__' is not a package

Can you please help me solve the problem?

Thanks,
Dhruba
tridao commented 4 years ago

Are you running this from an interactive session (e.g. ipython)? If so, relative imports (from .blah import blah) don't work. This is an issue with Python imports in general (https://realpython.com/absolute-vs-relative-python-imports/). If you just need to run this file from an interactive session, you can remove the dot (so from complex_utils import complex_mul). But in general we mostly run the training scripts, which then call butterfly_multiply.
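
For example, a minimal sketch of that workaround from an interactive session (the sys.path line is only needed if you start the session outside the butterfly directory; the path shown is a placeholder):

    import sys
    sys.path.append('/path/to/butterfly/butterfly')  # placeholder: directory containing complex_utils.py

    # absolute import works interactively, unlike "from .complex_utils import complex_mul"
    from complex_utils import complex_mul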

Dhrubauk commented 4 years ago

Hi tridao, thank you for the reply. A quick question: may I know in which part of the code the butterfly matrix is generated? Or do you have any flowchart or slides related to this? Your help in this concern is highly appreciated.

Regards,
Dhruba

tridao commented 4 years ago

The butterfly matrix isn't explicitly generated (as that would use O(n^2) space and take O(n^2) time to multiply by a vector). Instead, we provide a Butterfly class that stores O(n log n) parameters and implements an O(n log n) multiplication algorithm (https://github.com/HazyResearch/learning-circuits/blob/master/butterfly/butterfly.py#L76).
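
A minimal usage sketch (hedged: the exact import path depends on how you check out the repo, but the constructor arguments follow the class linked above):

    import torch
    from butterfly import Butterfly  # import path may differ in your checkout

    layer = Butterfly(1024, 1024)  # stores O(n log n) parameters, never the full 1024x1024 matrix
    x = torch.randn(8, 1024)       # batch of 8 input vectors
    y = layer(x)                   # O(n log n) butterfly multiplication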

Dhrubauk commented 4 years ago

In a butterfly connection, each node is connected to two other nodes, so why is it not 2n log n? Also, can you please tell me where in the code you multiply the weight vector by the input features, and where you update the vector of weights? Thanks, Dhruba

tridao commented 4 years ago

You're right, it's 2n log n parameters. Remember to pass in tied_weight=False, like Butterfly(1024, 1024, tied_weight=False). The version with tied weights was used in an earlier paper, where we only had 4n parameters. The O(n log n) multiplication algorithm is implemented 3 times in 3 different languages:

  1. In Python (pure PyTorch), for ease of understanding: https://github.com/HazyResearch/learning-circuits/blob/cnn/butterfly/butterfly_multiply.py#L126 The actual implementation is only 4-5 lines of Python (a hedged sketch follows this list).
  2. In C++, for efficient inference: https://github.com/HazyResearch/learning-circuits/blob/cnn/butterfly/factor_multiply/factor_multiply.cpp#L463 with the Python interface here: https://github.com/HazyResearch/learning-circuits/blob/cnn/butterfly/butterfly_multiply.py#L165
  3. In CUDA, for efficient training: https://github.com/HazyResearch/learning-circuits/blob/cnn/butterfly/factor_multiply/factor_multiply_cuda.cu#L1241 with the same Python interface as 2.
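
To make item 1 concrete, here is a hedged pure-PyTorch sketch of the O(n log n) multiply. The (log2(n), n/2, 2, 2) twiddle layout is an assumption made for this sketch; the repo's butterfly_mult_torch uses its own (similar, but not identical) layout:

    import math
    import torch

    def butterfly_mult(twiddle, x):
        # twiddle: (log2(n), n/2, 2, 2) stack of 2x2 blocks -- hypothetical layout for this sketch
        # x: (batch, n) with n a power of 2
        batch, n = x.shape
        m = int(math.log2(n))
        y = x
        for step in range(m):
            stride = 1 << step
            # Pair up entries that are `stride` apart within blocks of size 2*stride.
            y = y.reshape(batch, n // (2 * stride), 2, stride)
            t = twiddle[step].view(n // (2 * stride), stride, 2, 2)
            u, v = y[:, :, 0], y[:, :, 1]
            # Apply each 2x2 block: (u, v) -> (t00*u + t01*v, t10*u + t11*v)
            y = torch.stack([t[..., 0, 0] * u + t[..., 0, 1] * v,
                             t[..., 1, 0] * u + t[..., 1, 1] * v], dim=2).reshape(batch, n)
        return y

Each of the log2(n) steps applies n/2 blocks of 4 entries each, which is where the 2n log n parameter count above comes from.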

The weights are stored in the twiddle attribute of the Butterfly class, which is just a PyTorch tensor. To update it, you pass it to a PyTorch optimizer just like any other trainable weight.
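
Continuing the usage sketch from above (the loss and sizes here are placeholders; any standard optimizer works):

    import torch
    from butterfly import Butterfly  # import path may differ in your checkout

    model = Butterfly(1024, 1024, tied_weight=False)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # twiddle is a registered parameter

    x = torch.randn(8, 1024)
    target = torch.randn(8, 1024)
    loss = torch.nn.functional.mse_loss(model(x), target)  # placeholder loss
    loss.backward()   # fills model.twiddle.grad
    optimizer.step()  # updates twiddle like any other weight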

tridao commented 4 years ago

If you're asking how to compute the gradient with respect to the weights:

  1. In the pure PyTorch implementation, we rely on PyTorch's automatic differentiation to compute gradients.
  2. In the C++ and CUDA implementations, we compute the gradient explicitly and provide a Python interface. If you're just using the Butterfly class, you do not need to know how backward is implemented. In any case, the algorithm is simple: treat the butterfly matrix as a product of log n matrices, which you can think of as log n layers, then run back-propagation/autodiff on it. This amounts to a few for loops (a hedged Python sketch follows this list). The core implementation is about 20 lines of C++ (https://github.com/HazyResearch/learning-circuits/blob/cnn/butterfly/factor_multiply/factor_multiply.cpp#L733).
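
To make those for loops concrete, here is a hedged sketch of the explicit backward pass, in Python rather than C++, reusing the hypothetical twiddle layout from the forward sketch above. It assumes intermediates[k] is the input to butterfly step k, saved during the forward pass:

    import torch

    def butterfly_mult_backward(twiddle, intermediates, grad_out):
        # twiddle: (log2(n), n/2, 2, 2); grad_out: (batch, n) gradient wrt the output
        batch, n = grad_out.shape
        m = twiddle.shape[0]
        d_twiddle = torch.zeros_like(twiddle)
        grad = grad_out
        for step in reversed(range(m)):  # walk the log n "layers" in reverse
            stride = 1 << step
            g = grad.reshape(batch, n // (2 * stride), 2, stride)
            xs = intermediates[step].reshape(batch, n // (2 * stride), 2, stride)
            t = twiddle[step].view(n // (2 * stride), stride, 2, 2)
            gu, gv = g[:, :, 0], g[:, :, 1]
            u, v = xs[:, :, 0], xs[:, :, 1]
            # Gradient wrt each 2x2 entry: outer products of (gu, gv) and (u, v), summed over the batch
            d = torch.stack([torch.stack([(gu * u).sum(0), (gu * v).sum(0)], dim=-1),
                             torch.stack([(gv * u).sum(0), (gv * v).sum(0)], dim=-1)], dim=-2)
            d_twiddle[step] = d.reshape(n // 2, 2, 2)
            # Gradient wrt this step's input: multiply by the transpose of each 2x2 block
            grad = torch.stack([t[..., 0, 0] * gu + t[..., 1, 0] * gv,
                                t[..., 0, 1] * gu + t[..., 1, 1] * gv], dim=2).reshape(batch, n)
        return d_twiddle, grad
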
Dhrubauk commented 4 years ago

Okay, great. I am kind of confused: how did you calculate the loss, or which loss function did you actually use?

tridao commented 4 years ago

It can be used with any differentiable loss function. In back-propagation/reverse-mode autodiff, for each layer we are given the incoming error/gradient from the layers above, and use it to compute the gradient with respect to the parameters of the current layer. That's exactly the interface of a custom PyTorch layer (https://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html): in the backward, given the gradient with respect to the output of the layer, compute the gradient with respect to the parameters of the layer.
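
As a toy illustration of that interface (not the repo's code; ScaleByW is a made-up example layer), a custom torch.autograd.Function only has to map the incoming gradient to gradients for its inputs and parameters:

    import torch

    class ScaleByW(torch.autograd.Function):
        # Toy layer y = w * x, just to show the backward contract.
        @staticmethod
        def forward(ctx, x, w):
            ctx.save_for_backward(x, w)
            return w * x

        @staticmethod
        def backward(ctx, grad_output):
            # Given dL/dy, return dL/dx and dL/dw.
            x, w = ctx.saved_tensors
            return grad_output * w, (grad_output * x).sum()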

Dhrubauk commented 4 years ago

Hi tridao, I use Windows and Jupyter Notebook, and I am having trouble with ray. Can I run your code with any modification, or do I have to change my OS?

tridao commented 4 years ago

We only use ray to run the experiments (parallel runs and hyperparameter optimization). The Butterfly class (in the butterfly directory) doesn't require ray.