juntao opened 1 year ago
flows summarize
Hello, I am a code review bot on flows.network. Here are my reviews of code commits in this PR.
Key changes in this patch involve adding a new unit test case for shard checkpoint loading in AutoTP (Automatic Tensor Parallelism) and modifying the `replace_module.py` file.

- A new unit test case, `TestCheckpointShardinAutoTP`, has been added to the `test_checkpoint_sharding.py` file. The test downloads model checkpoints, writes a JSON file with the checkpoint paths, and then loads the model on meta tensors while initializing inference.
- A new entry, `OPTLearnedPositionalEmbedding`, has been added to the `load_layers` list in `replace_module.py`.
- The `load_buffer` function has been modified to copy the buffer data only if the data is not already in the destination buffer: it checks whether the `src` and `dst` tensors and their data types are the same before copying.
- `_replace_module` has been updated with a new variable, `prev_class_name`, and some conditional statements regarding `class_name`.
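The guard described for `load_buffer` can be sketched roughly as follows (a simplified, illustrative version; the actual DeepSpeed code may differ):

```python
import torch

def load_buffer_guarded(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Copy src into dst unless dst already holds this exact data.

    Mirrors the described check: skip the copy when src and dst are the
    same underlying tensor with the same dtype. (Illustrative sketch.)
    """
    if dst.data_ptr() == src.data_ptr() and dst.dtype == src.dtype:
        return dst  # already the same buffer; nothing to copy
    dst.data.copy_(src.data)  # in-place copy of the buffer contents
    return dst
```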
Potential problems:

- The check that the `src` and `dst` tensors have the same data type before copying, as implemented in the `copy` method, might lead to data loss if the types do not match.
- The changes in `_replace_module` depend on the module naming convention and could produce unexpected behavior if that convention changes in the future.
- Importing the `transformers` package to set `OPTLearnedPositionalEmbedding` could cause issues if `transformers` is not installed, or if an older version is used that does not provide the `.models.opt.modeling_opt` module.
- The newly added test has no assertions to verify the outputs, so it may fail to catch issues in the implementation.
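On the last point, an output assertion along these lines would strengthen the test (a hedged sketch; `base_out` and `ds_out` are hypothetical names for the baseline and DeepSpeed-loaded model outputs):

```python
import torch

def check_outputs(base_out: torch.Tensor, ds_out: torch.Tensor) -> None:
    # Compare the DeepSpeed-loaded model's output against a baseline run.
    assert base_out.shape == ds_out.shape, "output shapes diverged"
    assert torch.allclose(base_out.float(), ds_out.float(), atol=1e-2), \
        "outputs diverged beyond tolerance"
```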
Hello, I am a code review bot on flows.network. Here are my reviews of changed source code files in this PR.
deepspeed/module_inject/replace_module.py
Here are some potential issues and suggestions for improvements in the provided code:
- In the `ReplaceWithTensorSlicing` class, the `merge_assert()` method takes two arguments, `dim1` and `dim2`, followed by an assert statement. Consider adding custom error messages to help the user understand what caused the error.
- In the `ReplaceWithTensorSlicing` class, the `strided_copy()` and `copy()` methods have many nested if statements. Breaking these methods down into smaller, more specific methods can improve the readability and maintainability of the code.
- In the `strided_copy()` method, the line `dst = dst.reshape(-1).data.copy_(src.data.reshape(-1)).reshape(src.shape)` can be broken down into multiple steps to improve readability.
- In the `GroupQuantizer` class, the `quantize()` method performs several tensor reshapes and transformations. Consider breaking this method into smaller helper functions to improve readability and maintainability.
- In the `replace_transformer_layer()` function, the global variable `container_g` is used. Global variables can cause issues related to variable scope and data sharing between functions; consider refactoring the code to avoid them.
- Several functions in the code, such as `replace_with_policy()`, `replace_wo_policy()`, and other nested functions, have many parameters and contain a significant amount of logic. Consider breaking these functions down into smaller helpers to improve readability and maintainability.
- In the `replace_transformer_layer()` function, some variables are initialized but never used (for example, `linear_layer_setting`, `micro_batch_size`, `seed`, and `local_rank`). It is better to remove the unused variables, or comment them out if they will be needed later.
- Consider using more descriptive variable names in some cases; for example, `mp_replace` could become `tensor_slicing_replacer`. More meaningful names make the code more readable and easier to understand for other developers.
- Add docstrings and type hints to functions and methods, especially those with complex logic or multiple parameters, to help users and other developers understand their purpose and how to use them correctly.
- Some parts of the code have deep levels of indentation. To improve readability, consider using helper functions to break the code into smaller, more manageable pieces.
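For example, the flagged one-liner in `strided_copy()` could be unpacked into named steps, with a custom message on the size check as suggested for `merge_assert()` (an illustrative refactor, not the PR's actual code):

```python
import torch

def copy_flat(dst: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    # Size check with a descriptive error message.
    assert dst.numel() == src.numel(), (
        f"cannot copy: dst has {dst.numel()} elements, src has {src.numel()}")
    flat_dst = dst.reshape(-1)           # view dst as one dimension
    flat_dst.copy_(src.reshape(-1))      # element-wise copy of src's data
    return flat_dst.reshape(src.shape)   # give the result src's shape
```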
The key changes in the provided patch are:

- In the `copy()` function, a more concise assignment for `dst` is used.
- A `replaced` attribute is set to True after replacing children in the `_replace()` function within the `GroupQuantizer` class.
- The `_replace_module()` function includes a new argument, `prev_class_name`, and changes some variable settings to facilitate better renaming of modules.
- `OPTLearnedPositionalEmbedding` from the `transformers` package is added as one of the `load_layers` in the `_replace_module()` function, improving compatibility.
- In the `load()` function, `strided_copy()` is used instead of the `qkv_copy()` function, with the parameter `num_splits` set to 3.

tests/unit/inference/test_checkpoint_sharding.py
In general, the code is quite readable and well-structured. There are a few observations and areas where improvements can be made:
- In the `save_shard` class, it's better to add an `__init__` method to initialize class variables like `world_size`. Also, if `class_tmpdir` is only initialized once, it's better to store it as a class variable too.
- In the `save_shard` class's `run` method: please add a docstring to provide a brief description of the method's functionality and its arguments.
- In the `TestCheckpointShard` and `TestCheckpointShardinAutoTP` classes, it's better to add an `__init__` method to initialize the `world_size` class variable, as in the `save_shard` class.
- In the `test` method of the `TestCheckpointShardinAutoTP` class, the `write_checkpoints_json` function may be better suited as a separate utility function outside the class, as it does not access any instance variables of the class. Additionally, please add a docstring to provide a brief description of the function's functionality and its arguments.
- For better code readability and organization, it may be a good idea to separate import statements by category. Keep standard library imports, third-party imports, and application-specific imports in separate groups.
Overall, the code is well-written with proper comments and structured format. Making these few improvements should help in enhancing readability and maintainability.
The patch introduces a new class, `TestCheckpointShardinAutoTP`, that inherits from `DistributedTest` with a `world_size` of 2. Its purpose is to test checkpoint sharding functionality in AutoTP.

Key changes in the patch:

- Additional modules are imported: `deepspeed.comm` as `dist`, `huggingface_hub.snapshot_download`, and `transformers.utils.is_offline_mode`.
- A new method, `write_checkpoints_json`, is introduced in the `test` method of the `TestCheckpointShardinAutoTP` class. This method writes checkpoint JSON files to the specified directory. It downloads the required model only on the first process and ignores certain file types when downloading: "*.safetensors", "*.msgpack", and "*.h5".
- `inf_config` is updated with different settings for the `replace_with_kernel_inject` and `checkpoint` properties.
- The model is loaded on meta tensors in a new way, using `AutoConfig.from_pretrained` and wrapping the model creation in a `deepspeed.OnDevice` context manager with dtype set to `torch.bfloat16`.

Overall, this patch expands the test coverage by introducing a new test class for checkpoint sharding functionality with AutoTP. The new class includes an additional function to write checkpoint JSONs and uses a different config and model-loading process on meta tensors.
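The meta-tensor construction that `deepspeed.OnDevice` provides can be sketched with plain PyTorch (illustrative only; requires PyTorch >= 2.0, and the `nn.Linear` here is a stand-in for the real model, whose weights would later be filled in from the checkpoint shards):

```python
import torch
import torch.nn as nn

# Build a module on the "meta" device: parameters carry shape and dtype
# but allocate no storage, so even very large models construct instantly.
with torch.device("meta"):
    layer = nn.Linear(4096, 4096, dtype=torch.bfloat16)

assert layer.weight.is_meta            # no real memory was allocated
assert layer.weight.dtype == torch.bfloat16
```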
cc https://github.com/microsoft/DeepSpeed/pull/3457