juntao opened 1 year ago
flows summarize
Hello, I am a code review bot on flows.network. Here are my reviews of code commits in this PR.
Key changes in this patch involve adding a new unit test case for shard checkpoint loading in AutoTP (Automatic Tensor Parallelism) and modifying the `replace_module.py` file.

- A new unit test case, `TestCheckpointShardinAutoTP`, has been added to the `test_checkpoint_sharding.py` file. The test downloads model checkpoints, writes a JSON file with the checkpoint paths, and then loads the model on meta tensors while initializing inference.
- A new entry, `OPTLearnedPositionalEmbedding`, has been added to the `load_layers` list in `replace_module.py`.
- The `load_buffer` function has been modified to copy the buffer data only if the data is not already in the destination buffer: it checks whether the `src` and `dst` tensors and their data types are the same before copying.
- `_replace_module` has been updated with a new variable, `prev_class_name`, and some conditional statements regarding `class_name`.
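The guard described for `load_buffer` can be sketched roughly as follows (a simplified, illustrative version; the actual DeepSpeed code may differ):

```python
import torch

def load_buffer_guarded(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Copy src into dst unless dst already holds this exact data.

    Mirrors the described check: skip the copy when src and dst are the
    same underlying tensor with the same dtype. (Illustrative sketch.)
    """
    if dst.data_ptr() == src.data_ptr() and dst.dtype == src.dtype:
        return dst  # already the same buffer; nothing to copy
    dst.data.copy_(src.data)  # in-place copy of the buffer contents
    return dst
```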
Potential problems:

- The check that the `src` and `dst` tensors have the same data type before copying, as implemented in the `copy` method, might lead to data loss if the types do not match.
- The changes in `_replace_module` depend on the module naming convention and could produce unexpected behavior if that convention changes in the future.
- Importing the `transformers` package to set `OPTLearnedPositionalEmbedding` could cause issues if `transformers` is not installed, or if an older version is used that does not provide the `.models.opt.modeling_opt` module.
- The newly added test has no assertions to verify the outputs, so it may fail to catch issues in the implementation.
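On the last point, an output assertion along these lines would strengthen the test (a hedged sketch; `base_out` and `ds_out` are hypothetical names for the baseline and DeepSpeed-loaded model outputs):

```python
import torch

def check_outputs(base_out: torch.Tensor, ds_out: torch.Tensor) -> None:
    # Compare the DeepSpeed-loaded model's output against a baseline run.
    assert base_out.shape == ds_out.shape, "output shapes diverged"
    assert torch.allclose(base_out.float(), ds_out.float(), atol=1e-2), \
        "outputs diverged beyond tolerance"
```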
Hello, I am a code review bot on flows.network. Here are my reviews of changed source code files in this PR.
deepspeed/module_inject/replace_module.py
Here are some potential issues and suggestions for improvements in the provided code:
- In the `ReplaceWithTensorSlicing` class, the `merge_assert()` method takes two arguments, `dim1` and `dim2`, followed by an assert statement. Consider adding custom error messages to help the user understand what caused the error.
- In the `ReplaceWithTensorSlicing` class, the `strided_copy()` and `copy()` methods have many nested if statements. Breaking these methods down into smaller, more specific methods can improve the readability and maintainability of the code.
- In the `strided_copy()` method, the line `dst = dst.reshape(-1).data.copy_(src.data.reshape(-1)).reshape(src.shape)` can be broken down into multiple steps to improve readability.
- In the `GroupQuantizer` class, the `quantize()` method performs several tensor reshapes and transformations. Consider breaking this method into smaller helper functions to improve readability and maintainability.
- In the `replace_transformer_layer()` function, the global variable `container_g` is used. Global variables can cause issues related to variable scope and data sharing between functions; consider refactoring the code to avoid them.
- Several functions in the code, such as `replace_with_policy()`, `replace_wo_policy()`, and other nested functions, have many parameters and contain a significant amount of logic. Consider breaking these functions down into smaller helpers to improve readability and maintainability.
- In the `replace_transformer_layer()` function, some variables are initialized but never used (for example, `linear_layer_setting`, `micro_batch_size`, `seed`, and `local_rank`). It is better to remove the unused variables, or comment them out if they will be needed later.
- Consider using more descriptive variable names in some cases; for example, `mp_replace` could become `tensor_slicing_replacer`. More meaningful names make the code more readable and easier to understand for other developers.
- Add docstrings and type hints to functions and methods, especially those with complex logic or multiple parameters, to help users and other developers understand their purpose and how to use them correctly.
- Some parts of the code have deep levels of indentation. To improve readability, consider using helper functions to break the code into smaller, more manageable pieces.
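For example, the flagged one-liner in `strided_copy()` could be unpacked into named steps, with a custom message on the size check as suggested for `merge_assert()` (an illustrative refactor, not the PR's actual code):

```python
import torch

def copy_flat(dst: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    # Size check with a descriptive error message.
    assert dst.numel() == src.numel(), (
        f"cannot copy: dst has {dst.numel()} elements, src has {src.numel()}")
    flat_dst = dst.reshape(-1)           # view dst as one dimension
    flat_dst.copy_(src.reshape(-1))      # element-wise copy of src's data
    return flat_dst.reshape(src.shape)   # give the result src's shape
```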
The key changes in the provided patch are:

- In the `copy()` function, a more concise assignment for `dst` is used.
- A `replaced` attribute is set to True after replacing children in the `_replace()` function within the `GroupQuantizer` class.
- The `_replace_module()` function includes a new argument, `prev_class_name`, and changes some variable settings to facilitate better renaming of modules.
- `OPTLearnedPositionalEmbedding` from the `transformers` package is added as one of the `load_layers` in the `_replace_module()` function, improving compatibility.
- In the `load()` function, `strided_copy()` is used instead of the `qkv_copy()` function, with the parameter `num_splits` set to 3.

tests/unit/inference/test_checkpoint_sharding.py
In general, the code is quite readable and well-structured. There are a few observations and areas where improvements can be made:
- In the `save_shard` class, it's better to add an `__init__` method to initialize class variables like `world_size`. Also, if `class_tmpdir` is only initialized once, it's better to store it as a class variable too.
- In the `save_shard` class's `run` method: please add a docstring to provide a brief description of the method's functionality and its arguments.
- In the `TestCheckpointShard` and `TestCheckpointShardinAutoTP` classes, it's better to add an `__init__` method to initialize the `world_size` class variable, as in the `save_shard` class.
- In the `test` method of the `TestCheckpointShardinAutoTP` class, the `write_checkpoints_json` function may be better suited as a separate utility function outside the class, as it does not access any instance variables of the class. Additionally, please add a docstring to provide a brief description of the function's functionality and its arguments.
- For better code readability and organization, it may be a good idea to separate import statements by category. Keep standard library imports, third-party imports, and application-specific imports in separate groups.
Overall, the code is well-written with proper comments and structured format. Making these few improvements should help in enhancing readability and maintainability.
The patch introduces a new class, `TestCheckpointShardinAutoTP`, that inherits from `DistributedTest` with a `world_size` of 2. Its purpose is to test checkpoint sharding functionality in AutoTP.

Key changes in the patch:

- Additional modules are imported: `deepspeed.comm` as `dist`, `huggingface_hub.snapshot_download`, and `transformers.utils.is_offline_mode`.
- A new method, `write_checkpoints_json`, is introduced in the `test` method of the `TestCheckpointShardinAutoTP` class. This method writes checkpoint JSON files to the specified directory. It downloads the required model only on the first process and ignores certain file types when downloading: "*.safetensors", "*.msgpack", and "*.h5".
- `inf_config` is updated with different settings for the `replace_with_kernel_inject` and `checkpoint` properties.
- The model is loaded on meta tensors in a new way, using `AutoConfig.from_pretrained` and wrapping the model creation in a `deepspeed.OnDevice` context manager with dtype set to `torch.bfloat16`.

Overall, this patch expands the test coverage by introducing a new test class for checkpoint sharding functionality with AutoTP. The new class includes an additional function to write checkpoint JSONs and uses a different config and model-loading process on meta tensors.
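The meta-tensor construction that `deepspeed.OnDevice` provides can be sketched with plain PyTorch (illustrative only; requires PyTorch >= 2.0, and the `nn.Linear` here is a stand-in for the real model, whose weights would later be filled in from the checkpoint shards):

```python
import torch
import torch.nn as nn

# Build a module on the "meta" device: parameters carry shape and dtype
# but allocate no storage, so even very large models construct instantly.
with torch.device("meta"):
    layer = nn.Linear(4096, 4096, dtype=torch.bfloat16)

assert layer.weight.is_meta            # no real memory was allocated
assert layer.weight.dtype == torch.bfloat16
```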
cc https://github.com/microsoft/DeepSpeed/pull/3457