versoindustries closed this issue 1 year ago
[copied from a Discord conversation with Alexander Borzunov, Mon, Jan 23, 2023]
Among the reported issues, I have not been able to find even one that is actually present in the code.
For example, the tool complains about many imports and class fields being unused; however, once you open the project, it becomes clear that they are used many times in this file and/or in the project overall.
Yes, I broke the code down into tokens and submitted it into the fine-tuning channel. I'm aware that a lot of these issues are rather benign and that the symbols are used throughout the project, but when the model only sees one file, it doesn't know that. I've been working on a resolution, and so far it's working better: significantly fewer problems with the code registering now. GPT-3 is now aware of the Petals p2p system and its code, along with fine-tuning for various other projects. I had to implement a way to parse the whole code base.
The class TransformerBackend inherits from ModuleBackend but it is not being used. The max_batch_size variable is being used in the constructor of PrioritizedTaskPool but it is not defined in the code.
It seems that your LM-based suggestions try to take on several responsibilities of a "static" code analyzer, like the ones built into PyCharm or VSCode. To the best of my understanding, LM-based analysis for things like unused variables will be inferior to traditional static analysis. Notably, both of the quoted statements are false - as are all other similar statements I found. And a static analyzer would be able to find those issues with better accuracy guarantees.
Sure, I could agree with that. So far I finally have it implemented into VSCode, and it's able to view the call stack and debugging runs. I will update this file analysis once the repo is done parsing (around 43% right now).
On Mon, Jan 23, 2023 at 9:15 AM justheuristic @.***> wrote:
As a potential user, I see this as "static analysis, but sometimes incorrect", which is not very valuable. However, I see several avenues where this thing could provide more value:
- General comments with explanations?
Perhaps it would be best to leave these suggestions to a static analyzer - i.e. remove them from model outputs completely - and focus on suggestions that can't be produced algorithmically. For instance, consider these suggestions:
The code is not commented, which makes it difficult to understand the intended behavior and the meaning of the variables and methods. The code is not written in a modular or reusable way, making it difficult to reuse parts of the code for other projects or applications.
In their current form, these statements don't help me, the user, to improve the code, because I don't know which specific part of the code triggered the model to write this. So, not "The code is not commented", but "This specific part of the code is not commented", and maybe "Here are some suggestions for comments you might add" - since you're using a generative model.
- Provide context for the static analyzer
Consider things that a static analyzer is bad at - and whether the generative model can help:
- If some issues are not human-readable, use the LM to give explanations. For instance, if you miss a ")" or "]" before you begin a for loop, the static analyzer will complain that the for loop cannot be used after the previous line (the one with the missing "]"). If the model takes this issue and adds "maybe you're missing a closing bracket on this earlier line?", that would help.
- If some issues are difficult to address, maybe use the model to generate potential solutions. For instance, if a fuzzer complains about a variable that is not always defined, use the LM to suggest how to patch the code to fix that.
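As a minimal sketch of the missing-bracket case above (a made-up snippet, not Petals code): depending on the Python version, the raw SyntaxError points either at the opening line ("'[' was never closed" in recent CPython) or at the following for-loop line ("invalid syntax" in older versions). Either message is a candidate for the LM to rephrase into something a human can act on.

```python
# Hypothetical snippet with a missing "]" on the first line. The exact
# SyntaxError message and line number vary by Python version, which is
# precisely why a plain-language explanation ("maybe you're missing a
# closing bracket on an earlier line?") would help.
source = (
    "values = [1, 2, 3\n"   # "]" missing here
    "for v in values:\n"
    "    print(v)\n"
)

try:
    compile(source, "<snippet>", "exec")
    raised = False
except SyntaxError as err:
    raised = True
    # e.g. "line 1: '[' was never closed" on recent CPython
    print(f"line {err.lineno}: {err.msg}")

assert raised  # the unclosed bracket always surfaces as a SyntaxError
```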
Perhaps there is a way to augment the model with static-analyzer suggestions? One way is to feed the static analyzer's output as additional input to the model - and train it on human comments. That way, the model would be able to "explain" the static analyzer's findings.
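One stdlib-only sketch of what that augmentation could look like: run a deterministic check first (here, a toy unused-import detector built on `ast` - an illustrative stand-in for a real tool like pyflakes), then pass the findings to the generative model as ground truth to explain, rather than letting it guess. The prompt format below is an assumption for illustration.

```python
import ast

def unused_imports(source: str) -> list[str]:
    """Toy static check: names imported but never referenced.

    A sketch only - real analyzers handle star imports, re-exports,
    __all__, string annotations, and many other cases this misses.
    """
    tree = ast.parse(source)
    imported, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)

sample = "import os\nimport sys\nprint(sys.argv)\n"
findings = unused_imports(sample)  # ['os']

# Hand the deterministic findings to the LM as context, so it explains
# and patches them instead of re-deriving (and hallucinating) them.
prompt = (
    "Static analyzer findings (ground truth):\n"
    + "\n".join(f"- unused import: {name}" for name in findings)
    + "\n\nExplain each finding and suggest a fix."
)
print(prompt)
```

The division of labor is the point: the analyzer guarantees the facts, and the model only adds the human-readable explanation on top.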
Closing this due to inactivity.
Context: This issue contains the outputs of an LM-based code analysis tool run by @versoindustries on a part of the Petals code. [comment added by @borzunov]
\\ GPT-3 | Codex | ChatGPT - Unofficial ChatGPT API /// ~ A fine-tuned GPT model wrapped into a program, plus a VS Code extension, used to analyze codebases at a reasonable cost ~ I will slowly add in other files. I haven't dived all the way into the codebase, but as I do, the model will get a clearer picture of what's going on. Some of these may not actually be issues if a symbol is used in a library or source file that is not yet known to the model.
The assert not param.requires_grad and assert not buf.requires_grad checks in the constructor of the TransformerBackend class may cause problems if the model's parameters or buffers are expected to accumulate gradients.
The inference_pool and forward_pool variables are being assigned the same instance of PrioritizedTaskPool, which could lead to unexpected behavior when processing requests for forward and inference.
The self.cache_bytes_per_token variable is being assigned the value of Counter(), which is not used later in the code and is not used for any operations.
The max_batch_size variable is being used in the constructor of PrioritizedTaskPool but it is not defined in the code.
The self.shard_num_heads variable is being used without being defined or assigned a value beforehand.
The backward_pool variable is defined but it is not being used in the code.
The self.inference_schema variable is defined but it is not being used in the code.
The class TransformerBackend inherits from ModuleBackend but it is not being used.
The import of BloomConfig is not being used in the code.
The import of BloomAttention is not being used in the code.
The self.dtype variable is being used without being defined or assigned a value beforehand.
The self.memory_cache variable is defined but it is not being used in the code.
The import of InferenceMetadata is not being used in the code.
The import of Handle is not being used in the code.
The import of is_dummy is not being used in the code.
The self.forward_pool and self.inference_pool are being defined with the same PrioritizedTaskPool instance, which could cause confusion and unexpected behavior when handling forward and inference requests.
The self.forward_pool and self.backward_pool are defined with the same max_batch_size variable, which is not defined in the code. It might cause an error if this variable is not passed as an argument.
The self.config variable is defined in the constructor but it is not used anywhere in the code.
The *args and **kwargs passed to the constructor are not used in the code, which might cause confusion and unexpected behavior if they are passed with specific values.
The from __future__ import annotations statement at the top of the code is not needed and does not affect the execution of the code in any way.
The self.inference_schema variable is defined but it is not used in the code. It is unclear if it is intended to be used for validation or documentation of the input and output schema of the inference_step method.
The self.cache_bytes_per_token variable is defined but it is not used in the code. It is unclear if it is intended to be used for memory management or performance optimization.
The self.get_inference_cache_descriptors method is defined but it is not used in the code. It is unclear what its intended purpose is and how it is related to the self.cache_bytes_per_token variable.
The batch_size and max_length arguments passed to the self.get_inference_cache_descriptors method are not used in the code. It is unclear if they are intended to be used for memory management or performance optimization.
The self.dtype variable is defined in the constructor but it is not used in the code. It is unclear what the intended use of this variable is.
The self.shard_num_heads variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.
The self.memory_cache variable is defined in the constructor but it is not used in the code. It is unclear what the intended use of this variable is.
There is no clear error handling mechanism in the code. If an error occurs, it might go unnoticed and cause unexpected behavior.
The code is not commented, which makes it difficult to understand the intended behavior and the meaning of the variables and methods.
The code is not well organized, making it difficult to understand the flow of execution and the dependencies between the different parts of the code.
The code is not written in a modular or reusable way, making it difficult to reuse parts of the code for other projects or applications.
The self.forward, self.backward and self.inference_step methods are being used in the constructor of the TransformerBackend class, but they are not defined in the code. It is unclear how these methods are supposed to work and what their intended behavior is.
The self.inference_pool variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.
The self.args_schema, self.kwargs_schema variables are used in the constructor but they are not defined in the code. It is unclear what their intended use is.
tensor_parallel and petals are not standard Python libraries, and it is not clear what they are or what they are used for.
The self.inference_schema variable is defined but it is not used in the code. It is unclear what the intended use of this variable is.
Overall, the code seems to be in the middle of development and not ready for use. It has multiple issues that need to be addressed and cleaned up before it can be used in any real-world scenario.