Open crimson206 opened 3 months ago
I previously suggested setting up unittests. Returning the validation result along with the conversion will make it easier to write tests for the conversion process.
Developers will want to programmatically analyze the result of conversion in a more detailed manner. Therefore, I suggest returning the validation and conversion results as well after conversion.
Right. I believe that's how unit tests should be implemented.
Returning all that by default would be quite unwieldy, though. Hugging Face's transformers models have a bunch of arguments that control which outputs need to be returned (`output_attentions`, `output_hidden_states`), as well as how the outputs are organized (`return_dict`). Why not take the same route?
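A minimal sketch of that flag-based scheme (the names `convert`, `return_validation`, and `return_dict` here are hypothetical, not the converter's real API):

```python
from typing import Any, Dict, Tuple, Union

class Model:
    pass

class ValidationResult:
    pass

def convert(
    return_validation: bool = False,
    return_dict: bool = False,
) -> Union[Model, Tuple[Any, ...], Dict[str, Any]]:
    # Hypothetical converter: flags control which outputs come back and
    # how they are organized, mirroring the transformers style.
    model = Model()
    outputs: Dict[str, Any] = {"model": model}
    if return_validation:
        outputs["validation_result"] = ValidationResult()
    if return_dict:
        return outputs
    values = tuple(outputs.values())
    return values[0] if len(values) == 1 else values
```

With defaults, `convert()` returns just the model; `convert(return_validation=True, return_dict=True)` returns a dict with both outputs.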
You are the owner of this project, so you are responsible for keeping its structure consistent. Users relying on the default settings might run into issues if this implementation were merged directly. The return scheme you propose is quite common; I've seen it frequently in Deep Learning models, and I understand your concerns. However, I personally disagree with it, partly because many developers may not have the programming skill to juggle flag-dependent return types. The Hugging Face implementations are not entirely satisfactory either.
Let's deal with the first problem.
Please see the example below.
```python
from typing import Any, Tuple, Union, overload

class Model:
    """Description about the Model."""
    def func_from_model(self):
        pass

class ValidationResult:
    """Description about the ValidationResult."""
    status: str

@overload
def validate(accessory: Any) -> Tuple[Model, ValidationResult]: ...
@overload
def validate() -> Model: ...
def validate(accessory: Any = None) -> Union[Model, Tuple[Model, ValidationResult]]:
    pass
```
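To make the idea concrete, here is a runnable sketch of the overloaded design (the body is a stand-in, not the real converter):

```python
from typing import Any, Tuple, Union, overload

class Model:
    pass

class ValidationResult:
    status: str = "ok"

@overload
def validate(accessory: Any) -> Tuple[Model, ValidationResult]: ...
@overload
def validate() -> Model: ...
def validate(accessory: Any = None) -> Union[Model, Tuple[Model, ValidationResult]]:
    # The presence of `accessory` decides which shape is returned.
    if accessory is None:
        return Model()
    return Model(), ValidationResult()

model = validate()                 # a checker sees: Model
pair = validate("any accessory")   # a checker sees: Tuple[Model, ValidationResult]
```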
If we use `overload`, we can achieve both call shapes. However, it has a limitation: the checker picks the overload by the *presence* of `accessory`, so regardless of the value of `accessory`, the output type is denoted as `Tuple[Model, ValidationResult]`.
Therefore, it's better to use the presence of `accessory` itself as the signal.
```python
def common(accessory: bool):
    if accessory is True:
        return Model(), ValidationResult()
    else:
        return Model()

def current_case(accessory: Any = "None"):
    """
    If the `accessory` arg is detected in the function call syntax,
    the ValidationResult is also returned.
    Otherwise, this function returns only the Model.
    """
    if accessory == "None":
        return Model()
    else:
        return Model(), ValidationResult()
```
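For illustration, the call-site behavior of `current_case` (classes stubbed out):

```python
from typing import Any

class Model:
    pass

class ValidationResult:
    pass

def current_case(accessory: Any = "None"):
    # The sentinel default detects whether the caller actually
    # passed `accessory`.
    if accessory == "None":
        return Model()
    return Model(), ValidationResult()

model = current_case()                       # Model only
model, result = current_case(accessory=1)    # Model plus ValidationResult
```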
If we design the function like `current_case`, the output and the annotation (type hint) stay in sync. It is programmatically well-designed, but not common; users might find it confusing. If you're interested, you can further improve this methodology.
This is actually a parallel topic.
I'm not sure what you mean exactly by "unwieldy". If you think the `Tuple` type hint is unwieldy, I disagree.
Please see the example below:
```python
from typing import Dict, Tuple

class KerasModel:
    pass

class PytorchNode:
    pass

class ValidationResult:
    pass

class ConversionResult:
    pass

def pytorch_to_keras(
) -> Tuple[KerasModel, object, Dict[PytorchNode, ValidationResult], Dict[PytorchNode, ConversionResult]]:
    """
    Returns:
        Description about the KerasModel.
        Descriptions continued...
        ...
    """

output = pytorch_to_keras()
```
Developers can read all the output types, and you can add additional descriptions as docstrings for more details.
There are typically three options:
`Tuple` is not unwieldy, but it might not be the best option either. Each has its pros and cons; the choice depends on your preference.
I disagree that returning all outputs by default would be unwieldy. In fact, I believe it's more consistent and potentially less problematic. Here's why:
1. **Consistency**: Always returning the same structure ensures consistency across different use cases. Users don't need to worry about different return types based on input parameters.
2. **Memory efficiency**: We're not actually returning all the objects, just their references. The memory impact of returning unused outputs is negligible in most cases.
3. **Simplicity**: A single, consistent return structure makes the API simpler to use and understand. Users don't need to remember which flags to set to get specific outputs.
4. **Future-proofing**: If we always return all possible outputs, adding new output types in the future won't break existing code.
5. **Performance**: In many cases, the computation of these outputs (like attention weights) happens anyway. Not returning them doesn't save computation time.
While Hugging Face's approach with `output_attentions`, `output_hidden_states`, and `return_dict` offers flexibility, it also introduces complexity. Users need to remember these flags and their effects.
Instead of controlling which outputs are returned, we could focus on providing clear documentation and type hints. This way, users can easily ignore the outputs they don't need, while still having access to all information if required.
If memory usage is a concern in specific scenarios, we could provide separate methods or a context manager for memory-sensitive operations, rather than complicating the main API.
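A sketch of that last idea, assuming a hypothetical `lightweight_outputs` context manager (nothing like this exists in the converter today):

```python
from contextlib import contextmanager

# Hypothetical module-level switch, purely illustrative.
_FULL_OUTPUTS = True

@contextmanager
def lightweight_outputs():
    # Temporarily disable the extra outputs for memory-sensitive code.
    global _FULL_OUTPUTS
    previous = _FULL_OUTPUTS
    _FULL_OUTPUTS = False
    try:
        yield
    finally:
        _FULL_OUTPUTS = previous

def convert():
    model = object()
    if _FULL_OUTPUTS:
        return model, {"validation": "..."}, {"conversion": "..."}
    return (model,)

with lightweight_outputs():
    (model,) = convert()  # only the model inside the context
```

This keeps the main API consistent while still offering an escape hatch for memory-sensitive scenarios.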
### Tuple
Tuples are simple to implement. Neither the project owner nor users need to write any additional code.
It's not ideal if we're likely to add more outputs in the future: `output[2]` from the previous version might be a different type in the new version.
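Concretely, the index-drift problem (made-up converter versions):

```python
# Version 1 of a hypothetical converter returns (model, validation_result).
def convert_v1():
    return "model", {"validation": "ok"}

# Version 2 inserts a new output in the middle:
# (model, tensor, validation_result).
def convert_v2():
    return "model", [1.0, 2.0], {"validation": "ok"}

validation = convert_v1()[1]  # a dict, as intended
validation = convert_v2()[1]  # now a list of floats, a different type
```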
### Dict
You need to define a dictionary. For users, `output[0]` is shorter than `output['model']`. At this cost, we gain better extensibility: even if we add a new output, `output['model']` is not affected.
It sounds good, but I don't recommend it. Let me explain the reason.
If you use the dictionary type, you'll type-hint like this:
```python
from typing import Dict, Literal, Union

output: Dict[
    Literal["model", "tensor", "validation_result", "conversion_result"],
    Union[
        KerasModel,
        object,
        Dict[PytorchNode, ValidationResult],
        Dict[PytorchNode, ConversionResult],
    ],
]
```
If you check the autocompletion after `output['model'].`, you'll see all the properties from `KerasModel`, `object`, and `Dict`. This is definitely not ideal.
### BaseModel
We can define the output in a more structured and detailed manner:
```python
from pydantic import BaseModel

class OutputProps(BaseModel):
    output1: int
    output2: str
```
The IDE isn't confused about the type for each output name.
We can also write descriptions for each output using `Field`.
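For example (the field names below are hypothetical, not from the converter):

```python
from pydantic import BaseModel, Field

class ConversionOutput(BaseModel):
    # Field(description=...) attaches human-readable docs to each output.
    graph_name: str = Field(description="Name of the converted graph.")
    node_count: int = Field(description="Number of converted nodes.")

out = ConversionOutput(graph_name="demo", node_count=3)
```

The descriptions end up in the generated JSON schema, and some tooling surfaces them.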
There are two disadvantages:
- **It's tiresome**: for each function, we need to define an additional class.
- **It hinders Intellisense functionality (critical)**: we can't read the inner structure of `OutputProps` from Intellisense. Compare the two pictures below. When I use models from Hugging Face, I find this very inconvenient.
As the Comparison section implies, I prefer `Tuple` the most. `pydantic.BaseModel` is also a valid option and quite common in many Deep Learning libraries. `Dict` is also a functionally good option, but it can't be considered "Clean Code".
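If `Tuple` wins, one middle ground worth noting is `typing.NamedTuple`: it unpacks and indexes like a plain tuple but gives each output a name and a precise per-field type. The class and function names below are illustrative, not the converter's API:

```python
from typing import Dict, NamedTuple

class KerasModel:
    pass

class PytorchNode:
    pass

class ValidationResult:
    pass

class ConversionOutput(NamedTuple):
    model: KerasModel
    validation_results: Dict[PytorchNode, ValidationResult]

def pytorch_to_keras_sketch() -> ConversionOutput:
    # Stubbed body, for illustration only.
    return ConversionOutput(model=KerasModel(), validation_results={})

model, validation = pytorch_to_keras_sketch()  # still unpacks like a tuple
out = pytorch_to_keras_sketch()
kind = type(out.model).__name__                # named access also works
```

Unlike the `Dict` option, the IDE resolves `out.model` to exactly one type.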
If you're willing to use `Tuple`, I'd like you to check out one of my modules, called `intelli-type`. It can be used to overcome some shortcomings of the `Tuple` output. If you're interested, I can also implement a version with it and push the commit.
There must be many unfamiliar aspects here. I am an expert in Python type hints and Intellisense (limited to VSCode, though), and this extends to Clean Code practices. Therefore, I can't ignore the potential code smells in the different architectures. This is just my point of view: Clean Code is a double-edged sword and might not always be beneficial from an agile development perspective. If you find it tiresome to read through all the details, I sincerely suggest we proceed as follows:
> Developers will want to programmatically analyze the result of conversion in a more detailed manner. Therefore, I suggest returning the validation and conversion results as well after conversion.
From the perspective of Clean Code, it's better to return a consistent type. See the examples below. If we use a `Union` as the return type, your IDE will be confused about the type of the returned variables. In the case of `func`, the `second` variable must be an `int`, but we can see that Intellisense recommends autocompletion from both `int` and `str`. In the case of `func2`, the autocompletion is limited to the designed targets.
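The `func` and `func2` snippets only appeared in the screenshots, so here is a hypothetical reconstruction of the pattern being described:

```python
from typing import Tuple, Union

def func() -> Tuple[Union[int, str], Union[int, str]]:
    # Union element types: the checker can't tell which element is which,
    # so autocompletion on `second` mixes int and str members.
    return "first", 2

def func2() -> Tuple[str, int]:
    # Precise per-element types: autocompletion on `second` shows only int.
    return "first", 2

first, second = func()    # `second` is seen as Union[int, str]
first, second = func2()   # `second` is seen as int
```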