For audio, we have Wav2Vec2Model, which is empty and generic [code]. We use factory functions to instantiate a variant of the model. Now, this model and its paper, wav2vec 2.0, represent the first really effective pre-training approach in the audio domain, so we do not have many well-known fine-tuned models yet.
What I have decided is that the main focus is to provide a way to instantiate the model in its pre-training configuration (with or without loading pre-trained weights). So the Wav2Vec2Model is expected to have a feature extractor module and an encoder module.
The paper also experimented with ASR as one of the fine-tuning tasks. Since this is just one example of fine-tuning, we treat this model architecture as a special case (even though it's the result of this fine-tuning that proved the significance of the method).
We added factory functions to build this variant of the architecture, putting the extra module called aux alongside the encoder and feature extractor, so that we do not introduce another class (minimalistic approach).
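To make that structure concrete, here is a minimal sketch of such a factory function. This is not the actual torchaudio implementation: the stand-in sub-modules and the builder name wav2vec2_asr are made up for illustration; only the feature_extractor / encoder / optional aux layout mirrors the description above.

from typing import Optional

import torch
from torch import nn


class Wav2Vec2Model(nn.Module):
    # Generic container: a feature extractor and an encoder, plus an optional "aux" head.
    def __init__(self, feature_extractor: nn.Module, encoder: nn.Module,
                 aux: Optional[nn.Module] = None):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.encoder = encoder
        self.aux = aux  # only present in fine-tuned variants such as ASR

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.feature_extractor(features)  # (batch, time, feature)
        x = self.encoder(x)
        if self.aux is not None:
            x = self.aux(x)
        return x


def wav2vec2_asr(encoder_dim: int = 768, num_labels: int = 32) -> Wav2Vec2Model:
    # Illustrative factory: the ASR variant is just the pre-training model plus an aux head.
    feature_extractor = nn.Linear(1, encoder_dim)  # stand-in for the conv feature extractor
    encoder = nn.Sequential(nn.Linear(encoder_dim, encoder_dim), nn.GELU())  # stand-in for the transformer
    aux = nn.Linear(encoder_dim, num_labels)
    return Wav2Vec2Model(feature_extractor, encoder, aux)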
Now, what I expect users to do for custom downstream adaptation is to use our pre-training Wav2Vec2Model as part of their larger model:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.wav2vec2 = Wav2Vec2Model(...)  # pre-training configuration, optionally with pre-trained weights
        self.downstream = ExtraModuleForDownStreamTask()
This will eliminate the need to perform complicated model surgery. But this is possible because the audio domain has only just started to adopt pre-training in the wild.
Note: HuggingFace Transformers has the same model/weights, and they indeed have different classes for pre-training and fine-tuning. Worse, some of them force the user into exactly one configuration of the downstream model and issue warnings like "one of the modules is not trained, please do training". As a domain library, I decided to implement the minimal component that can recreate what HF does, while keeping the number of classes we need to maintain to a minimum.
@mthrok @parmeet Thanks for bringing this up.
I think there might be a misunderstanding possibly caused by the fact that I have not highlighted model composition on the special cases part of the RFC.
BTW, I provided a demo implementation of how to support model composition. Here is a sample implementation of the current proposal: https://github.com/datumbox/dapi-model-versioning/blob/e5a50a6cc8d99bce673c531184fb88c1cabdce6a/dapi_lib/models/faster_rcnn.py#L77-L80
To summarize what happens in Vision (which I think aligns with your cases), the Classification architectures can be seen as Encoders. We support over 13 families of architectures, which translates to 59 models. Then we support at least 4 families of Object Detection heads, which in principle can be combined with any of the 59 encoding models (similarly for Segmentation etc.). Though in theory it is possible to combine all encoders with all task-specific heads, in practice it's completely impractical:
For the above reasons, our practice is to offer a limited number of pre-trained weights. This means only a handful of canonical models are provided, so the number of builders doesn't explode. Note that this doesn't mean that our users can't combine models as they want. Our model classes usually receive a backbone parameter to allow for arbitrary combinations. We even provide an example in our documentation on how to achieve this.
So my proposal is to keep the number of model builder methods limited by providing only the canonical combinations that appear in the papers. At the same time, structure your Model classes in a way that things can be combined. This way you get the best of both worlds. Please let me know what you think.
Thanks @datumbox for surfacing this. The builder function here optionally accepts only a Resnet_50 backbone. What if the users want something else? I understand, and as you noted below, arbitrary combinations are rare in vision and you tend to provide explicit factory functions for the combinations that are published in the literature. This is exactly where we differ in our requirements. In text (limiting this discussion to transformer-based models) we have three kinds of backbone: encoder, decoder and encoder-decoder. Encoder-based architectures are more suitable for classification-style tasks, e.g. sequence classification, part-of-speech tagging, and Q&A, whereas decoder-based architectures are more suitable for generation tasks like summarization and language generation. Typically a user can attach any backbone model to any task for which it is suitable. Refer to this HF paper for additional details. This means that there is no such thing as a "popular combination" (though naturally some combinations are best for the task at hand) and in principle people can explore various combos.

Now, to keep things simple, ideally every backbone model comes with its own implementation of task-specific heads. So roughly speaking, every model class (Roberta, XLMR, BERT etc.) implements the encoder part and the task-specific heads. This is a general trend, as researchers want to show how good their pre-trained backbone is by demonstrating performance on several downstream tasks (refer to the [GLUE](https://gluebenchmark.com/) and SuperGLUE benchmarks).
Just to be clear, I think we can easily implement this composition within your proposed framework, by simply implementing task-specific builders (encoder + task) (this is exactly what HF did, see e.g. the example in the issue above) and creating the corresponding Enum entries. This is a completely viable solution; a rough sketch is given below.
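For concreteness, here is what one such task-specific builder plus Enum entry could look like. All names here (RobertaClassificationWeights, RobertaForClassification, roberta_classification) and the stand-in encoder are purely illustrative, not an existing torchtext or HF API.

from enum import Enum
from typing import Optional

from torch import nn


class RobertaClassificationWeights(Enum):
    # Hypothetical entry: weights for the full encoder + classification-head combination.
    SST2_V1 = "roberta_classification_sst2_v1"


class RobertaForClassification(nn.Module):
    # Hypothetical encoder + task-head composition (one such class/builder per combination).
    def __init__(self, encoder: nn.Module, num_classes: int, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_embeddings):
        features = self.encoder(token_embeddings)  # (batch, seq, hidden)
        return self.head(features[:, 0])           # classify from the first token


def roberta_classification(num_classes: int,
                           weights: Optional[RobertaClassificationWeights] = None) -> nn.Module:
    encoder = nn.TransformerEncoder(  # stand-in for a real RoBERTa encoder
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=2,
    )
    model = RobertaForClassification(encoder, num_classes)
    if weights is not None:
        pass  # a real builder would download and load the matching state dict here
    return model

Every additional (backbone, task) combination would need another such builder and another Enum entry, which is what leads to the concerns below.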
My concerns here are:
1) Explosion in the number of user-facing APIs and builder functions (number of backbones x number of tasks) EDIT: (number of backbones x number of architectures x number of tasks)
2) Redundancy in weights (which would be exactly the same for different builders). Most of the time, people only want pre-trained weights for the backbone and keep the task-specific heads un-initialized, to fine-tune them on their dataset. (I know we are now deviating from the pre-trained definition because the task heads the user wants are uninitialized, but this is the most dominant use-case in text.)
Now, I totally understand that I am deviating from your original proposal, which is about model versioning and not about composition. I just wanted to surface these differences to ensure we are aware of some of the most critical needs in text modeling.
CC @mthrok, as I feel the same trend might kick in for audio as well.
@parmeet Thanks for the detailed feedback.
arbitrary combinations are rare in vision
The more you describe the needs in Text the more I think they are identical to the ones of Vision. Arbitrary combinations are not rare in vision and they are compatible with this proposal!
I suspect that there is some misunderstanding, which lies in how the builder methods are used and how they are combined with the model constructors. It's true that the RFC doesn't cover these mechanisms, because they already exist and are thus considered not part of this proposal (we only use them and build upon them). Perhaps that's what causes the confusion? Below I provide a high-level overview of how things link to each other. Let me know if that clarifies things.
Here are the main ways to build a model: (1) the model builder methods that this RFC focuses on, (2) the pre-existing model constructors, and (3) a hybrid of the two.
Approach 1 is the model builder methods. Note: these exist prior to this RFC; the RFC focuses on how to introduce versioning in a backwards-compatible (BC) way.
Here are two examples of such builder methods:
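As a stand-in, the sketch below mirrors the builder signatures given further down in this comment:

def resnet50(weights: Optional[ResNet50Weights] = None, progress: bool = True, **kwargs: Any) -> ResNet:
    pass

def mobilenet_v2(weights: Optional[MobileNetV2Weights] = None, progress: bool = True, **kwargs: Any) -> MobileNetV2:
    pass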
These model building methods are the ones this RFC focuses on: they are offered only for the canonical configurations for which we provide pre-trained weights.
Given the above, let's review some of the points and concerns you raised:
What if the users want something else?
If you want a model configuration that is very specific to your problem then you should not use this API. There is a better approach for that (see approach 2) that gives you full flexibility.
Explosion in number of user facing APIs and builder functions [...] Redundancy in Weights
The number of builders does not explode because we only support those for which we provide weights. Obviously the same applies to the weights, which means there is no redundancy.
people only want pre-trained weights for back bone and keep task specific heads un-initialized
If you want to combine arbitrary backbones with new heads, this is not the right API for this use-case. There is a better approach that uses object composition (see approach 3).
Approach 2 is the model constructors. Note: this is not covered in this RFC because it pre-exists. Typically the model constructors can handle a huge number of model variations and configurations. This API is used to build truly custom models.
Below we see two examples of this in vision:
One is able to produce variants (mobilenetv3_large, mobilenetv3_small etc.) of a family of models (MobileNetV3) or even create something new by adapting the architecture configuration, the type of normalization/activation layers etc.:
# Class definition (imports shown for completeness)
from typing import Any, Callable, List, Optional

from torch import nn

class MobileNetV3(nn.Module):
def __init__(
self,
inverted_residual_setting: List[InvertedResidualConfig],
last_channel: int,
num_classes: int = 1000,
block: Optional[Callable[..., nn.Module]] = None,
norm_layer: Optional[Callable[..., nn.Module]] = None,
**kwargs: Any
) -> None:
pass
# How to use to produce arbitrary models:
# bneck_conf is shorthand for InvertedResidualConfig with a fixed width multiplier (as in torchvision)
architecture_config = [
bneck_conf(16, 3, 16, 16, True, "RE", 2, 1),
bneck_conf(16, 3, 72, 24, False, "RE", 2, 1),
bneck_conf(24, 3, 88, 24, False, "RE", 1, 1),
bneck_conf(24, 5, 96, 40, True, "HS", 2, 1),
...
]
model = MobileNetV3(architecture_config, 2048, num_classes=20, norm_layer=nn.ReLU)
In cases where model composition is required, one simply provides the desired backbone/encoder to the model constructor. Internally the class attaches the task heads, handles extra algorithmic steps and does all the necessary things to produce predictions for the task:
# Class definition:
class FasterRCNN(GeneralizedRCNN):
def __init__(self, backbone, num_classes=None,
# transform parameters
min_size=800, max_size=1333,
image_mean=None, image_std=None,
# RPN parameters
rpn_anchor_generator=None, rpn_head=None,
rpn_pre_nms_top_n_train=2000, rpn_pre_nms_top_n_test=1000,
rpn_post_nms_top_n_train=2000, rpn_post_nms_top_n_test=1000,
rpn_nms_thresh=0.7,
rpn_fg_iou_thresh=0.7, rpn_bg_iou_thresh=0.3,
rpn_batch_size_per_image=256, rpn_positive_fraction=0.5,
rpn_score_thresh=0.0,
# Box parameters
box_roi_pool=None, box_head=None, box_predictor=None,
box_score_thresh=0.05, box_nms_thresh=0.5, box_detections_per_img=100,
box_fg_iou_thresh=0.5, box_bg_iou_thresh=0.5,
box_batch_size_per_image=512, box_positive_fraction=0.25,
bbox_reg_weights=None):
pass
# How to use to produce arbitrary models:
import torchvision

backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# ...more customizations defined here...
model = FasterRCNN(backbone, num_classes=2)
So now it should be clear that if you structure your Model classes to receive the backbone/encoder and support composition, you can support arbitrary combinations without issues. Note that in both cases, the discussion of pre-trained weights makes no sense because the user is able to produce arbitrary architectures that need to be trained from scratch.
Approach 3 is a hybrid that combines model builders and constructors. Note: such implementations are likely to differ from library to library and will depend on the task/model, hence this RFC does not prescribe how to do it.
As mentioned in the document, this RFC tries to give guidelines but also gives space to the DAPI libraries to handle their own implementation details based on their needs. Here I describe one way you could combine the above two existing mechanisms to build custom APIs for your users.
we can easily implement this composition for your proposed framework, by simply implementing task specific builders
Exactly, the proposed framework does not limit you and it's not opinionated. Let's see how you can use the proposal to support the use-case where people use pre-trained weights for the backbone and keep task-specific heads un-initialized.
This RFC tells you how to define builder methods for models for which you offer pre-trained weights:
def resnet50(weights: Optional[ResNet50Weights] = None, progress: bool = True, **kwargs: Any) -> ResNet:
pass
def mobilenet_v2(weights: Optional[MobileNetV2Weights] = None, progress: bool = True, **kwargs: Any) -> MobileNetV2:
pass
The existing model constructor API allows you to pass backbones to task-specific models. Thus you can create a task builder method that combines arbitrary backbones with a model that solves a specific task:
def fasterrcnn(backbone: nn.Module, **kwargs: Any) -> FasterRCNN:
# In practice some additional code might be needed here to adjust sizes/aspect_ratios etc but that's the rough idea
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),), aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'], output_size=7, sampling_ratio=2)
return FasterRCNN(backbone, rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler, **kwargs)
# Usage:
backbone = mobilenet_v2(weights=MobileNetV2Weights.ImageNet1K_RefV1).features
model = fasterrcnn(backbone, num_classes=2)
Let's now review the original concerns:
- Explosion in number of user facing APIs and builder functions (number of backbones x number of tasks)
I believe the above example makes it clear that this is not the case. You can use the model builder methods of this RFC to create pre-trained backbones and then use standard object composition to keep the number of task-specific builders low. Instead of growing multiplicatively, the number of APIs grows linearly (e.g. with vision's 59 encoders and 4 detection-head families mentioned earlier, composition needs roughly 59 + 4 builders rather than 59 × 4).
- Redundancy in weights (which would be exactly the same for different builders). Most of the time, people only want pre-trained weights for the backbone and keep the task-specific heads un-initialized, to fine-tune them on their dataset. (I know we are now deviating from the pre-trained definition because the task heads the user wants are uninitialized, but this is the most dominant use-case in text.)
Not only is there no redundancy in weights, but we don't even need to use the weights mechanism in task builder methods for which we don't offer fully pre-trained weights. Such builders are beyond the scope of the RFC and fully within the control of each DAPI library to adjust to its needs. Finally, the use-case of having a pre-trained backbone with un-initialized heads is covered exactly by the example above.
I think most of the misunderstandings are caused by the fact that the RFC document does not cover the existing model building methods. I'm happy to make the necessary changes to make this clearer but you are also welcome to send a PR. If I misunderstood any of your points please let me know. Thanks!
Thank you @datumbox for providing additional explanation with concrete examples. This makes things even clearer: this RFC and model composition are two different topics. To summarize what you said above:
1) The RFC is about the available pre-trained models (whatever they may be: backbone, or backbone plus task head) and their versioning.
2) Models can easily be composed arbitrarily with the hybrid approach (approach 3), and this is not restricted by the RFC. It does not constrain libraries in any way in how they facilitate composition, and the details of how it is/should be done are left to the library maintainers.
3) There is no redundancy in weights because they are meant for end-to-end models. User-specific needs like pre-trained weights for the backbone with un-initialized weights for the task head are outside the scope of model versioning, since it does not make much sense to provide pre-trained weights for backbone+task when only the backbone's pre-trained weights are required. The ideal place to handle this is at the model composition level.
I think we are well aligned here :)
Closing this issue, as it should now be apparent that model composition is off-topic and the RFC doesn't restrict libraries from implementing it per their specific needs.
One of the dominant scenarios for text is to use some pre-trained encoder (Roberta, BERT, XLMR etc.) and attach a task-specific head on top of it (classification head, language modeling head, POS tagging head, Q&A head etc.). I believe this is also true for Vision (and perhaps for audio as well, @mthrok?). To the best of my knowledge (please correct me if I am mistaken), vision currently provides a factory function for every possible combination thereof? This approach is somewhat limiting in terms of scalability and the boilerplate code overhead that comes with it. Also, versioning could be a bit redundant if we replicate the same weights class across each combination for the encoder part.
I wonder what folks think about extending this framework to support model composition?
As a reference, HF also explicitly provides classes for every combination. Here is one example for a Roberta encoder + Q&A task.