asteroid-team / asteroid

The PyTorch-based audio source separation toolkit for researchers
https://asteroid-team.github.io/
MIT License

[CI] Add version compatibility tests for pretrained models #294

Open mpariente opened 3 years ago

mpariente commented 3 years ago

🚀 Feature

Include version compatibility tests for pretrained models (e.g., check that models exported from asteroid 0.3.0 still work as expected in 0.3.3).

Motivation

We can currently break things between versions and still pass CI, which is not OK. For example, #255 (solved by #258).

Questions & remarks

jonashaag commented 3 years ago

Good idea. I actually prefer the reference-file solution because it is the most flexible (e.g., you decide when to update the reference) and is probably much easier from an environment/dependency point of view... I don't fancy maintaining installation processes for ancient versions of Asteroid and its dependencies :)

Maybe we can simply add a reference to each model export automatically? So the exported model ships an example along with the weights. (Maybe even add a README etc...) Or we can have the references in a separate folder/repo in the Asteroid codebase.

How Huggingface is doing it: https://github.com/huggingface/transformers/blob/master/tests/test_modeling_gpt2.py#L480 Looks like they do it using a reference shipped with the code base, so not tied to a pretrained model per se. I wonder how they do it for user-uploaded models though...
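
For illustration, a codebase-shipped reference could look roughly like the sketch below. This is only a sketch: the model identifier is an arbitrary example, and the expected values are placeholders that would be generated once from a known-good version and then hard-coded, HF-style.

import torch

from asteroid.models import ConvTasNet

# Placeholder reference values; in practice, generate them once from a
# known-good Asteroid version and hard-code them here.
EXPECTED_FIRST_SAMPLES = torch.tensor([0.0123, -0.0456, 0.0789])

def test_convtasnet_pretrained_regression():
    # The model identifier below is illustrative only.
    model = ConvTasNet.from_pretrained("mpariente/ConvTasNet_WHAM!_sepclean")
    model.eval()
    torch.manual_seed(0)  # fixed seed so the "reference" input is reproducible
    mixture = torch.randn(1, 8000)
    with torch.no_grad():
        estimates = model(mixture)
    # Compare only a small slice of the output against the hard-coded reference.
    assert torch.allclose(estimates[0, 0, :3], EXPECTED_FIRST_SAMPLES, atol=1e-4)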

mpariente commented 3 years ago

Maybe we can simply add a reference to each model export automatically? So the exported model ships an example along with the weights. (Maybe even add a README etc...) Or we can have the references in a separate folder/repo in the Asteroid codebase.

I thought about shipping examples with model sharing. We can't upload examples that come from wsj0 though. So random examples probably. All the Zenodo records can have a v2 and we could add a mix/estimates folder. That should weight too much.

How Huggingface is doing it: https://github.com/huggingface/transformers/blob/master/tests/test_modeling_gpt2.py#L480 Looks like they do it using a reference shipped with the code base, so not tied to a pretrained model per se. I wonder how they do it for user-uploaded models though...

It was on my to-do list to check how they do it. They can afford writing input/output by hand, but for us, it's not feasible.

jonashaag commented 3 years ago

That should weight too much

Typo?

I didn’t think about the license issue, good point :( Otherwise I’d have suggested simply extracting an example automatically during evaluation.

They can afford writing input/output by hand, but for us, it's not feasible.

Why not? We can’t put the entire input/output in the tests, but we can use some proxy (a few samples only, an SI-SDR improvement value, ...). We can even autogenerate the test code and print it to the console so that people just have to copy and paste it into a unit test. (I still think shipping examples with the models, and not in the code base, is more flexible and powerful.)
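
A rough sketch of that autogeneration idea (nothing below exists in Asteroid; the helper name, seed, and number of checked samples are made up):

import torch

def print_regression_test(model, model_name, seed=42, n_check=5):
    # Run the model on a fixed-seed random mixture and print a
    # copy-pasteable regression test to the console.
    torch.manual_seed(seed)
    mixture = torch.randn(1, 8000)
    model.eval()
    with torch.no_grad():
        estimates = model(mixture)
    ref = [round(x, 6) for x in estimates.flatten()[:n_check].tolist()]
    # The printed test still needs a line that loads `model`; that is left
    # to whoever pastes it into the test suite.
    print(f"""
def test_{model_name}_regression():
    torch.manual_seed({seed})
    mixture = torch.randn(1, 8000)
    with torch.no_grad():
        estimates = model(mixture)
    expected = torch.tensor({ref})
    assert torch.allclose(estimates.flatten()[:{n_check}], expected, atol=1e-4)
""")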

mpariente commented 3 years ago

Typo?

Yes. I meant "It shouldn't weight too much".

That seems cumbersome; let's ship examples in the models' tarballs instead.

From the LDC user agreement:

User and User’s Research Group may include limited excerpts from the LDC Databases in articles, reports and other documents describing the results of User’s non-commercial linguistic education, research and technology development.

If we always select the same sample, we can share a real one.

mpariente commented 3 years ago

So we agreed to add the test files to each record. I'm now wondering where in the record they should go.

Should they be (1) saved in separate files (which we would then have to download separately), or (2) saved as tensors in the model file itself? Adding them to the model file itself is the most practical for now.

OTOH, if we save them in separate files, we either have to upload several files, or zip/tar the whole thing and upload it (the upload/download code has to change either way). The good thing about that approach is that anybody can listen to the input/output examples (but it requires using meaningful examples). And we'd be able to use this approach to upload convergence plots or other things in the future.

In ESPnet's model zoo, they upload a zip that contains several things, which is quite cool. This is what it looks like: [screenshot of an ESPnet record omitted]
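
For what it's worth, packing such a record could look roughly like the sketch below. The archive layout, the file names, and the use of soundfile are assumptions, not an agreed-upon format.

import zipfile
from pathlib import Path

import soundfile as sf

def pack_record(model_path, mixture, estimates, sample_rate, out_path="record.zip"):
    # Bundle the weights and listenable example audio into a single zip.
    tmp = Path("record_tmp") / "examples"
    tmp.mkdir(parents=True, exist_ok=True)
    sf.write(str(tmp / "mixture.wav"), mixture.squeeze(0).cpu().numpy(), sample_rate)
    for i, est in enumerate(estimates.squeeze(0)):
        sf.write(str(tmp / f"estimate_{i}.wav"), est.detach().cpu().numpy(), sample_rate)
    with zipfile.ZipFile(out_path, "w") as zf:
        zf.write(model_path, arcname="model.pth")
        for wav in sorted(tmp.glob("*.wav")):
            zf.write(wav, arcname=f"examples/{wav.name}")
    return out_path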

mpariente commented 3 years ago

One last important thing to keep in mind: when the files in a record are modified, a new version of the record is created. So if we later decide that (2) is not good enough and change it, records will still have a version in format (2) that works with the corresponding asteroid version. All of this to say: backward compatibility is not a real problem here.

mpariente commented 3 years ago

While we're at it, I can also add the sample_rate.

import torch
from pathlib import Path

from asteroid import available_models
from asteroid.models.base_models import BaseEncoderMaskerDecoder
from asteroid.utils.hub_utils import cached_download, SR_HASHTABLE

for m_path in available_models():
    # Load the serialized dict and instantiate the model
    f = cached_download(m_path)
    dic = torch.load(f)
    dic["model_args"]["sample_rate"] = SR_HASHTABLE.get(m_path)
    model = BaseEncoderMaskerDecoder.from_pretrained(dic)

    # Predict on a random mixture (no gradients needed for the reference)
    u = torch.randn(1, 8000)
    with torch.no_grad():
        v = model(u)

    # Add the new "tests" field and save the updated model file
    dic["tests"] = dict(mixture=u, estimates=v)
    save_path = Path(f).parent.parent / "_".join(m_path.split("/")) / "model.pth"
    save_path.parent.mkdir(parents=True, exist_ok=True)
    torch.save(dic, save_path)

This works. I'll "just" have to ask the uploaders to update their records and we'd be mostly done. However, we can't listen to the separated results, and the mixture/estimates are meaningless (just random noise). How do you feel about that @jonashaag @mhu-coder @JorisCos?

I don't think I'll take the time to do (1) in the very near future, and I feel like going forward without compatibility tests is quite dangerous.
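
For context, the CI-side check that this tests field enables could look roughly like this (a sketch mirroring the script above; the tolerance and the skip for not-yet-updated records are my additions):

import torch

from asteroid import available_models
from asteroid.models.base_models import BaseEncoderMaskerDecoder
from asteroid.utils.hub_utils import cached_download

def test_pretrained_models_backward_compat():
    for m_path in available_models():
        dic = torch.load(cached_download(m_path))
        if "tests" not in dic:
            # Record not yet updated with reference I/O, skip it.
            continue
        model = BaseEncoderMaskerDecoder.from_pretrained(dic)
        model.eval()
        with torch.no_grad():
            estimates = model(dic["tests"]["mixture"])
        # The current code must reproduce the estimates saved at upload time.
        assert torch.allclose(estimates, dic["tests"]["estimates"], atol=1e-5)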

mhu-coder commented 3 years ago

Actually, (1) and (2) are not necessarily incompatible. We could put senseless I/O data in the model file for the sole purpose of testing backward compatibility, and keep the meaningful data (audio, plots, etc.) in separate files. The whole thing could then be tarred/zipped.

I wonder, though: is a single data point sufficient to check backward compatibility? I mean, do we want to make sure that the output shape of a model, for input data of different shapes, will not change even when the code for said model changes? I am asking because, while working on #293, I realized that we broke this aspect when we added JIT compatibility. If we test for various input shapes, dic["tests"] could be a dictionary of mixture/estimate pairs.
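
A sketch of what that could look like on the generation side (the shapes are arbitrary examples, and the list-of-pairs layout is just one possibility):

import torch

def make_tests_field(model, shapes=((8000,), (1, 8000), (2, 16000))):
    # Build reference mixture/estimate pairs for several input shapes,
    # to be stored as dic["tests"] instead of a single pair.
    model.eval()
    tests = []
    for shape in shapes:
        mixture = torch.randn(*shape)
        with torch.no_grad():
            estimates = model(mixture)
        tests.append(dict(mixture=mixture, estimates=estimates))
    return tests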

jonashaag commented 3 years ago

(Oops.)

Maybe naive question: Why are we using Zenodo in the first place? I guess as soon as we start having revisions of models and multiple files etc. we will start implementing a tiny version control system... so couldn't we just use Git to track the pretrained models? Then we could also add as many examples and test files and scripts etc. as we want. (This is obviously inspired by what Huggingface is doing :))

mhu-coder commented 3 years ago

Maybe naive question: Why are we using Zenodo in the first place? I guess as soon as we start having revisions of models and multiple files etc. we will start implementing a tiny version control system...

Good question :thinking: ... If we end up version controlling models, we could have a go at git-lfs.

That said, Zenodo has the advantage of a browser-based GUI, while git-lfs requires users to install a client on their machine to access the data. Plus, there's a 1 GB limit on free LFS storage on GitHub.

mpariente commented 3 years ago

Why are we using Zenodo in the first place?

Because I didn't find a better free alternative. It has a versioning system, each model can be cited, and models can be browsed. It was the closest I could get to Huggingface's model hub (which is great) with my limited time and resources. But if there is a better way, I'd be happy to switch.

I guess as soon as we start having revisions of models and multiple files etc. we will start implementing a tiny version control system... so couldn't we just use Git to track the pretrained models?

I'm not sure I understand what you mean; Zenodo comes with versioning. If I update the files in the record, I just need to update the URL in URL_HASHTABLE and that's it. (But the old URL still points to the old model.)

jonashaag commented 3 years ago

Hmmm... probably Zenodo is good enough. Using Git might be overkill. You don’t update the model every day, probably only every few months at most.

I think uploading zip files to Zenodo is the best option, even if it’s a bit more involved to implement. I can invest some time into this in the next few days to get a better understanding of what needs to be done, if you’d like me to :) Dealing with zip files in Python is pretty easy, but we'd have to adapt all of the code for downloading, uploading, and testing.
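
As a first idea, the download side could read directly from the archive, roughly like this (the member name "model.pth" is an assumption carried over from the packing sketch above):

import io
import zipfile

import torch

def load_model_dict_from_zip(zip_path, member="model.pth"):
    # Read the serialized model dict straight from the record zip,
    # without extracting the whole archive to disk.
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return torch.load(io.BytesIO(f.read()), map_location="cpu")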

mpariente commented 3 years ago

I don't see the limitations of Zenodo for our purpose. Yes, a live demo would be AMAZING, but that's out of reach for now.

Also, updating models is confusing. I've seen this in challenges where the baseline model is updated without notice and suddenly people are very confused. I'd rather have an explicit v2 in the name if the weights are changed, until we have clear model comparison tools.

Please, do invest time if you feel like it :-) In situations like these, I really want to get rid of the recipes... The code for downloading/uploading is fine, but updating the recipes is boring.

Let's agree on what we should put in the .zip.

jonashaag commented 3 years ago

Please, do invest time if you feel like it

Of course, I didn’t get around to doing it 🙄 and probably won't this week either.