facebookresearch / metaseq

Repo for external large-scale work
MIT License
6.52k stars 726 forks

Possible feature and bugfix contributions from Microsoft research team's fork of Metaseq #726

Open mattmazzola opened 1 year ago

mattmazzola commented 1 year ago

We are a team at @microsoft Research that maintains a fork of the Metaseq repo with these additional features:

  1. New pipeline task to perform Knowledge Distillation via Log Probabilities using a modified Cross Entropy implementation.
  2. Improved inference script with added functionality such as ability to output logprobs/logits.
  3. Improvements to Training Stop Conditions
  4. Scripts to support Teacher data generation using OpenAI Service
  5. Documentation system using Sphinx
    1. Documentation of Co-Teaching training process (https://arxiv.org/pdf/2305.02031.pdf)
  6. Improved evaluation configuration to evaluate with different metrics depending on dataset
  7. Miscellaneous Bug Fixes
    1. jsonl_dataset.py#_build_index properly accounts for multi-byte characters.
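For readers unfamiliar with the multi-byte issue in item 7.1, here is a minimal sketch of the general problem. It is not the code from our fork or from `jsonl_dataset.py`; the function names (`build_index_by_chars`, `build_index_by_bytes`, `read_record`) and the sample records are hypothetical. The idea is that an index used with `seek()` must store byte offsets, and counting decoded characters undercounts whenever a line contains non-ASCII text.

```python
import json
import os
import tempfile


def build_index_by_chars(path):
    """Buggy variant: offsets are counted in decoded characters."""
    offsets, pos = [], 0
    with open(path, "r", encoding="utf-8", newline="") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)  # characters, not bytes
    return offsets


def build_index_by_bytes(path):
    """Fixed variant: offsets are counted in raw bytes, matching seek()."""
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)  # bytes
    return offsets


def read_record(path, offset):
    """Seek to a byte offset and parse the JSON line that starts there."""
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.readline().decode("utf-8"))


# Hypothetical sample data containing multi-byte UTF-8 characters.
records = [{"text": "héllo"}, {"text": "日本語"}, {"text": "plain"}]
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "wb") as f:
    for r in records:
        f.write((json.dumps(r, ensure_ascii=False) + "\n").encode("utf-8"))

char_index = build_index_by_chars(path)
byte_index = build_index_by_bytes(path)
recovered = [read_record(path, off) for off in byte_index]
os.remove(path)
```

With ASCII-only data the two indexes agree, which is why this kind of bug tends to stay hidden until non-English text shows up in the dataset.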

Questions

We would be happy to answer any questions you have about the above components.

@tupini07

suchenzang commented 1 year ago

@mattmazzola Sorry for the delay - I've been on PTO; would be interested in all of the above contributions as they come online (deferring to you on what the best ordering here would be)!

mattmazzola commented 1 year ago

> interested in all of the above contributions

Ok! I will talk with the rest of the team and see what we want to do.

We are trying to roll off our current work and transition to another project, so it is not clear how much time we will be able to spend on these contributions. This creates a kind of trade-off / conflict between wanting to contribute the larger items for impact and the smaller items for less commitment.

> deferring to you on what the best ordering here would be

These fixes and features from our fork have some non-trivial divergence from metaseq main, so it's hard to judge how much work is involved until we see how many merge conflicts there are. It also makes testing difficult or impossible, since our infrastructure used a different dependency set running in an Azure Machine Learning environment.

The list above was ordered by our estimate of how impactful the contributions would be to Metaseq; however, given these difficulties, I was planning to create the PRs in the inverse order to increase the likelihood that they merge, beginning with the smallest / easiest, since those are least likely to break something and wouldn't rely on as much help.

I think I may be able to at least submit PRs to share the ideas, but they may not be directly mergeable. I think the safest approach would be for a core maintainer to take over each PR or branch and verify it.

suchenzang commented 1 year ago

> These fixes and features from our fork have some non-trivial divergence from metaseq main... I think I may be able to at least submit PRs to share the ideas, but they may not be directly mergeable.

That makes a lot of sense - feel free to open up PRs in whatever state you have them; they will be a useful starting point for figuring out how to merge / test them and pull into main over time.

mattmazzola commented 1 year ago

I have created PRs for all of the items on the initial issue list (except for item 4) and referenced this issue. Hopefully these can help improve Metaseq. Perhaps someone will continue exploration of the "soft" distillation technique in the future.
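For anyone picking up the "soft" distillation idea, the general technique can be sketched roughly as below. This is only an illustration under assumptions, not the implementation in our fork or in the PRs: the helper names (`log_softmax`, `soft_cross_entropy`) and the toy numbers are hypothetical, and it assumes the teacher provides a log probability for every vocabulary entry at a single position. The loss is a cross entropy whose target is the teacher's full distribution rather than a single token.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def soft_cross_entropy(student_logits, teacher_logprobs):
    """Cross entropy with the teacher distribution as a soft target:
    -sum_i p_teacher(i) * log q_student(i)."""
    student_logprobs = log_softmax(student_logits)
    return -sum(
        math.exp(t) * s for t, s in zip(teacher_logprobs, student_logprobs)
    )


# Toy vocabulary of 3 tokens (hypothetical values).
student_logits = [2.0, 0.5, -1.0]
teacher_logprobs = log_softmax([1.5, 1.0, -0.5])

soft_loss = soft_cross_entropy(student_logits, teacher_logprobs)

# A teacher that puts all mass on token 0 recovers the usual "hard" loss.
one_hot = [0.0, float("-inf"), float("-inf")]
hard_loss = soft_cross_entropy(student_logits, one_hot)

# Entropy of the teacher: the lower bound the soft loss can reach.
teacher_entropy = -sum(math.exp(t) * t for t in teacher_logprobs)
```

The soft loss is smallest when the student distribution matches the teacher's, at which point it equals the teacher's entropy. With a one-hot teacher it reduces to the standard cross entropy on the target token.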