Open mattmazzola opened 1 year ago
@mattmazzola Sorry for the delay - I've been on PTO. I would be interested in all of the above contributions as they come online (deferring to you on what the best ordering here would be)!
interested in all of the above contributions
Ok! I will talk with the rest of the team and see what we want to do.
We are trying to roll off our current work and transition to another project, so it is not clear how much time we will be able to spend on these contributions. This creates a trade-off: the larger items would have more impact, but the smaller items need less commitment.
deferring to you on what the best ordering here would be
These fixes and features from our fork have some non-trivial divergence from metaseq main, so it is hard to judge how much work is involved until we see how many merge conflicts there are. The divergence also makes testing difficult or impossible, since our infrastructure used a different set of dependencies and ran in an Azure Machine Learning environment.
The list above was ordered by our estimate of how impactful each PR contribution would be to Metaseq. However, given these difficulties, I was trying to create the PRs in the inverse order to increase the likelihood that they merge, beginning with the smallest / easiest, since those were least likely to break something and wouldn't rely on as much help.
I think I may be able to at least submit PRs to share the ideas, but they may not be directly mergeable. To be safest, a core maintainer could take over the PR or branch and verify it.
These fixes and features from our fork has some non-trivial divergence from metaseq main... I think I may be able to at least submit PRs to share the ideas, but they may not be directly mergeable.
That makes a lot of sense - feel free to open up PRs in whatever state you have them; they will be a useful starting point for figuring out how to merge / test them and pull into main over time.
I have created PRs for all of the items on the initial issue list (except for item 4) and referenced this issue. Hopefully these can help improve Metaseq. Perhaps someone will continue exploration of the "soft" distillation technique in the future.
We are a team at @microsoft Research that has a fork of the Metaseq repo with these additional features:
jsonl_dataset.py#_build_index properly accounts for multi-byte characters.
Questions
We would be happy to answer any questions you have about the above components.
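For context on the jsonl_dataset.py#_build_index item, here is a minimal sketch of the general idea, not the code from our fork or from metaseq main. It assumes the index is a list of byte offsets marking where each line of a JSONL file starts. The function names `build_line_offsets` and `read_record` and the sample records are hypothetical. The point is that offsets are counted in bytes (by reading in binary mode), because a count of decoded characters drifts from the seek position as soon as a line contains a multi-byte UTF-8 character.

```python
import json
import os
import tempfile


def build_line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    cur = 0
    # Binary mode: len(line) is a byte count, which is what seek() expects.
    with open(path, "rb") as f:
        for line in f:
            offsets.append(cur)
            cur += len(line)
    return offsets


def read_record(path, offsets, i):
    """Seek to the i-th line by byte offset and parse it as JSON."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return json.loads(f.readline().decode("utf-8"))


# Hypothetical sample data containing multi-byte characters.
records = [{"text": "héllo"}, {"text": "日本語"}, {"text": "plain"}]
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="\n", suffix=".jsonl", delete=False
) as tmp:
    for r in records:
        tmp.write(json.dumps(r, ensure_ascii=False) + "\n")
    path = tmp.name

offsets = build_line_offsets(path)
loaded = [read_record(path, offsets, i) for i in range(len(records))]
os.remove(path)
```

If the offsets were accumulated from the lengths of decoded strings instead, every record after the first non-ASCII line would be read starting from the wrong position.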
@tupini07