Data priority, incremental training?

argosopentech / argos-train

Training scripts for Argos Translate

https://www.argosopentech.com

MIT License

Data priority, incremental training? #22

Open JanCizmar opened 1 year ago

JanCizmar commented 1 year ago

Hi there!

I would like to use the data currently provided in data-index.json together with my own custom data. Can I tell the script to treat my custom data as more relevant, i.e. give it a higher priority, when generating a model?
Let's say I have one large dataset that I use all the time, plus several smaller datasets, and I would like to train a separate model for each of the smaller ones. Is something like an incremental build possible, so that I could reuse a previous output and just "append" my custom data to save training time and resources?

Thanks!

PJ-Finlay commented 1 year ago

There's no direct support for this, but you can accomplish it by modifying argostrain/train.py.

I would add input("Downloaded Argos Data") right after the data has been downloaded, so that the script pauses there, and then append your custom data to run/source and run/target before letting it continue.
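
A minimal sketch of the append step, assuming run/source and run/target are plain-text files with one sentence per line, aligned line by line. The function name append_parallel_data, the custom file paths, and the repeat parameter are hypothetical and not part of argos-train. Writing the custom pairs several times (repeat > 1) is a simple oversampling trick for giving them more weight, not a feature of the training script.

```python
from pathlib import Path


def _read_lines(path):
    """Read a text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def _needs_newline(path):
    """True if the file exists, is non-empty, and lacks a final newline."""
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        return False
    with open(p, "rb") as f:
        f.seek(-1, 2)  # jump to the last byte
        return f.read(1) != b"\n"


def append_parallel_data(custom_source, custom_target,
                         run_source, run_target, repeat=1):
    """Append aligned custom sentence pairs to the training files.

    repeat > 1 writes the custom pairs several times (oversampling).
    Returns the number of distinct pairs appended.
    """
    src = _read_lines(custom_source)
    tgt = _read_lines(custom_target)
    if len(src) != len(tgt):
        raise ValueError("custom source and target must have the same number of lines")

    # Drop pairs where either side is empty so the files stay aligned.
    pairs = [(s, t) for s, t in zip(src, tgt) if s.strip() and t.strip()]

    # Avoid gluing the first custom line onto an unterminated last line.
    fix_src = _needs_newline(run_source)
    fix_tgt = _needs_newline(run_target)

    with open(run_source, "a", encoding="utf-8") as src_out, \
         open(run_target, "a", encoding="utf-8") as tgt_out:
        if fix_src:
            src_out.write("\n")
        if fix_tgt:
            tgt_out.write("\n")
        for _ in range(repeat):
            for s, t in pairs:
                src_out.write(s + "\n")
                tgt_out.write(t + "\n")
    return len(pairs)
```

While the training script is waiting at the input() prompt, one could run something like append_parallel_data("my_data.en", "my_data.de", "run/source", "run/target", repeat=3) from a second shell (the my_data.* file names are placeholders) and then press Enter to let training continue.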

You could also train one base model and then fine-tune it on your custom data. However, this would also require custom code.

I want to improve support for custom data and fine-tuning, so if anyone has suggestions or pull requests, they're appreciated.

martin-leoorg commented 1 year ago

Would incremental training also be possible with the suggestions from LibreTranslate? I think the available base models are already quite good, but incorporating the feedback from LibreTranslate might improve corner cases even further. This probably depends on the actual use case (e.g. a medical use case might need different fine-tuning than a scuba-diving one, to pick random examples).

It would be great to be able to quickly improve the base model without needing a high-powered machine to retrain the complete model on input data that is 99.9% the same!