catalyst-cooperative / mozilla-sec-eia

Exploratory development for SEC to EIA linkage
MIT License

Improve performance of Ex. 21 extraction model #68

Open katie-lamb opened 3 weeks ago

katie-lamb commented 3 weeks ago

Now that we have a validation framework for Ex. 21 extraction, try these simple improvements and re-evaluate performance.

### Tasks
- [ ] Investigate CorpWatch dataset to see if we can use anything there for validation
- [x] Get to 50 validation dataframes, high overlap with training data
- [x] Strip company name parts (LLC, Co, etc.) from names before doing similarity comparison
- [x] Retrain LayoutLM with more training data
- [ ] Look at the diff between computed and validation dataframes
- [x] Handle null values in precision and recall metrics
- [ ] Log metric with the number of docs in the training set, or log the training set metadata
- [ ] Label more training set documents in Label Studio (get to 100 docs)
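The suffix-stripping task above might look something like this minimal sketch. The suffix list and function name are illustrative assumptions, not the repo's actual implementation:

```python
import re

# Common corporate suffixes to drop before fuzzy matching
# (illustrative set; the real list used in the repo may differ).
SUFFIXES = {
    "llc", "inc", "co", "corp", "ltd", "lp", "llp",
    "company", "corporation", "incorporated",
}

def strip_company_suffixes(name: str) -> str:
    """Lowercase a company name, drop punctuation, and strip trailing suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    # Pop suffix tokens off the end so "Acme Holdings, LLC" and
    # "Acme Holdings" compare equal downstream.
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

Normalizing both sides before computing similarity keeps "Widget Co." from scoring as a partial match against "Widget".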
katie-lamb commented 2 weeks ago

Hey @zschira, do you know if there's a way to log the size of the training data set (number of docs) in MLflow for each LayoutLM fine-tuning run? Since near-term improvements will likely come from increasing the size of the labeled training set, it would be a good variable to log. I saw that inside the `log_model` util function we call `mlflow.transformers.log_model(model, artifact_path="layoutlm_extractor", task="token-classification")`, so there's not a ton of customization in logging during a training run. Maybe the thing to do is just log the training set size as a parameter before this `log_model` call?
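A minimal sketch of that suggestion, assuming an active MLflow run and that `train_docs` is the list of labeled documents (both function names here are hypothetical, not from the repo):

```python
def training_set_size(train_docs) -> int:
    """Number of labeled documents in the training set."""
    return len(train_docs)

def log_model_with_metadata(model, train_docs):
    """Log training set size as a param just before the existing log_model call.

    Sketch only: assumes this runs inside an active MLflow run, mirroring the
    call already in the log_model util.
    """
    import mlflow  # imported here so the sketch is self-contained

    mlflow.log_param("training_set_size", training_set_size(train_docs))
    mlflow.transformers.log_model(
        model,
        artifact_path="layoutlm_extractor",
        task="token-classification",
    )
```

Logging it as a param (rather than a metric) makes it show up in the run comparison table, so runs with different training set sizes are easy to filter.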