catalyst-cooperative / mozilla-sec-eia

Exploratory development for SEC to EIA linkage
MIT License

Improve performance of Ex. 21 extraction model #68

Open katie-lamb opened 3 weeks ago

katie-lamb commented 3 weeks ago

Now that we have a validation framework for Ex. 21 extraction, try these simple improvements and re-evaluate performance.

### Tasks
- [ ] Investigate CorpWatch dataset to see if we can use anything there for validation
- [x] Get to 50 validation dataframes, high overlap with training data
- [x] Strip company name parts (LLC, Co, etc.) from names before doing similarity comparison
- [x] Retrain LayoutLM with more training data
- [ ] Look at the diff between computed and validation dataframes
- [x] Handle null values in precision and recall metrics
- [ ] Log metric with the number of docs in the training set, or log the training set metadata
- [ ] Label more training set documents in Label Studio (get to 100 docs)
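The suffix-stripping task above might look something like this minimal sketch. The suffix list and function name are illustrative assumptions, not the repo's actual implementation:

```python
import re

# Common corporate suffixes to drop before fuzzy matching
# (illustrative set; the real list used in the repo may differ).
SUFFIXES = {
    "llc", "inc", "co", "corp", "ltd", "lp", "llp",
    "company", "corporation", "incorporated",
}

def strip_company_suffixes(name: str) -> str:
    """Lowercase a company name, drop punctuation, and strip trailing suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    # Pop suffix tokens off the end so "Acme Holdings, LLC" and
    # "Acme Holdings" compare equal downstream.
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

Normalizing both sides before computing similarity keeps "Widget Co." from scoring as a partial match against "Widget".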
katie-lamb commented 2 weeks ago

Hey @zschira, do you know if there's a way to log the size of the training data set (number of docs) in MLflow for each LayoutLM fine-tuning run? Since near-term improvements will likely come from increasing the size of the labeled training set, it would be a good variable to log. I saw that inside the `log_model` util function we call `mlflow.transformers.log_model(model, artifact_path="layoutlm_extractor", task="token-classification")`, so there's not a ton of customization in logging during a training run. Maybe the thing to do is just log the training set size as a parameter before this `log_model` call?
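A minimal sketch of that suggestion, assuming an active MLflow run and that `train_docs` is the list of labeled documents (both function names here are hypothetical, not from the repo):

```python
def training_set_size(train_docs) -> int:
    """Number of labeled documents in the training set."""
    return len(train_docs)

def log_model_with_metadata(model, train_docs):
    """Log training set size as a param just before the existing log_model call.

    Sketch only: assumes this runs inside an active MLflow run, mirroring the
    call already in the log_model util.
    """
    import mlflow  # imported here so the sketch is self-contained

    mlflow.log_param("training_set_size", training_set_size(train_docs))
    mlflow.transformers.log_model(
        model,
        artifact_path="layoutlm_extractor",
        task="token-classification",
    )
```

Logging it as a param (rather than a metric) makes it show up in the run comparison table, so runs with different training set sizes are easy to filter.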