Training dragnet model using TOML formatted data

dragnet-org / dragnet_data

code and data used to build a training dataset for dragnet models

MIT License


Training dragnet model using TOML formatted data #3

Closed TheTravellingSalesman closed 3 years ago

TheTravellingSalesman commented 3 years ago

Hey there!

How do you train dragnet models using this TOML-formatted metadata?

Do you have to convert it back into an id.html.corrected.txt file?

bdewilde commented 3 years ago

Hey @TheTravellingSalesman, this is a new training dataset with a new format, and we haven't gotten around to updating dragnet and retraining corresponding new models. It's on my personal backlog, but I haven't had time to carry it forward. (See: https://github.com/dragnet-org/dragnet/issues/85)

So, I don't have a specific answer for you, beyond: yes, we'll need to convert the data and/or update the training code so that the new data can be used to train a model.

TheTravellingSalesman commented 3 years ago

@bdewilde Thank you for the quick response! I'll look forward to the updates as they come out.

In the meantime, I'll just extract the text fields from the new training dataset and place them in id.html.corrected.txt files, and integrate them with my training data. Of course, an update to the training code would be pretty exciting.

Take care!

bdewilde commented 3 years ago

Hey, just following up to thank you for the nudge. I did some research this evening; I think I should make a few tweaks to this training dataset first, then revisit dragnet from the ground up to come up with a modern, ML-based, open-source solution that can compete with current commercial systems. Keep an eye out for releases! :)

TheTravellingSalesman commented 3 years ago

Certainly! I'll look forward to it :-)

bdewilde commented 3 years ago

Commenting here rather than on the PR mentioning this issue: I did actually make lots of progress on tweaking the dataset and hacking on a new concept for dragnet! But, as is often the case with open-source side projects, I had to set it aside before completion owing to other priorities. I would love to get back to it sometime soon, but can't make any promises.

A lower-lift possibility could be just writing code that lets this dataset work with the current version of dragnet. (I think that's doable...) It would also be worth properly finalizing v1 of this dataset, since while hacking I ran into edge cases and other oddities that should probably be handled in the training data itself.

Will keep posting here if/when there's progress!