Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Feature: training dataset maintenance #49

Open josh-chamberlain opened 5 months ago

josh-chamberlain commented 5 months ago

Context

Now that we've done it a few times, let's be systematic about how we update the base training dataset in Hugging Face.

Requirements

Docs

maxachis commented 5 months ago
  • [ ] check for new URLs in our database which aren't already in the training-urls dataset via the API

Where will these new raw URLs be hosted? At the moment, my common_crawler PR #45 simply stores new URLs in the repository, which obviously isn't a sustainable long-term option.
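The "new URLs" check from the checklist above boils down to a set difference. A minimal sketch (the input lists stand in for what the PDAP API and the training-urls dataset would return; the function name is hypothetical):

```python
def urls_missing_from_training(database_urls, training_urls):
    """Return URLs found in the database but not yet in training-urls,
    preserving database order and dropping duplicates."""
    seen = set(training_urls)
    missing = []
    for url in database_urls:
        if url not in seen:
            seen.add(url)  # also de-duplicates repeats within database_urls
            missing.append(url)
    return missing
```
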

josh-chamberlain commented 5 months ago

@maxachis this is a good point—in general, I think we should use Hugging Face datasets.

I added detail to this issue: https://github.com/Police-Data-Accessibility-Project/data-source-identification/issues/40

maxachis commented 5 months ago

@josh-chamberlain To make sure I fully understand the workflow:

  1. Pull URLs from database
  2. Pull URLs from training-urls dataset
  3. Get all URLs from 1 which are not in 2
  4. Run HTML tag collector on results from 3
  5. Insert these results into LabelStudio (need confirmation especially on this step)
  6. Take results from LabelStudio
  7. Merge with URLs from training-urls dataset pulled in 2. Update the last_updated property of all new entries (or of the entire dataset?)
  8. Put the results of 7 into training-urls dataset
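Steps 6–8 could be sketched as a merge keyed on URL, stamping `last_updated` only on the rows arriving from LabelStudio (the field names here are assumptions for illustration, not the actual schema):

```python
from datetime import date

def merge_into_training(existing_rows, labeled_rows, today=None):
    """Merge newly labeled rows into the training-urls rows, keyed by URL.
    Only rows arriving from LabelStudio get a fresh last_updated stamp;
    untouched existing rows keep their original value."""
    today = today or date.today().isoformat()
    merged = {row["url"]: dict(row) for row in existing_rows}
    for row in labeled_rows:
        updated = dict(row)
        updated["last_updated"] = today
        merged[updated["url"]] = updated  # new URLs append, repeats overwrite
    return list(merged.values())
```

Stamping only the new entries (rather than the whole dataset) keeps `last_updated` meaningful per row, which seems like the answer to the question in step 7.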

Additionally, does the training-urls dataset currently exist? I do not see a dataset named training-urls in PDAP's Hugging Face.


josh-chamberlain commented 5 months ago

@maxachis

5. we don't need to insert into LabelStudio—to be clear, we are checking LabelStudio for newly labeled URLs which aren't already in our training data.

training-urls doesn't currently exist; we still need to create the dataset, plus a strategy for managing batches of URLs within it.
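The revised step 5 (checking LabelStudio rather than inserting into it) could look roughly like this. The task shape is modeled on a LabelStudio JSON export (`data` dict plus an `annotations` list), but the function and field names are assumptions, not the project's actual code:

```python
def newly_labeled_urls(labelstudio_tasks, training_urls):
    """From a LabelStudio export, keep tasks that have at least one
    annotation and whose URL is not already in the training data."""
    known = set(training_urls)
    return [
        task for task in labelstudio_tasks
        if task.get("annotations") and task["data"]["url"] not in known
    ]
```
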

josh-chamberlain commented 3 months ago

I updated the README for this repo and tweaked this issue slightly. I think using Hugging Face as a database for unlabeled URLs is not needed: we can track batches by ID in GitHub, without putting them in Hugging Face before they're labeled. Hopefully this is much simpler.