Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Feature: training dataset maintenance #49

Open josh-chamberlain opened 5 months ago

josh-chamberlain commented 5 months ago

Context

Now that we've done it a few times, let's be systematic about how we update the base training dataset in Hugging Face.

Requirements

Docs

maxachis commented 5 months ago
  • [ ] check for new URLs in our database which aren't already in the training-urls dataset via the API

Where will these new raw URLs be hosted? At the moment, my common_crawler PR #45 simply stores new URLs in the repository, which obviously isn't a sustainable long-term option.
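The "new URLs" check from the checklist above boils down to a set difference. A minimal sketch (the input lists stand in for what the PDAP API and the training-urls dataset would return; the function name is hypothetical):

```python
def urls_missing_from_training(database_urls, training_urls):
    """Return URLs found in the database but not yet in training-urls,
    preserving database order and dropping duplicates."""
    seen = set(training_urls)
    missing = []
    for url in database_urls:
        if url not in seen:
            seen.add(url)  # also de-duplicates repeats within database_urls
            missing.append(url)
    return missing
```
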

josh-chamberlain commented 5 months ago

@maxachis this is a good point—in general, I think we should use Hugging Face datasets.

I added detail to this issue: https://github.com/Police-Data-Accessibility-Project/data-source-identification/issues/40

maxachis commented 5 months ago

@josh-chamberlain To make sure I fully understand the workflow:

  1. Pull URLs from database
  2. Pull URLs from training-urls dataset
  3. Get all URLs from 1 which are not in 2
  4. Run HTML tag collector on results from 3
  5. Insert these results into LabelStudio (need confirmation especially on this step)
  6. Take results from LabelStudio
  7. Merge with URLs from training-urls dataset pulled in 2. Update the last_updated property of all new entries (or of the entire dataset?)
  8. Put the results of 7 into training-urls dataset
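Steps 6–8 could be sketched as a merge keyed on URL, stamping `last_updated` only on the rows arriving from LabelStudio (the field names here are assumptions for illustration, not the actual schema):

```python
from datetime import date

def merge_into_training(existing_rows, labeled_rows, today=None):
    """Merge newly labeled rows into the training-urls rows, keyed by URL.
    Only rows arriving from LabelStudio get a fresh last_updated stamp;
    untouched existing rows keep their original value."""
    today = today or date.today().isoformat()
    merged = {row["url"]: dict(row) for row in existing_rows}
    for row in labeled_rows:
        updated = dict(row)
        updated["last_updated"] = today
        merged[updated["url"]] = updated  # new URLs append, repeats overwrite
    return list(merged.values())
```

Stamping only the new entries (rather than the whole dataset) keeps `last_updated` meaningful per row, which seems like the answer to the question in step 7.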

Additionally, does the training-urls dataset currently exist? I do not see a dataset named training-urls in PDAP's Hugging Face.


josh-chamberlain commented 5 months ago

@maxachis

5. we don't need to insert into LabelStudio—to be clear, we are checking LabelStudio for newly labeled URLs which aren't already in our training data.

training-urls doesn't currently exist; we still need to create the dataset, plus a strategy for managing batches of URLs within it.
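The revised step 5 (checking LabelStudio rather than inserting into it) could look roughly like this. The task shape is modeled on a LabelStudio JSON export (`data` dict plus an `annotations` list), but the function and field names are assumptions, not the project's actual code:

```python
def newly_labeled_urls(labelstudio_tasks, training_urls):
    """From a LabelStudio export, keep tasks that have at least one
    annotation and whose URL is not already in the training data."""
    known = set(training_urls)
    return [
        task for task in labelstudio_tasks
        if task.get("annotations") and task["data"]["url"] not in known
    ]
```
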

josh-chamberlain commented 3 months ago

I updated the README for this repo and tweaked this issue slightly. I think using Hugging Face as a database for unlabeled URLs is not needed: we can track batches by ID in GitHub, without putting them in Hugging Face before they're labeled. Hopefully this is much simpler.