Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

feature: add common crawl to the pipeline #40

Closed: josh-chamberlain closed this 6 months ago

josh-chamberlain commented 7 months ago

This is a simple script which fetches URLs from common crawl. It should accept a keyword like "police" or "misconduct" and return a batch of 10,000 URLs. These URLs will then go into the annotation part of the pipeline.
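
A minimal sketch of the intended interface, just to pin down the shape (the function name, parameter names, and defaults here are assumptions, not the actual script):

def fetch_urls(keyword: str, batch_size: int = 10_000) -> list[str]:
    """Return up to batch_size candidate URLs from Common Crawl for the given keyword."""
    raise NotImplementedError  # placeholder; the real fetching logic goes here

# e.g. a batch destined for the annotation pipeline:
# urls = fetch_urls("police", batch_size=10_000)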

Now, we can use these datasets to generate labeling tasks in Label Studio, only for relevant urls

Our scripts for triggering this behavior in various APIs could probably use GitHub Actions

mbodeantor commented 7 months ago

@josh-chamberlain Can you share your Figma from yesterday? I believe this would be the start of the annotation pipeline, which hasn't previously been worked on directly

josh-chamberlain commented 7 months ago

@mbodeantor sure, here it is. I intended it to be temporary, so I ended up making a PR to this repo with some changes to the flowchart: #44

the annotation pipeline itself accepts similar data to the identification pipeline, except humans label it instead of the ML modules.

mbodeantor commented 7 months ago

Okay great, I think it makes the most sense to include this as an optional first step in identification.py then, with an optional command-line flag for annotation to trigger it
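
A minimal sketch of what that flag could look like (the flag name and wiring are assumptions, not the actual identification.py):

import argparse

parser = argparse.ArgumentParser(description="Identification pipeline")
parser.add_argument(
    "--crawl",
    action="store_true",
    help="Optionally fetch a fresh batch of URLs from Common Crawl before identification",
)
args = parser.parse_args()

if args.crawl:
    # Placeholder for the Common Crawl step; the rest of the pipeline runs either way.
    print("Running Common Crawl fetch...")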

maxachis commented 7 months ago

While Josh has cautioned me against taking on too many issues at once, this issue is upstream of #16 and hence needs to be addressed before I can tackle that one.

josh-chamberlain commented 7 months ago

@mbodeantor that sounds fine, but I think one will almost always want to do annotation separately from identification, and we only need to get new batches of URLs periodically.

maxachis commented 7 months ago

@josh-chamberlain @mbodeantor I've created a draft PR (#45) for a 'CommonCrawler' module. So far, it does three things:

  1. Pulls a page of data from a given Common Crawl index for a URL search pattern (you can't do keyword searches against the index itself; I tried)
  2. Tells you how many pages of data (and hence a rough estimate of how many records) a url search pulls
  3. Filters results based on whether a keyword is present within the url (this is done after the call to the API).
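
For reference, here's a rough sketch of the kind of index queries involved (this is not the PR code; the index name, helper names, and response handling are assumptions):

import json
import requests

# Hypothetical sketch: query the Common Crawl CDX index directly with requests.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def get_num_pages(url_pattern: str) -> int:
    # The index server reports how many result pages exist for a URL pattern.
    response = requests.get(
        CDX_ENDPOINT,
        params={"url": url_pattern, "output": "json", "showNumPages": "true"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["pages"]

def get_page(url_pattern: str, page: int) -> list[dict]:
    # Each result page is newline-delimited JSON, one record per captured URL.
    response = requests.get(
        CDX_ENDPOINT,
        params={"url": url_pattern, "output": "json", "page": page},
        timeout=60,
    )
    response.raise_for_status()
    return [json.loads(line) for line in response.text.splitlines() if line]

def filter_by_keyword(records: list[dict], keyword: str) -> list[str]:
    # Keyword filtering happens client-side, after the API call.
    return [record["url"] for record in records if keyword in record["url"]]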

Here's a result from me testing this:

Found 1291 pages for *.gov
Estimated 19365000 records for *.gov
Found 14722 records for *.gov
Found 98 Unique URL Roots from 14722 URLs
['tock.18f.gov', 'wwiiregistry.abmc.gov', '400yaahc.gov', 'ncler.acl.gov', 'elderjustice.acl.gov', 'www.abletn.gov', 'aberdeenwa.gov', 'www.acl.gov', 'methods.18f.gov', 'www.abseconnj.gov', 'guides.18f.gov', 'api-all-the-x.18f.gov', 'olderindians.acl.gov', 'natc.acl.gov', 'oam.acl.gov', 'forms.22007apply.gov:443', '9-11commission.gov', 'abilenetx.gov', 'contracting-cookbook.18f.gov', 'atf-eregs.18f.gov', 'www.2020census.gov', '911.gov', 'www.absenteeshawneetribe-nsn.gov', 'seniornutrition.acl.gov', '29palmsbomi-nsn.gov', 'specialeventapp.abilenetx.gov', 'stars.acl.gov', 'www.9-11commission.gov', 'longtermcare.acl.gov', 'accessibility.gov', 'staging.achp.gov', 'accessibility.18f.gov', 'handbook.18f.gov', 'www.400yaahc.gov', 'engineering.18f.gov', 'www.511wi.gov', 'ncapps.acl.gov', 'ux-guide.18f.gov', 'www.aberdeenmd.gov', '511wi.gov', 'agid.acl.gov', 'www.abmc.gov', 'www.abingtonpa.gov', 'before-you-ship.18f.gov', 'apstarc.acl.gov', 'naeji.acl.gov', 'derisking-guide.18f.gov', 'www.access-board.gov', 'abingtonpa.gov', 'norc.acl.gov', 'www.400yaahc.gov:443', '22007apply.gov', 'api.abmc.gov', 'www.abingtonma.gov', 'www.eldercare.acl.gov', 'ww2.abilenetx.gov', 'www2.abmc.gov', 'abingdon-va.gov', 'www.911.gov', 'acquisition.gov', '18f.gov', 'icdr.acl.gov', 'pages.18f.gov', 'digitalaccelerator.18f.gov', '511wi.gov:443', 'acl.gov', 'content-guide.18f.gov', 'DIAL.acl.gov', 'projects.511wi.gov', 'stagencea2.acl.gov', 'smpship.acl.gov', '911commission.gov', 'login-forms.22007apply.gov', 'virtual360.abmc.gov', 'micropurchase.18f.gov', 'dial.acl.gov', 'abingtonma.gov', 'www.abilityone.gov', 'ads.18f.gov', '211dupage.gov', 'www.aberdeenwa.gov', 'oaaps.acl.gov', 'nadrc.acl.gov', 'ncea.acl.gov', 'eldercare.acl.gov', 'www.achp.gov', 'aoa.acl.gov', 'abmc.gov', 'www.acquisition.gov', 'ejcc.acl.gov', 'nwd.acl.gov', 'namrs.acl.gov', 'www.abilenetx.gov', '2020census.gov', 'brand.18f.gov', 'acnj.gov', 'www.acnj.gov', 'www.911commission.gov']
Found 18 URLs with the keyword 'police'
['https://www.abilenetx.gov/police', 'http://abilenetx.gov/police', 'https://www.abilenetx.gov/police', 'https://abingdon-va.gov/featured/departments/abingdon-police-department/toggle-sidebar', 'https://www.abingtonma.gov/police-department', 'https://www.abingtonma.gov/police-department/news/officers-recovered-a-large-amount-of-jewelry-drugs-drug-paraphernalia-a-gun', 'https://www.abmc.gov/db-abmc-burial-unit/military-police-platoon-82th-airborne-division', 'https://www.abseconnj.gov/index.php/police-home', 'http://www.abseconnj.gov/police/', 'http://www.acnj.gov/Departments/police/', 'https://www.acnj.gov/Departments/police/', 'http://www.acnj.gov/Departments/police/', 'https://www.acnj.gov/Departments/police/', 'https://www.acnj.gov/Departments/police/', 'http://www.acnj.gov/Departments/police/robots.txt', 'https://www.acnj.gov/Departments/police/robots.txt', 'http://www.acnj.gov/Departments/police/robots.txt', 'https://www.acnj.gov/Departments/police/robots.txt']

This is not an efficient process, mind you -- out of 14722 records, I found 18 that are potentially relevant. However, as a means of brute-force acquiring potential URLs, this will do the trick. More sophisticated searches would probably entail making use of search engine APIs (for example -- getting a list of all counties in the US and then searching for a police department for each).

If, however, we consider this sufficient as a first step, I can add some extra features (for example, having it run until it iterates through all pages, or through a set number of them), add unit tests, update the readme with instructions for it, and then submit it as a full pull request.

josh-chamberlain commented 7 months ago

(my comments are on the PR)

mbodeantor commented 6 months ago

Just merged this first PR. In addition to calling this script in identification_pipeline.py, to fully integrate it we should have it write to the database instead of a csv as needed

maxachis commented 6 months ago

@mbodeantor @josh-chamberlain Understood. My next task in this issue would be to connect this to the database.

Which then leads to two questions: how do I connect it to the database, and what part of the database do I connect it to?

Airtable has a python API, so that answers the "how". But then where should this data be hosted?

mbodeantor commented 6 months ago

@maxachis We are mirroring Airtable to our database in DigitalOcean, which is what serves the app data. I will DM you the database connection URL, which you can use in a tool like pgAdmin to create a table (let me know if that user doesn't have permission). For the pipeline, you can use the script the app uses: https://github.com/Police-Data-Accessibility-Project/data-sources-app/blob/main/middleware/initialize_psycopg2_connection.py
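
For reference, a minimal sketch of opening that connection with psycopg2 (the environment variable name is an assumption; the linked script is the canonical approach):

import os
import psycopg2

# Assumes the DigitalOcean connection URL is supplied via an environment variable.
connection = psycopg2.connect(os.environ["DO_DATABASE_URL"])

with connection.cursor() as cursor:
    cursor.execute("SELECT version();")
    print(cursor.fetchone())

connection.close()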

maxachis commented 6 months ago

Currently developing a PR (#52) that should handle uploading the resulting urls to the database once complete (it is not yet complete).

I'm chatting with @mbodeantor about getting the proper database connections.

Additionally, @josh-chamberlain, will we want to create a GitHub Action or something comparable to automate this and other parts of the pipeline? Is that worth creating an issue about?

maxachis commented 6 months ago

@mbodeantor Was able to get connected to the database, although it seems I don't have permission to create tables (which is reasonable: I'm clearly a loose cannon who shouldn't be trusted 🔫).

That leads to a few questions for both you and @josh-chamberlain:

  1. Am I able to be given permission to create, or at least update, tables?
  2. Do we have a means for backing up and rebuilding our database, in case something goes horribly wrong with the tables?
  3. I am considering creating the following table to track urls. This would be the landing point for all urls pulled from the crawler or other sources. Let me know what you think about this structure:
CREATE TABLE PoliceDataUrls (
    URLID SERIAL PRIMARY KEY,
    URL TEXT NOT NULL,
    SourceName VARCHAR(255),
    DateCrawled TIMESTAMP WITH TIME ZONE NOT NULL,
    Status VARCHAR(10) NOT NULL DEFAULT 'pending' CHECK (Status IN ('pending', 'processed', 'error')),
    DateProcessed TIMESTAMP WITH TIME ZONE,
    ProcessingNotes TEXT,
    DataType VARCHAR(255),
    ContentChecksum VARCHAR(255),
    CONSTRAINT url_unique UNIQUE(URL) -- Ensures that the same URL is not stored multiple times
);
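
If we go this route, inserts could lean on the url_unique constraint for deduplication; here's a rough sketch, assuming a psycopg2 connection and the table above (not final code):

from datetime import datetime, timezone

def store_urls(connection, urls: list[str], source_name: str) -> None:
    # Duplicate URLs are silently skipped thanks to the unique constraint on URL.
    with connection.cursor() as cursor:
        for url in urls:
            cursor.execute(
                """
                INSERT INTO PoliceDataUrls (URL, SourceName, DateCrawled)
                VALUES (%s, %s, %s)
                ON CONFLICT (URL) DO NOTHING
                """,
                (url, source_name, datetime.now(timezone.utc)),
            )
    connection.commit()
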
maxachis commented 6 months ago

I additionally want to check in as to whether a SQL database is the best choice for storing this data, as opposed to a NoSQL database. The primary argument for NoSQL is that since this data storage would essentially be a "holding pen" for future processing, it may be enough to put them in a big NoSQL "lump" where we can pull urls out and process them as we see fit.

Then again, if we want to be able to track duplicates, or organize based on status or results from processing, as we previously indicated, a relational database might be more warranted. By my understanding, NoSQL is probably sufficient for this, but I want @josh-chamberlain's and @mbodeantor's thoughts.

mbodeantor commented 6 months ago

Sorry for the confusion; I'm not sure we need to write the output of the common crawl to the database if it is included in the pipeline. We just need to write the results of the identification pipeline before they are uploaded to Label Studio.

maxachis commented 6 months ago

@mbodeantor I think having a temporary store of the urls could be useful as a guard against accidentally losing pulled urls because another part of the pipeline broke. It can take a while to pull a few hundred urls from common crawl, so storing them somewhere for at least a short period, and then removing them once they've been passed to the next stage, could help keep things more resilient.

It would additionally allow us to run different parts of the pipeline on different schedules or frequencies, if we so choose, and to debug those components more easily. For example, it might not be efficient to run batch processing on only a comparatively small number of urls -- the startup time for some of those components, especially the more sophisticated machine learning portions, might not be negligible.

Naturally, I'll defer to what y'all think. But that's my two cents.

mbodeantor commented 6 months ago

@maxachis I think @josh-chamberlain's suggestion to store the urls in HF instead of the database makes sense; this will make a clear distinction between wip and processed data.

maxachis commented 6 months ago

@josh-chamberlain @mbodeantor In that case, I'd need to know how and where to store them in HF. I assume I'd be using the Hugging Face API, but would I use one of the existing datasets, such as https://huggingface.co/datasets/PDAP/urls, or would I be putting it in a new one?

mbodeantor commented 6 months ago

It doesn't look like we have a dataset for unlabeled URLs, so it looks like we need a new one for those to go in. I have not interacted with the datasets on there, but I assume the API would be the easiest way. @EvilDrPurple or @bonjarlow might have some suggestions.

bonjarlow commented 6 months ago

@maxachis @mbodeantor yeah we could use PDAP/urls on HF for that intermediate storage. What's in there now is an old, incomplete dataset, so no harm in deleting / repurposing

EvilDrPurple commented 6 months ago

@maxachis I added a new dataset with a csv of about 33000 urls Marty had previously given me. There may be some overlap with already labeled urls though. Feel free to update or change if it's useful to you.

maxachis commented 6 months ago

@EvilDrPurple @bonjarlow @mbodeantor @josh-chamberlain (goodness I'm cc'ing a lot of people here).

In that case, the next thing I need is access to the HF backend, which I don't believe I have already (and am not sure who's responsible for that!). Once I have it, I can get in there like some kind of cyber-raccoon and start sifting through everything. 🦝

mbodeantor commented 6 months ago

@maxachis Looks like you have write permissions in HF, might be worth a shot.

maxachis commented 6 months ago

@mbodeantor @josh-chamberlain A draft PR for the database upload process is available here #52. The general process is that each crawl execution will create a timestamped csv which will be immediately uploaded to the Huggingface Dataset.

The data as it exists in the Hugging Face dataset can be found here, within the 'urls' folder. Note that my urls are formatted differently than @EvilDrPurple's (mine include data about which common crawl parameters yielded each url). Let me know if that isn't acceptable.
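
For reference, a rough sketch of the kind of upload call involved (file names and the repo path are assumptions; the actual logic is in the PR):

from datetime import datetime, timezone
from huggingface_hub import HfApi

# Assumes a write-capable HF token is available (e.g. via `huggingface-cli login`).
api = HfApi()
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")

api.upload_file(
    path_or_fileobj=f"urls_{timestamp}.csv",    # local CSV produced by the crawl run
    path_in_repo=f"urls/urls_{timestamp}.csv",  # the 'urls' folder in the dataset
    repo_id="PDAP/urls",
    repo_type="dataset",
)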

mbodeantor commented 6 months ago

@maxachis Do you have an example? The dataset you linked looks fine to me, but I do think any extraneous common crawl parameters should be stripped from a url if they are getting attached.

mbodeantor commented 6 months ago

@josh-chamberlain Is there any value in retaining links to social media? Seems unlikely there would be anything relevant there

maxachis commented 6 months ago

> @maxachis Do you have an example? The dataset you linked looks fine to me, but I do think any extraneous common crawl parameters should be stripped from a url if they are getting attached.

@mbodeantor Pardon me, I meant the 'urls' folder. Examples are located in there. I'll go ahead and remove the extra columns.

josh-chamberlain commented 6 months ago

> @josh-chamberlain Is there any value in retaining links to social media? Seems unlikely there would be anything relevant there

@mbodeantor Unfortunately, some agencies use facebook as their primary homepage.

maxachis commented 6 months ago

@mbodeantor @josh-chamberlain If my draft PR looks like I'm on the right track, I can move forward with

  1. Finishing up the finer details of this component
  2. Developing unit and integration tests for its functionality
  3. Creating a Github Actions yaml file that could be deployed in order to ensure a constant retrieval of data.
mbodeantor commented 6 months ago

@maxachis Yeah looks good to me

maxachis commented 6 months ago

@josh-chamberlain @mbodeantor The finished pull request is in #52. This PR implements both the functionality and a Github Action for the functionality.

As currently designed, the Github Action will crawl 20 pages at a time, every day at 1AM.

This is probably low-balling how many pages we want to crawl at a time, but it's sufficient to test the functionality in the Github Action.

Additionally, we may want to consider other search parameters to use, as we would eventually exhaust the number of urls with 'police' and '.gov' in the url in the current Common Crawl index.

Finally, we may want to switch to the newest Common Crawl index. The index currently used is CC-MAIN-2023-50.

josh-chamberlain commented 6 months ago

@maxachis yeah, let's use the latest index, and update it over time! Instead of police and .gov, we can use the names of agencies or URL patterns found in our known data sources.

maxachis commented 6 months ago

> @maxachis yeah, let's use the latest index, and update it over time! Instead of police and .gov, we can use the names of agencies or URL patterns found in our known data sources.

@josh-chamberlain It might be useful to make this a separate enhancement issue, for a few reasons:

  1. This issue and its associated PRs are already quite large! Don't want to crowd it too much.
  2. Because I've designed it to be fairly flexible in terms of parameters, it shouldn't be too much trouble to swap in new ones
  3. It'll probably be useful to have a conversation about what those url patterns would look like, and to devise a way for consistently updating to the latest index (if possible)
josh-chamberlain commented 6 months ago

@maxachis sounds good, I'm sure we'll play with the parameters in the future and this will be more than sufficient for our first batches. In the future, when we go crawling for specific kinds of records, we can evaluate how we use parameters.