Closed josh-chamberlain closed 6 months ago
@josh-chamberlain Can you share your Figma from yesterday? I believe this would be the start of the annotation pipeline, which hasn't previously been worked on directly
@mbodeantor sure here it is. I intended it to be temporary, so I ended up making a PR to this repo making some changes to the flowchart #44
the annotation pipeline itself accepts similar data to the identification pipeline, except humans label it instead of the ML modules.
Okay, great. I think it makes the most sense to include this as an optional first step in identification.py, with a command-line flag to trigger annotation.
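Marty's proposal above could be sketched with `argparse` roughly as follows (a minimal illustration only; the `--annotate` flag name is my assumption, not something decided in this thread):

```python
import argparse


def parse_args(argv=None):
    """Parse identification-pipeline arguments.

    The --annotate flag name is hypothetical; it stands in for whatever
    switch would trigger the human-annotation step first.
    """
    parser = argparse.ArgumentParser(description="Identification pipeline")
    parser.add_argument(
        "--annotate",
        action="store_true",
        help="Run the annotation step before identification",
    )
    return parser.parse_args(argv)


args = parse_args(["--annotate"])
print(args.annotate)  # True when the flag is passed
```

The pipeline entry point would then branch on `args.annotate` before running identification.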
While Josh has cautioned me against taking on too many issues at once, this issue is upstream of #16, so I think it needs to be addressed before I can tackle that one.
@mbodeantor that sounds fine, but I think one will almost always want to do annotation separately from identification, and we only need to get new batches of URLs periodically.
@josh-chamberlain @mbodeantor I've created a draft pull request for a 'CommonCrawler' module, #45. So far, this does three things:
Here's a result from me testing this:
```
Found 1291 pages for *.gov
Estimated 19365000 records for *.gov
Found 14722 records for *.gov
Found 98 Unique URL Roots from 14722 URLs
['tock.18f.gov', 'wwiiregistry.abmc.gov', '400yaahc.gov', 'ncler.acl.gov', 'elderjustice.acl.gov', 'www.abletn.gov', 'aberdeenwa.gov', 'www.acl.gov', 'methods.18f.gov', 'www.abseconnj.gov', 'guides.18f.gov', 'api-all-the-x.18f.gov', 'olderindians.acl.gov', 'natc.acl.gov', 'oam.acl.gov', 'forms.22007apply.gov:443', '9-11commission.gov', 'abilenetx.gov', 'contracting-cookbook.18f.gov', 'atf-eregs.18f.gov', 'www.2020census.gov', '911.gov', 'www.absenteeshawneetribe-nsn.gov', 'seniornutrition.acl.gov', '29palmsbomi-nsn.gov', 'specialeventapp.abilenetx.gov', 'stars.acl.gov', 'www.9-11commission.gov', 'longtermcare.acl.gov', 'accessibility.gov', 'staging.achp.gov', 'accessibility.18f.gov', 'handbook.18f.gov', 'www.400yaahc.gov', 'engineering.18f.gov', 'www.511wi.gov', 'ncapps.acl.gov', 'ux-guide.18f.gov', 'www.aberdeenmd.gov', '511wi.gov', 'agid.acl.gov', 'www.abmc.gov', 'www.abingtonpa.gov', 'before-you-ship.18f.gov', 'apstarc.acl.gov', 'naeji.acl.gov', 'derisking-guide.18f.gov', 'www.access-board.gov', 'abingtonpa.gov', 'norc.acl.gov', 'www.400yaahc.gov:443', '22007apply.gov', 'api.abmc.gov', 'www.abingtonma.gov', 'www.eldercare.acl.gov', 'ww2.abilenetx.gov', 'www2.abmc.gov', 'abingdon-va.gov', 'www.911.gov', 'acquisition.gov', '18f.gov', 'icdr.acl.gov', 'pages.18f.gov', 'digitalaccelerator.18f.gov', '511wi.gov:443', 'acl.gov', 'content-guide.18f.gov', 'DIAL.acl.gov', 'projects.511wi.gov', 'stagencea2.acl.gov', 'smpship.acl.gov', '911commission.gov', 'login-forms.22007apply.gov', 'virtual360.abmc.gov', 'micropurchase.18f.gov', 'dial.acl.gov', 'abingtonma.gov', 'www.abilityone.gov', 'ads.18f.gov', '211dupage.gov', 'www.aberdeenwa.gov', 'oaaps.acl.gov', 'nadrc.acl.gov', 'ncea.acl.gov', 'eldercare.acl.gov', 'www.achp.gov', 'aoa.acl.gov', 'abmc.gov', 'www.acquisition.gov', 'ejcc.acl.gov', 'nwd.acl.gov', 'namrs.acl.gov', 'www.abilenetx.gov', '2020census.gov', 'brand.18f.gov', 'acnj.gov', 'www.acnj.gov', 'www.911commission.gov']
Found 18 URLs with the keyword 'police'
['https://www.abilenetx.gov/police', 'http://abilenetx.gov/police', 'https://www.abilenetx.gov/police', 'https://abingdon-va.gov/featured/departments/abingdon-police-department/toggle-sidebar', 'https://www.abingtonma.gov/police-department', 'https://www.abingtonma.gov/police-department/news/officers-recovered-a-large-amount-of-jewelry-drugs-drug-paraphernalia-a-gun', 'https://www.abmc.gov/db-abmc-burial-unit/military-police-platoon-82th-airborne-division', 'https://www.abseconnj.gov/index.php/police-home', 'http://www.abseconnj.gov/police/', 'http://www.acnj.gov/Departments/police/', 'https://www.acnj.gov/Departments/police/', 'http://www.acnj.gov/Departments/police/', 'https://www.acnj.gov/Departments/police/', 'https://www.acnj.gov/Departments/police/', 'http://www.acnj.gov/Departments/police/robots.txt', 'https://www.acnj.gov/Departments/police/robots.txt', 'http://www.acnj.gov/Departments/police/robots.txt', 'https://www.acnj.gov/Departments/police/robots.txt']
```
This is not an efficient process, mind you: out of 14722 records, I found 18 that are potentially relevant. However, as a means of brute-force acquisition of potential URLs, it will do the trick. More sophisticated searches would probably involve search engine APIs (for example, getting a list of all US counties and searching for each one's police department).
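The root-extraction and keyword-filtering steps behind the output above can be sketched like this (a minimal illustration under my own assumptions, not the actual CommonCrawler code):

```python
from urllib.parse import urlparse


def unique_url_roots(urls):
    """Collapse full URLs down to their network location, e.g. 'www.acl.gov'."""
    return sorted({urlparse(url).netloc for url in urls})


def filter_by_keyword(urls, keyword):
    """Keep only URLs whose text contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [url for url in urls if keyword in url.lower()]


# Sample records standing in for a page of Common Crawl results
urls = [
    "https://www.acl.gov/about",
    "https://www.abilenetx.gov/police",
    "http://www.acnj.gov/Departments/police/",
]
print(unique_url_roots(urls))
print(filter_by_keyword(urls, "police"))
```

Pulling the records themselves would go through the Common Crawl index server (index.commoncrawl.org) with pagination, which is where most of the runtime goes.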
If, however, we consider this sufficient as a first step, I can add some extra features (for example, having it run until it iterates through all pages or a set number of pages), add unit tests, update the readme with instructions, and then submit a full pull request.
(my comments are on the PR)
Just merged this first PR. In addition to calling this script in identification_pipeline.py, to fully integrate it we should have it write to the database instead of a CSV as needed.
@mbodeantor @josh-chamberlain Understood. My next task in this issue would be to connect this to the database.
That leads to two questions: how do I connect it to the database, and what part of the database do I connect it to?
Airtable has a Python API, so that answers the "how". But where should this data be hosted?
@maxachis We are mirroring Airtable to our database in DigitalOcean, which is what serves the app data. I will DM you the database connection URL, which you can use in a tool like pgAdmin to create a table (let me know if that user doesn't have permission). For the pipeline, you can use the script the app uses: https://github.com/Police-Data-Accessibility-Project/data-sources-app/blob/main/middleware/initialize_psycopg2_connection.py
I'm currently developing PR #52, which should handle uploading the resulting URLs to the database once complete (it isn't yet).
I'm chatting with @mbodeantor about getting the proper database connections.
Additionally, @josh-chamberlain, will we want to create a GitHub Action or something comparable to automate this and other parts of the pipeline? Is that worth creating an issue about?
@mbodeantor I was able to connect to the database, although it seems I don't have permission to create tables (which is reasonable: I'm clearly a loose cannon who shouldn't be trusted 🔫).
That leads to a few questions for both you and @josh-chamberlain:
```sql
CREATE TABLE PoliceDataUrls (
    URLID SERIAL PRIMARY KEY,
    URL TEXT NOT NULL,
    SourceName VARCHAR(255),
    DateCrawled TIMESTAMP WITH TIME ZONE NOT NULL,
    Status VARCHAR(10) NOT NULL DEFAULT 'pending' CHECK (Status IN ('pending', 'processed', 'error')),
    DateProcessed TIMESTAMP WITH TIME ZONE,
    ProcessingNotes TEXT,
    DataType VARCHAR(255),
    ContentChecksum VARCHAR(255),
    CONSTRAINT url_unique UNIQUE(URL) -- Ensures that the same URL is not stored multiple times
);
```
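Uploading crawled URLs against that schema could look roughly like this (a hedged sketch: it assumes psycopg2 and a connection created elsewhere, e.g. via the app's `initialize_psycopg2_connection.py`; the `url_unique` constraint lets duplicates be skipped with `ON CONFLICT`):

```python
# Sketch only: table and column names follow the proposed DDL above.
INSERT_SQL = """
INSERT INTO PoliceDataUrls (URL, SourceName, DateCrawled)
VALUES (%s, %s, NOW())
ON CONFLICT (URL) DO NOTHING;
"""


def upload_urls(conn, urls, source_name="common_crawl"):
    """Insert URLs, silently skipping any already stored (url_unique constraint).

    `conn` is a live psycopg2 connection, e.g. psycopg2.connect(DATABASE_URL).
    """
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, [(u, source_name) for u in urls])
    conn.commit()
```

Using `ON CONFLICT (URL) DO NOTHING` means re-crawling the same pages is harmless, which matters if the crawler runs on a schedule.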
I additionally want to check in on whether a SQL database is the best choice for storing this data, as opposed to a NoSQL database. The primary argument for NoSQL is that since this storage would essentially be a "holding pen" for future processing, it may be enough to put the URLs in a big NoSQL "lump" from which we can pull them and process them as we see fit.
Then again, if we want to track duplicates, or organize by status or processing results, as we previously indicated, a relational database might be more warranted. By my understanding, NoSQL is probably sufficient for this, but I want @josh-chamberlain's and @mbodeantor's thoughts.
Sorry for the confusion; I'm not sure we need to write the output of the Common Crawl step if it's included in the pipeline. We just need to write the results of the identification pipeline before they're uploaded to Label Studio.
@mbodeantor I think having a temporary store of the URLs could be useful as a guard against accidentally losing pulled URLs because another part of the pipeline broke. It can take a while to pull a few hundred URLs from Common Crawl, so storing them somewhere for at least a short period, and eliminating them once they're processed into the next stage, could keep things more resilient.
It would additionally allow us to run different parts of the pipeline on different schedules or frequencies, if we so choose, and to debug those components more easily. For example, it might not be efficient to run batch processing on only a comparatively small number of URLs: the startup time for some components, especially the more sophisticated machine learning portions, might not be negligible.
Naturally, I'll defer to what y'all think. But that's my two cents.
@maxachis I think @josh-chamberlain's suggestion to store the URLs in HF instead of the database makes sense; this will make a clear distinction between WIP and processed data.
@josh-chamberlain @mbodeantor In that case, I'd need to know how and where to store them in HF. I assume I'd be using the Hugging Face API, but would I use one of the existing datasets, such as https://huggingface.co/datasets/PDAP/urls, or would I be putting them in a new one?
It doesn't look like we have a dataset for unlabeled URLs, so we need a new one for those to go in. I haven't interacted with the datasets on there, but I assume the API would be the easiest way. @EvilDrPurple or @bonjarlow might have some suggestions.
@maxachis @mbodeantor yeah we could use PDAP/urls on HF for that intermediate storage. What's in there now is an old, incomplete dataset, so no harm in deleting / repurposing
@maxachis I added a new dataset with a csv of about 33,000 URLs Marty had previously given me. There may be some overlap with already-labeled URLs, though. Feel free to update or change it if that's useful to you.
@EvilDrPurple @bonjarlow @mbodeantor @josh-chamberlain (goodness I'm cc'ing a lot of people here).
In that case, the next thing I need is access to the HF backend, which I don't believe I have (and I'm not sure who's responsible for granting it!). Once I do, I can get in there like some kind of cyber-raccoon and start sifting through everything. 🦝
@maxachis Looks like you have write permissions in HF, might be worth a shot.
@mbodeantor @josh-chamberlain A draft PR for the database upload process is available here: #52. The general process is that each crawl execution creates a timestamped csv, which is immediately uploaded to the Hugging Face dataset.
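The timestamped-csv-plus-upload flow can be sketched as follows (assumptions: the `huggingface_hub` library, a write token, and the `urls/` folder layout mentioned below; the filename pattern is illustrative, not necessarily what PR #52 uses):

```python
from datetime import datetime, timezone


def timestamped_filename(prefix="urls"):
    """Build a per-crawl CSV name like 'urls_20240101_010000.csv' (UTC)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}.csv"


# Hedged sketch of the upload step (requires huggingface_hub and a write token):
# from huggingface_hub import HfApi
# HfApi().upload_file(
#     path_or_fileobj=local_csv_path,
#     path_in_repo=f"urls/{timestamped_filename()}",
#     repo_id="PDAP/urls",
#     repo_type="dataset",
# )
```

Timestamping each crawl's output keeps uploads from clobbering one another and gives downstream steps a natural ordering to process.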
The data as it exists in the Hugging Face dataset can be found within the 'urls' folder. Note that my URL format differs from the one @EvilDrPurple used (mine includes data about which Common Crawl parameters yielded each URL). Let me know if that isn't acceptable.
@maxachis Do you have an example? The dataset you linked looks fine to me, but I do think any extraneous common crawl parameters should be stripped from a url if they are getting attached.
@josh-chamberlain Is there any value in retaining links to social media? Seems unlikely there would be anything relevant there
> @maxachis Do you have an example? The dataset you linked looks fine to me, but I do think any extraneous common crawl parameters should be stripped from a url if they are getting attached.
@mbodeantor Pardon me, I meant the 'urls' folder. Examples are located in there. I'll go ahead and remove the extra columns.
> @josh-chamberlain Is there any value in retaining links to social media? Seems unlikely there would be anything relevant there
@mbodeantor Unfortunately, some agencies use Facebook as their primary homepage.
@mbodeantor @josh-chamberlain If my draft PR looks like I'm on the right track, I can move forward with it.
@maxachis Yeah looks good to me
@josh-chamberlain @mbodeantor The finished pull request is #52. This PR implements both the functionality and a GitHub Action for it.
As currently designed, the Github Action will crawl 20 pages at a time, every day at 1AM.
This is probably low-balling how many pages we want to crawl at a time, but it's sufficient to test the functionality in the Github Action.
Additionally, we may want to consider other search parameters, as we would eventually exhaust the supply of URLs containing 'police' and '.gov' in the current Common Crawl index.
Finally, we may want to switch to the newest Common Crawl index. The current index used is CC-MAIN-2023-50.
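The scheduled crawl described above could be expressed as a workflow along these lines (a hypothetical sketch: the workflow name, script name, and flags are my assumptions, not necessarily what the merged Action uses):

```yaml
# Hypothetical workflow sketch; script name and flags are assumptions.
name: common-crawler
on:
  schedule:
    - cron: "0 1 * * *"   # every day at 1AM UTC
  workflow_dispatch: {}    # allow manual runs for testing
jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python common_crawler.py --pages 20
```

Note that GitHub runs scheduled workflows in UTC, so "1AM" depends on which timezone was intended.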
@maxachis yeah, let's use the latest index, and update it over time! Instead of police and .gov, we can use the names of agencies or URL patterns found in our known data sources.
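Picking the latest index automatically is straightforward, since Common Crawl publishes its collection list at index.commoncrawl.org/collinfo.json (a minimal sketch; the selection-by-string-max trick relies on CC-MAIN ids being zero-padded and chronological):

```python
import json
from urllib.request import urlopen

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def latest_index(collections):
    """Pick the newest CC-MAIN index id from a collinfo.json-style list.

    Ids like 'CC-MAIN-2023-50' sort chronologically as strings because the
    year and week components are zero-padded.
    """
    ids = [c["id"] for c in collections if c["id"].startswith("CC-MAIN")]
    return max(ids)


# Live usage (network call, so commented out here):
# collections = json.load(urlopen(COLLINFO_URL))
# print(latest_index(collections))
```

The crawler could call this once at startup instead of hard-coding CC-MAIN-2023-50.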
@josh-chamberlain It might be useful to make this a separate enhancement issue, for a few reasons.
@maxachis sounds good, I'm sure we'll play with the parameters in the future and this will be more than sufficient for our first batches. In the future, when we go crawling for specific kinds of records, we can evaluate how we use parameters.
This is a simple script which fetches URLs from Common Crawl. It should accept a `keyword` like "police" or "misconduct" and return a batch of 10,000 URLs. These URLs will then go into the annotation part of the pipeline:

- URLs go into `urls-to-label` (see `urls-relevance`)
- We run the `url-relevance` model on each batch, marking urls as `relevant` and `not relevant`
- Now, we can use these datasets to generate labeling tasks in Label Studio, only for `relevant` urls
- Our scripts for triggering this behavior in various APIs could probably use GitHub Actions
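Generating Label Studio tasks from the `relevant` URLs could be sketched like this (an assumption-laden illustration: it uses Label Studio's JSON import format, where each task carries its payload under a `data` key; the field name `url` would have to match the labeling config):

```python
import json


def make_label_studio_tasks(urls):
    """Convert relevant URLs into Label Studio's JSON task-import format."""
    return [{"data": {"url": url}} for url in urls]


relevant = ["https://www.acnj.gov/Departments/police/"]
tasks = make_label_studio_tasks(relevant)
print(json.dumps(tasks, indent=2))
```

The resulting JSON can be imported through the Label Studio UI or its API to create one labeling task per URL.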