biglocalnews / clean-scraper

Scraper library and CLI tool to harvest police bodycam footage and other files as part of the Community Law Enforcement Accountability Project (CLEAN)
https://biglocalnews.org/content/collaborations/clean.html
Apache License 2.0

Fix Muckrock filename collisions #165

Open stucka opened 3 weeks ago

stucka commented 3 weeks ago

Additional details

The Muckrock suite is working exactly as planned, except that some agencies are sending multiple files with the same filename, like files.zip and images.png, in the same FOI package, just under different branches of the "communication" key in the Muckrock JSON return. Our scraper code then makes only one of those files available.

So a few tweaks of the API handling should get us there without too much of a lift:

import the pathlib module for better filename handling

Build an empty dictionary to identify duplicates

Go through the communication['files'] entries

If a filename native to the API does not exist, add it to the dictionary

For each filename in the tally, increase the count by 1

In the existing parser, the loop for communication in communications: should use enumerate to track which sequential entry each file belongs to. Store that entry number within the CLEAN scraper properties.

If the filename appears in the duplicate dictionary with a count greater than 1, we need to append an identifier. For example, if entries 18 and 23 both had files.zip, the CLEAN filenames would become files_convo18.zip and files_convo23.zip using pathlib.
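The steps above could be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the "url" key, the helper names api_filename and build_local_filenames, and the shape of the returned dictionaries are all placeholders for whatever the Muckrock JSON and the CLEAN scraper really use.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def api_filename(file_entry):
    """Return the filename portion of a file's URL.

    The "url" key is a hypothetical stand-in for the real API field.
    """
    return PurePosixPath(urlparse(file_entry["url"]).path).name


def build_local_filenames(communications):
    # First pass: tally how often each API filename occurs across all communications.
    tally = Counter(
        api_filename(f)
        for communication in communications
        for f in communication.get("files", [])
    )

    # Second pass: keep unique names untouched, suffix only the colliding ones.
    results = []
    for index, communication in enumerate(communications):
        for f in communication.get("files", []):
            name = PurePosixPath(api_filename(f))
            if tally[name.name] > 1:
                local = f"{name.stem}_convo{index}{name.suffix}"
            else:
                local = name.name
            results.append({"asset_url": f["url"], "name": local, "convo": index})
    return results
```

One caveat with this sketch: if a single communication carries two files with the same name, the entry number alone will not separate them, which is one more reason to keep the accountability check.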

We also need some accountability here, as @tarakc02 has identified:

confirm that len(set(asset_urls)) == len(set(local_filenames))

This approach should minimize the filename changes but allow everything to get saved on our end.

stucka commented 3 weeks ago

Proposed tweaks to this approach: We've got the file uploaded datetime. I think it'd be better to rename these things like files_from_2022-09-30.zip if that works.
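A rough sketch of the date-based rename, assuming the upload datetime arrives as an ISO 8601 string that datetime.fromisoformat can parse. The function name dated_filename is hypothetical, and the real field format in the Muckrock JSON should be confirmed before relying on this.

```python
from datetime import datetime
from pathlib import PurePosixPath


def dated_filename(filename, uploaded):
    """Insert the upload date between the stem and the extension.

    `uploaded` is assumed to be an ISO 8601 string such as
    "2022-09-30T14:05:00"; only the date portion is kept.
    """
    day = datetime.fromisoformat(uploaded).date().isoformat()
    path = PurePosixPath(filename)
    return f"{path.stem}_from_{day}{path.suffix}"


print(dated_filename("files.zip", "2022-09-30T14:05:00"))
# files_from_2022-09-30.zip
```

Two same-named files uploaded on the same day would still collide under this scheme, so the uniqueness test remains necessary.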

I think we also want another test: len(set(asset_urls)) == len(set(local_filenames)) == len(set(file_entities))
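That test could look something like the sketch below. The helper name check_unique_counts is hypothetical, and it assumes each collection holds hashable values (strings or IDs rather than raw dictionaries), since set() cannot hold dicts.

```python
def check_unique_counts(**collections):
    """Raise ValueError unless every named collection has the same number of unique values."""
    counts = {name: len(set(values)) for name, values in collections.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"Unique counts differ: {counts}")
    return counts


# Example usage with made-up values:
check_unique_counts(
    asset_urls=["https://example.com/a/files.zip", "https://example.com/b/files.zip"],
    local_filenames=["files_convo18.zip", "files_convo23.zip"],
    file_entities=["id-1", "id-2"],
)
```

Reporting the per-collection counts in the error message makes it easier to see which list came up short when a collision slips through.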