WIPACrepo / file_catalog

Store file metadata information in a file catalog
MIT License

Census/Cleanup Non-Data Files #120

Open ric-evans opened 2 years ago

ric-evans commented 2 years ago

There are files in the FC that are not data files, like .cxx files. This mostly happens during bulk ingestion. Are these abundant? If so, we should clean them up and put guardrails in place to prevent this kind of indexing in the future.
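A minimal sketch of what an ingestion-time guardrail could look like, assuming the check is done on the file's path before it is indexed. The function name and the extension list are hypothetical; only `.cxx` comes from this issue, and the real list would need to be agreed on.

```python
from pathlib import PurePosixPath

# Hypothetical denylist: only ".cxx" is mentioned in this issue,
# the other entries are illustrative assumptions.
NON_DATA_EXTENSIONS = frozenset({".cxx", ".h", ".py", ".o"})


def looks_like_non_data_file(path: str) -> bool:
    """Return True if the path's final extension is on the denylist."""
    # .suffix yields only the last extension, e.g. ".zst" for "run.i3.zst"
    return PurePosixPath(path).suffix.lower() in NON_DATA_EXTENSIONS
```

A bulk-ingestion script could call this per path and skip (or log) anything that returns True, rather than relying on a cleanup afterwards.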

blinkdog commented 1 year ago

I have some more clean-up that's specific to LTA here: https://github.com/WIPACrepo/lta/issues/236

ric-evans commented 1 year ago

If it's a prescriptive cleanup, it's doable. But crawling the FC (an actual census) is pretty much impossible with current timeout values (mongo and/or REST, I'm unsure). So we'd have to iterate through a dump, offline. @dsschult your thoughts?
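For the offline route, a rough sketch of a census over a dump, assuming the dump is newline-delimited JSON (one record per line) and that each record stores its path in a `logical_name` field. Both are assumptions about the dump format, and the function names are made up for illustration.

```python
import json
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a newline-delimited JSON dump."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def census_by_extension(records: Iterable[dict], field: str = "logical_name") -> Counter:
    """Count records per final file extension (assumed path field: `field`)."""
    counts: Counter = Counter()
    for record in records:
        name = record.get(field, "")
        # Records with no path or no extension are grouped under "<none>"
        ext = PurePosixPath(name).suffix.lower() or "<none>"
        counts[ext] += 1
    return counts
```

Because it streams the file line by line, something like `census_by_extension(iter_records(open("fc_dump.json")))` (the filename is a placeholder) never touches Mongo or REST timeouts, and `counts.most_common()` would show whether non-data extensions are abundant.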