cccs-web / core

CCCS' customized Django web application

quick tool to show `shasum` #180

Open cccs-ip opened 9 years ago

cccs-ip commented 9 years ago

It might be helpful during clean-up to see the duplicates.

pwhipp commented 9 years ago

sha is a metadata field. As discussed, I've kicked the function off (it may take a few days to complete). It is hogging memory so let me know if I need to kill it and write a better version.
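
For reference, a minimal sketch of computing a file digest without loading the whole file into memory. This is not the function running on the server: the name `file_sha1`, the 1 MiB chunk size, and the choice of SHA-1 (the default of the `shasum` tool) are all assumptions made for illustration, and whole-file reads are only one possible cause of high memory use.

```python
import hashlib


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 of the file at `path`, reading it in chunks.

    Hypothetical helper: the algorithm and chunk size are assumptions.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"",
        # so at most `chunk_size` bytes are held in memory at a time.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in fixed-size chunks keeps memory use roughly constant regardless of file size, at the cost of a few more read calls.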

Once populated, we can use the sha metadata field to collect up duplicates. This can be done with a bit of shell magic as a one-off (I could build this into a web page if needed):

In [1]: import documents.models as dm

In [2]: from django.db.models import Count
In [3]: duplicates = (d for d in dm.Document.objects.values('sha').annotate(dcount=Count('sha')) if d['dcount'] > 1)

In [4]: next(duplicates)
...
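
The query above yields one row per sha with its count. To see which documents share each sha, the same idea can be sketched in plain Python. This assumes the rows are `(id, sha)` pairs, for example from something like `values_list('pk', 'sha')`; the function name `group_duplicates` and the sample data are hypothetical.

```python
from collections import defaultdict


def group_duplicates(rows):
    """Map each sha shared by more than one document to its document ids."""
    by_sha = defaultdict(list)
    for doc_id, sha in rows:
        if sha:  # skip documents whose sha has not been populated yet
            by_sha[sha].append(doc_id)
    # Keep only the shas that occur more than once.
    return {sha: ids for sha, ids in by_sha.items() if len(ids) > 1}


# Made-up sample rows: documents 1 and 3 share a sha, document 4 has none.
rows = [(1, "aaa"), (2, "bbb"), (3, "aaa"), (4, None)]
print(group_duplicates(rows))  # {'aaa': [1, 3]}
```

Skipping empty values matters while the population job is still running, since otherwise every unpopulated document would be reported as a duplicate of the others.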

cccs-ip commented 9 years ago

Cool, thanks. I will leave this open and assigned to you to think about as we start work on the uploader / importer.