Open cccs-ip opened 10 years ago
sha is a metadata field. As discussed, I've kicked the function off (it may take a few days to complete). It is hogging memory, so let me know if I need to kill it and write a better version.
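If a rewrite is needed, one possible direction is to hash each file in fixed-size chunks rather than reading it whole. This is only a sketch: it assumes the memory use comes from loading entire files, which has not been confirmed here. `file_sha`, the `sha1` default, and the chunk size are hypothetical choices, not the function currently running.

```python
import hashlib


def file_sha(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, reading it in chunks.

    Hypothetical helper: the algorithm and chunk size are assumptions,
    not what the existing function uses.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read at most `chunk_size` bytes at a time so that memory use
        # stays flat regardless of how large the file is.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is the same as hashing the whole file in one call, so values already written to the sha field would stay comparable as long as the algorithm matches.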
Once populated, we can use the sha metadata field to collect duplicates. This can be done with a bit of shell magic as a one-off (I could build this into a web page if needed):
In [1]: import documents.models as dm
In [2]: from django.db.models import Count
In [3]: duplicates = dm.Document.objects.values('sha').annotate(dcount=Count('sha')).filter(dcount__gt=1).order_by().iterator()
In [4]: next(duplicates)
...
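The query gives each duplicated sha and its count, but not which documents share it. For clean-up, something like the following could group document ids by sha. It is a plain-Python sketch with made-up sample rows; `group_duplicates` is a hypothetical helper, not part of the codebase.

```python
from collections import defaultdict


def group_duplicates(rows):
    """Map each sha shared by more than one document to the ids sharing it.

    `rows` is an iterable of (document id, sha) pairs.
    """
    by_sha = defaultdict(list)
    for doc_id, sha in rows:
        if sha:  # skip documents whose sha has not been populated yet
            by_sha[sha].append(doc_id)
    return {sha: ids for sha, ids in by_sha.items() if len(ids) > 1}


# Made-up sample data, for illustration only.
rows = [(1, "aa"), (2, "bb"), (3, "aa"), (4, None), (5, None)]
print(group_duplicates(rows))  # {'aa': [1, 3]}
```

In the shell the rows could presumably come from `dm.Document.objects.values_list('pk', 'sha')`, assuming sha is an ordinary field on the model.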
Cool, thanks. I will leave this open and assigned to you to think about as we start work on the uploader / importer.
It might be helpful during clean-up to see the duplicates.