MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

include OAI harvest details in Job details #374

Closed ghukill closed 5 years ago

ghukill commented 5 years ago

Per a recommendation, include some statistics / insight into an OAI harvest. Specifically, as low-hanging fruit: which sets were harvested, and how records are distributed across those sets.

antmoth commented 5 years ago

So, I made it do this: [image]

My main concern is that the code I'm using to do this seems inefficient. That may or may not matter, and if it does matter, it may or may not be possible to improve. See https://github.com/MI-DPLA/combine/pull/422

ghukill commented 5 years ago

Oooooo, this is awesome! I think this is exactly what some people have been asking for.

I do think you're right, though: as it stands now, looping through all the records may ultimately be inefficient for large Jobs. Thankfully, I think we could lean on the Django/Mongo ORM to pull these counts pretty quickly.

Don't have a one liner handy, but something like the following might work:

# get Job's records as QuerySet (MongoEngine, but very similar to native Django SQL ORM)
job_records = Job.objects.get(pk=224).get_records()

# get OAI sets from Job records
job_records.values_list('oai_set').distinct('oai_set')
Out[21]: 
['wayne:collectioncfai',
 'wayne:collectionhermanmiller',
 'wayne:collectionrencen',
 'wayne:collectionmim']

# count the records in a single OAI set
job_records.filter(oai_set='wayne:collectioncfai').count()
Out[22]: 2292
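
Put together, the two calls above could be looped to build a per-set count. This is only a sketch: `records_per_set` and the `InMemoryRecords` stand-in are hypothetical names used so the example runs without a database, and the demo counts are made up. It assumes the real QuerySet exposes `distinct`, `filter`, and `count` as in the snippet above.

```python
class InMemoryRecords:
    """Hypothetical stand-in for a MongoEngine QuerySet of records (demo only)."""

    def __init__(self, records):
        self._records = list(records)

    def distinct(self, field):
        # Unique values of a field, in first-seen order
        return list(dict.fromkeys(r[field] for r in self._records))

    def filter(self, **kwargs):
        # Keep only the records whose fields equal the given values
        return InMemoryRecords(
            r for r in self._records
            if all(r.get(k) == v for k, v in kwargs.items())
        )

    def count(self):
        return len(self._records)


def records_per_set(job_records):
    """Map each OAI set to its record count: one distinct() plus one count() per set."""
    return {
        oai_set: job_records.filter(oai_set=oai_set).count()
        for oai_set in job_records.distinct('oai_set')
    }


# Made-up demo data, not real harvest numbers
demo = InMemoryRecords(
    [{'oai_set': 'wayne:collectioncfai'}] * 3
    + [{'oai_set': 'wayne:collectionmim'}] * 2
)
print(records_per_set(demo))
```

This issues one count query per set, so the cost grows with the number of sets rather than the number of records.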

Then, to avoid calculating these each time a Job is loaded, one option may be to store them in the Job's job_details, a JSON object that already stores exactly these kinds of things (field mapping metrics, etc.). They could be stored when the Job finishes, or the first time the Job is loaded.

But it's looking awesome. Happy to keep spitballing, but in short, I would think leaning on the ORM might be a good option. And if necessary, if that ends up being costly for huge jobs, we could count with Spark and write to job_details that way.
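
A rough sketch of the "compute on first load, then reuse" idea for job_details might look like the following. Everything here is an assumption for illustration: `DemoJob`, `get_oai_set_counts`, and the `oai_set_counts` key are hypothetical, and the sketch assumes job_details is held as a JSON string that is written back with `save()`.

```python
import json


class DemoJob:
    """Hypothetical stand-in for a Job; assumes job_details is a JSON string."""

    def __init__(self):
        self.job_details = json.dumps({})
        self.saves = 0

    def save(self):
        # A real model would persist to the database here
        self.saves += 1


def get_oai_set_counts(job, compute_counts, key='oai_set_counts'):
    """Return cached per-set counts, computing and storing them on first access."""
    details = json.loads(job.job_details or '{}')
    if key not in details:
        # Not cached yet: run the expensive count once and write it back
        details[key] = compute_counts()
        job.job_details = json.dumps(details)
        job.save()
    return details[key]


calls = []

def expensive_count():
    # Placeholder for the real ORM (or Spark) count
    calls.append(1)
    return {'wayne:collectioncfai': 2292}


job = DemoJob()
first = get_oai_set_counts(job, expensive_count)
second = get_oai_set_counts(job, expensive_count)
print(first == second, len(calls), job.saves)
```

The second call reads the stored value, so the count itself runs only once per Job.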

antmoth commented 5 years ago

It turns out that MongoEngine totally has a QuerySet method to do this: item_frequencies. New commit pushed!
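
For display, the frequencies could be turned into rows with a share of the total. The sketch below assumes the result of a call like `job_records.item_frequencies('oai_set')` is a plain dict mapping each set name to its record count; `distribution_rows` is a hypothetical helper, and the sample counts are made up.

```python
def distribution_rows(frequencies):
    """Turn {set_name: count} into (set_name, count, percent) rows, largest set first."""
    total = sum(frequencies.values())
    rows = []
    # Sort by descending count, then by name so ties are stable
    for name, count in sorted(frequencies.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        percent = round(100 * count / total, 1) if total else 0.0
        rows.append((name, count, percent))
    return rows


# Made-up sample in the assumed {set_name: count} shape
sample = {'wayne:collectionmim': 1, 'wayne:collectioncfai': 3}
for row in distribution_rows(sample):
    print(row)
```

Sorting on `str(name)` keeps the helper from failing if some records carry no set and show up under a `None` key.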

ghukill commented 5 years ago

Brilliant! 😎