ebmdatalab / outliers

repo for outlier detection work
MIT License

Notebook to calculate summary statistics #43

Closed: LisaHopcroft closed this 2 years ago

LisaHopcroft commented 2 years ago

This PR contains a notebook that calculates summary statistics across the different entities for (1) the ratio and (2) the Z scores calculated by the outlier detection methodology.

This works, in that I get the summary statistics I want, but r.run() throws a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-65f6f5f86dd5> in <module>
----> 1 r.run()

/home/app/notebook/lib/outliers.py in run(self)
    788             for f in self._run_entity_report(e):
    789                 self.run_results[e].append(f)
--> 790                 self.toc.add_item(**f)
    791 
    792         # write out toc

TypeError: add_item() argument after ** must be a mapping, not BrokenProcessPool 

I can't see an option to turn report writing off in the Runner.run() function. How can I prevent this error from being thrown?
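For context, the message in the traceback indicates that one of the items yielded by _run_entity_report was a BrokenProcessPool exception object rather than a dict, and `**` unpacking only accepts mappings. The following is a minimal sketch of that failure mode, not the code in outliers.py: `add_item` and `results` here are hypothetical stand-ins, and the `isinstance` guard is just one possible way to skip failed items.

```python
from concurrent.futures.process import BrokenProcessPool


def add_item(**fields):
    """Hypothetical stand-in for the toc.add_item method seen in the traceback."""
    return fields


# Hypothetical results: one normal mapping, plus an exception object that
# arrived in place of a result (as the error message suggests happened).
results = [{"title": "example"}, BrokenProcessPool("worker died")]

added, skipped = [], []
for f in results:
    if isinstance(f, dict):
        # Safe to unpack: f is a mapping
        added.append(add_item(**f))
    else:
        # Keep the failure for inspection instead of unpacking it
        skipped.append(f)

# Unpacking the exception object directly raises the same kind of TypeError
try:
    add_item(**results[1])
except TypeError as err:
    message = str(err)
```

Here `message` holds the "must be a mapping" error text, while the guarded loop handles the valid item and sets the failed one aside.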

Jongmassey commented 2 years ago

Rather than running the whole report generation process with Runner.run(), you can just execute the SQL queries and fetch the results using:

r.build.run()
r.build.fetch_results()

Jongmassey commented 2 years ago

Counts of the number of chemicals for which we have data (Z scores etc) within each type of organisation.

I'm not sure this is right. This is a count of the number of distinct outlying chemicals (for the configured build, those with the top 5 or bottom 5 z scores) at each entity type.

Similarly,

Summary statistics for the z score in each organisation type

might make more sense split into high and low; otherwise you're just averaging the high-outlier z values with the low-outlier z values.
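To illustrate the point about averaging out, here is a small sketch using only the Python standard library. The z scores are made up for illustration and are not taken from the outliers data; `summarise` is a hypothetical helper, not a function from the notebook.

```python
from statistics import mean, median

# Made-up z scores for illustration only
high = [4.2, 3.8, 5.1, 3.5, 4.4]        # e.g. top 5 (high outliers)
low = [-3.9, -4.6, -3.2, -5.0, -3.7]    # e.g. bottom 5 (low outliers)


def summarise(values):
    """Return basic summary statistics for a list of z scores."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# Pooling both directions: positive and negative values largely cancel
pooled = summarise(high + low)

# Summarising each direction separately keeps the magnitudes visible
by_direction = {"high": summarise(high), "low": summarise(low)}
```

With these made-up values the pooled mean sits close to zero, while the per-direction means stay well away from it, which is the reason for reporting high and low outliers separately.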

LisaHopcroft commented 2 years ago

Ah, OK. I'll reword and split the results into top 5 and bottom 5.

LisaHopcroft commented 2 years ago

Following your comment about the median value in the grant report, I revisited the queries and the notebook I used to generate the overall summary statistics and the outlying summary statistics respectively.

In doing so, I realised that I was only looking at the top/bottom 5, so I have updated the notebook above to look at the top/bottom 10, which will change the figures in the table.

LisaHopcroft commented 2 years ago

For our records, the query that calculates the medians is here.