IDR / idr.openmicroscopy.org

Source for the IDR static website.
https://idr.openmicroscopy.org/about
Creative Commons Attribution 4.0 International

Update release stats #92

Open dominikl opened 4 years ago

dominikl commented 4 years ago

After each release the stats have to be updated. Most figures can be acquired via `omero fs usage` and the stats.py script.

Problem 1:

studies.tsv wants: Study | Container | Introduced | Internal ID | Sets | Wells | Experiments (wells for screens, imaging experiments for non-screens) | Targets (genes, small molecules, geographic locations, or a combination of factors (idr0019, 26, 34, 38)) | Acquisitions | 5D Images | Planes | Size (TB) | Size | # of Files | avg. size (MB) | Avg. Image Dim (XYZCT)

From stats.py you'll get: Container | ID | Set | Wells | Images | Planes | Bytes

Example: idr0052-walther-condensinmap/experimentA | 752 | 44 of 54 | 0 | 282 | 699360 | 85.4 GB

What does 44 of 54 sets mean? What is Bytes, does that have to be used for Size (TB) and Size?

`omero fs usage` gives you something like Total disk usage: 115773571855 bytes in 25 files. What about this size? And is the 25 files the # of Files?
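If `omero fs usage` keeps the output format quoted above, the two numbers can be pulled out with a small parser. `parse_fs_usage` below is a hypothetical helper (not part of idr-utils), and the line format it matches is an assumption based on this one example:

```python
import re

def parse_fs_usage(output):
    """Extract (total_bytes, n_files) from `omero fs usage` output.

    The matched line format ("Total disk usage: N bytes in M files")
    is an assumption based on the example quoted in this issue.
    """
    match = re.search(
        r"Total disk usage:\s*(\d+)\s*bytes in\s*(\d+)\s*files", output
    )
    if match is None:
        raise ValueError("unrecognised `omero fs usage` output")
    return int(match.group(1)), int(match.group(2))

# Example line quoted above:
total_bytes, n_files = parse_fs_usage(
    "Total disk usage: 115773571855 bytes in 25 files"
)
```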

The workflow doc has an HQL query for obtaining the Avg. Image Dim (XYZCT), but only for projects, not for screens.

And how to get Targets? As this can be multiple things, I can't think of an easy/generic script which could go through any annotation.csv and pull the number of unique 'targets'.
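For illustration, the counting itself is easy once the target column(s) are known; the hard part, as noted above, is that each study names them differently. `count_unique_targets` and the "Gene Symbol" column in the usage sketch are hypothetical:

```python
import csv

def count_unique_targets(rows, columns):
    """Count distinct non-empty values across the given column(s).

    `rows` is any iterable of dicts, e.g. a csv.DictReader. The caller
    must still know which column(s) hold the study's targets, which is
    exactly the per-study knowledge this issue says a generic script lacks.
    """
    targets = set()
    for row in rows:
        for col in columns:
            value = (row.get(col) or "").strip()
            if value:
                targets.add(value)
    return len(targets)

# Usage sketch ("Gene Symbol" and the file name are illustrative):
# with open("idrNNNN-annotation.csv", newline="") as f:
#     print(count_unique_targets(csv.DictReader(f), ["Gene Symbol"]))
```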

Problem 2:

releases.tsv wants: Date | Data release | Code version | Sets | Wells | Experiments | Images | Planes | Size (TB) | Files (Million) | DB Size (GB)

From stats.py you'll get some of it: Container | ID | Set | Wells | Images | Planes | Bytes

Total | | 13044 | 1213175 | 9150589 | 65571290 | 334.2 TB
But where to get Files (Million) from? And how to get DB Size (GB)?

/cc @sbesson wasn't really sure where to open the issue, here (stats) or idr-utils (stats.py script).

manics commented 4 years ago

In addition we have a spreadsheet which is almost but not quite the same format as these tsv files. It'd be good to make sure the solution here is also correct for the spreadsheet (or maybe we can get rid of it?)

joshmoore commented 4 years ago

> What does 44 of 54 sets mean?

Part of this is the split between "Plates" and "Datasets". I also often have to figure it out by context. Happy to have the output format from the script be made more explicit.

> What is Bytes, does that have to be used for Size (TB) and Size?

Bytes from stats.py was my first attempt at a size via SQL. It was pointed out that 1) my query was wrong and 2) it doesn't match what `fs usage` was providing. The best option is likely to remove it.

> What about this size?

Size in TB is just an easier-to-read version of Size.

> And is the 25 files the # of Files?

Yes.

> And how to get Targets?

This is a difficult one; it likely hasn't been maintained, or even defined, since Eleanor left.

> But where to get Files (Million) from?

Again, this is just an easier-to-read version of Files.

> And how to get DB Size (GB)?

I think we have some diversity here. I'd suggest `SELECT pg_database_size('idr')` as the basis for most of the values.
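The answers above suggest these columns are plain unit conversions of the raw byte and file counts. A minimal sketch, assuming decimal units (TB = 10^12 bytes, GB = 10^9 bytes); whether the existing tsv files use decimal or binary units is not confirmed here:

```python
def to_tb(n_bytes):
    """Size (TB) column: decimal terabytes, one decimal place.
    Binary TiB would differ by roughly 10%; decimal is an assumption."""
    return round(n_bytes / 10**12, 1)

def to_millions(n_files):
    """Files (Million) column: the file count expressed in millions."""
    return round(n_files / 10**6, 1)

def db_size_gb(n_bytes):
    """DB Size (GB) from the byte count returned by
    SELECT pg_database_size('idr')."""
    return round(n_bytes / 10**9, 1)
```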

> In addition we have a spreadsheet which is almost but not quite the same format as these tsv files. It'd be good to make sure the solution here is also correct for the spreadsheet (or maybe we can get rid of it?)

:+1: for having the solution work for both. I still use the spreadsheet, so until we have everything in one place I'd be :-1: for getting rid of it.

sbesson commented 4 years ago

A few additional comments,

Re Targets, this is a metric that is quite valuable but cannot simply be queried, for the reasons described above: it requires some knowledge of the study itself. Given it has not been maintained for a while, I am happy to discuss removing it from the maintained stats format for now, until we properly get back to it.

Re csv vs spreadsheet, I am pretty sure the headers matched when I created the tsv files. If that's not the case, I am all for re-aligning them, as it should work as cut-and-paste.

Proposed actions:

sbesson commented 4 years ago

I think https://github.com/IDR/idr-utils/pull/16/ addresses most of the issues raised above related to studies.tsv.

For releases.tsv, I think most of the columns can be computed from studies.tsv, except for the release date and the database size. I am erring on the side of a separate small script that will do this calculation and take the additional values as input parameters, or a subcommand of stats.py.
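As a sketch of that idea (the helper is hypothetical, and the column names are taken from the studies.tsv header quoted earlier in this issue):

```python
import csv

# Columns summed across studies to build one releases.tsv-style row;
# the release date and DB Size (GB) still have to be supplied separately.
SUMMED = ["Sets", "Wells", "Experiments", "5D Images", "Planes"]

def release_totals(rows):
    """Sum the per-study counts from studies.tsv rows (dicts) into
    one aggregate row. Missing or empty cells count as zero."""
    totals = {col: 0 for col in SUMMED}
    for row in rows:
        for col in SUMMED:
            value = (row.get(col) or "").replace(",", "").strip()
            if value:
                totals[col] += int(value)
    return totals

# Usage sketch (the path is illustrative):
# with open("studies.tsv", newline="") as f:
#     print(release_totals(csv.DictReader(f, delimiter="\t")))
```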