Add the raw data statistics for all published studies

sbesson commented 10 months ago

As discussed last Monday at the IDR weekly meeting, with the ongoing migration of the public downloadable data to https://ftp.ebi.ac.uk/pub/databases/IDR/, it is useful for end users to know the amount of data for each study.

This PR uses the statistics produced by the transfer command to capture the number of files and the total size in bytes for each top-level study. The last column attempts to normalize the size in TB, but arguably this can be recomputed from the third column, so I am leaving it up to the reviewers to decide whether it is useful.
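
As a rough illustration of why the last column is redundant, the normalized size can be derived from the byte count alone. This is a minimal sketch, not code from this PR: the function name `bytes_to_tb` is hypothetical, and it assumes "TB" means decimal terabytes (10^12 bytes), with an option for binary TiB (2^40 bytes) as reported by the transfer log. The byte count used is the one reported for a single study later in this thread.

```python
def bytes_to_tb(size_bytes: int, binary: bool = False) -> float:
    """Convert a byte count to TB (10**12 bytes) or TiB (2**40 bytes)."""
    # Assumption: the TSV column uses decimal TB unless stated otherwise.
    divisor = 2**40 if binary else 10**12
    return size_bytes / divisor


size = 37658654203804  # bytes, as printed in the transfer log quoted in this thread
print(round(bytes_to_tb(size), 2))               # 37.66 (decimal TB)
print(round(bytes_to_tb(size, binary=True), 3))  # 34.25 (TiB, matches the log)
```

The gap between the two figures is one reason to keep the byte count as the authoritative column and to state the unit explicitly if a normalized column is kept.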

sbesson commented 10 months ago

Note that the current state of this PR omits the data for HPA, as I am still transferring the last few folders; I will update it as soon as I have the final figures for the current data. Excluding HPA, the volume of data available for download is ~40M files for ~260TB of data.

joshmoore commented 10 months ago

This looks straightforward but I imagine quite useful. (The idea that you just append the first column to https://ftp.ebi.ac.uk/pub/databases/IDR/ is lovely :tada:) Probably the only question is whether or not keeping this up-to-date is too onerous.
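
The "append the first column" idea can be sketched in a few lines. This assumes the first column holds the study folder name exactly as it appears under the public download root (for example `idr0001-graml-sysgro`, the name seen in the transfer log in this thread); the helper name `study_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/IDR/"


def study_url(study: str) -> str:
    """Build the download URL for a study by appending its name to the base URL."""
    # Strip stray slashes so the result always has exactly one trailing slash.
    return urljoin(BASE_URL, study.strip("/") + "/")


print(study_url("idr0001-graml-sysgro"))
# https://ftp.ebi.ac.uk/pub/databases/IDR/idr0001-graml-sysgro/
```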

will-moore commented 10 months ago

This looks similar to the file we're using for stats on the IDR home page: https://raw.githubusercontent.com/IDR/idr.openmicroscopy.org/master/_data/studies.tsv I haven't compared the numbers, but we'd expect them to be the same, right? Do we need both? How will this rawdata.tsv be "viewable" on the website?

sbesson commented 10 months ago

> Probably the only question is whether or not keeping this up-to-date is too onerous.

With the current transfer script, the reported numbers are actually generated in the log e.g.

[idr-virtual@codon-slurm-login-02 ~]$ tail -n 10 completed/idr0001-graml-sysgro_out.37867557 
[2023-11-09T23:02:06] Seconds: 93172.864
[2023-11-09T23:02:06] Items: 411555
[2023-11-09T23:02:06]   Directories: 0
[2023-11-09T23:02:06]   Files: 411555
[2023-11-09T23:02:06]   Links: 0
[2023-11-09T23:02:06] Data: 34.250 TiB (37658654203804 bytes)
[2023-11-09T23:02:06] Rate: 385.457 MiB/s (37658654203804 bytes in 93172.864 seconds)
[2023-11-09T23:02:06] Updating timestamps on newly copied files
[2023-11-09T23:03:30] Completed updating timestamps
[2023-11-09T23:03:30] Completed sync

So I expect the cost of maintaining this file will be very low (but I definitely need to document the above).
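
One possible way to automate this, sketched under the assumption that every transfer log ends with `Files:` and `Data:` lines in the format shown above. The function name `parse_transfer_log` and the three-column row layout are illustrative, not part of the transfer script.

```python
import re

# Patterns match the log lines shown above, e.g.
# "[2023-11-09T23:02:06]   Files: 411555"
# "[2023-11-09T23:02:06] Data: 34.250 TiB (37658654203804 bytes)"
FILES_RE = re.compile(r"\]\s+Files:\s+(\d+)")
BYTES_RE = re.compile(r"\]\s+Data:.*\((\d+) bytes\)")

SAMPLE_LOG = """\
[2023-11-09T23:02:06] Items: 411555
[2023-11-09T23:02:06]   Directories: 0
[2023-11-09T23:02:06]   Files: 411555
[2023-11-09T23:02:06]   Links: 0
[2023-11-09T23:02:06] Data: 34.250 TiB (37658654203804 bytes)
[2023-11-09T23:02:06] Rate: 385.457 MiB/s (37658654203804 bytes in 93172.864 seconds)
"""


def parse_transfer_log(text: str) -> tuple[int, int]:
    """Return (number of files, total size in bytes) from a transfer log."""
    files = size = None
    for line in text.splitlines():
        if m := FILES_RE.search(line):
            files = int(m.group(1))
        elif m := BYTES_RE.search(line):
            size = int(m.group(1))
    if files is None or size is None:
        raise ValueError("log does not contain both Files and Data lines")
    return files, size


files, size = parse_transfer_log(SAMPLE_LOG)
# Hypothetical row layout: study name, file count, size in bytes.
print("\t".join(["idr0001-graml-sysgro", str(files), str(size)]))
```

The `Rate:` line also contains a byte count, so the pattern is anchored on `Data:` to avoid picking it up by mistake.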

> I haven't compared the numbers, but we'd expect them to be the same, right? Do we need both?

This file captures filesystem metrics for the data we receive from the submitter and make available for direct download. This includes all image files, analysis files, etc., only a subset of which is imported/registered in IDR. On the other hand, studies.tsv only reports the imaging data that is imported into OMERO and is broken down by container (screen/project) rather than being study-wide.

> How will this rawdata.tsv be "viewable" on the website?

At the moment it's not and that's something that should be discussed as we rework the download instructions.

pwalczysko commented 10 months ago

> > How will this rawdata.tsv be "viewable" on the website?
>
> At the moment it's not and that's something that should be discussed as we rework the download instructions.

I like the table and the idea. Let's remember the main purpose of this: give the downloaders (== NOT the OME Team) an overview of the approximate sizes of what they are downloading.
Give the information in such a way that it is available at the place where the download happens, at the time of the download. I would claim that even having a link on the https://ftp.ebi.ac.uk/pub/databases/IDR/ site where this table would be downloadable as, say, pdf, is far far superior to having nothing. Even an outdated table would do.

sbesson commented 10 months ago

> Give the information in such a way that it is available at the place where the download happens, at the time of the download.

Interesting, a top-level file under https://ftp.ebi.ac.uk/pub/databases/IDR/ would be an easy way to colocate this metadata with the data to be downloaded as mentioned in https://github.com/IDR/idr.openmicroscopy.org/pull/188#issuecomment-1825589649. Also thinking of the process, assuming we make the right decisions, managing this information directly under this hierarchy is even easier as the metadata can be updated directly once the data is copied.

> I would claim that even having a link on the https://ftp.ebi.ac.uk/pub/databases/IDR/ site where this table would be downloadable as, say, pdf, is far far superior to having nothing. Even an outdated table would do.

That probably gets us down to agreeing the minimal requirements for a first version:

format: currently set to TSV, mostly to match the existing tabular files in this repository. It can easily be CSV or even JSON.
data: study name, number of files and total size (bytes) are the bare minimum columns. Everything else is up for discussion (or future amendments).
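
To show how little depends on the choice of format, here is a minimal sketch that writes the same three-column table as TSV, CSV or JSON using only the Python standard library. The column names (`study`, `files`, `bytes`) and the helper `dump_table` are assumptions made for illustration; the actual header is still to be agreed.

```python
import csv
import io
import json

# Hypothetical column names for the bare minimum columns listed above.
COLUMNS = ["study", "files", "bytes"]


def dump_table(rows: list[dict], fmt: str = "tsv") -> str:
    """Serialize rows as TSV, CSV or JSON text."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt not in ("tsv", "csv"):
        raise ValueError(f"unsupported format: {fmt}")
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t" if fmt == "tsv" else ",",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"study": "idr0001-graml-sysgro", "files": 411555, "bytes": 37658654203804}]
print(dump_table(rows, "tsv"), end="")
print(dump_table(rows, "csv"), end="")
print(dump_table(rows, "json"))
```

Because the rows are plain dictionaries, adding a column later only means extending `COLUMNS`, which fits the "future amendments" point above.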

joshmoore commented 10 months ago

> Interesting, a top-level file under https://ftp.ebi.ac.uk/pub/databases/IDR/ would be an easy way to colocate this metadata with the data to be downloaded as mentioned in https://github.com/IDR/idr.openmicroscopy.org/pull/188#issuecomment-1825589649.

Agreed, though would it be easier/as-effective to just have a README per directory then?

pwalczysko commented 10 months ago

> Agreed, though would it be easier/as-effective to just have a README per directory then?

I am happy with that @sbesson

sbesson commented 10 months ago

Closing in favour of files directly hosted on the public storage infrastructure:

a top-level CSV file - see https://ftp.ebi.ac.uk/pub/databases/IDR/studies.csv
individual readme files under each study folder TBD

IDR / idr.openmicroscopy.org

Add the raw data statistics for all published studies #188