emilycantrell / stork_oracle_cbs


Set up infrastructure for exporting results #3

Open emilycantrell opened 3 weeks ago

emilycantrell commented 3 weeks ago

Tommy & Emily discussed this on 2024-08-23.

To do:

Proposed columns to include in the export file (tentative and in need of more discussion; these are notes about ideas, not a spec of actual column names):

emilycantrell commented 2 weeks ago

This afternoon, Tommy and I met and discussed what results we want to export, and how to format the exported file.

What results to export

We want to export the following information:

Notes:

Clearly we also need to figure out better terminology :) The word "pipeline" might be useful here, but for now I am trying to be as explicit as possible about what I'm referring to.

How to format the export file

We also discussed how to format the export file. I've attached a draft of how the export file might be formatted. This is far from final, but we can use it as a starting place for discussion tomorrow.

draft_export_format.xlsx

We are inclined to use CSV as the export format for two reasons: (1) in R, it's easiest to work with rectangular data, and (2) the CBS export guidelines are based on the number of "cells" that will be exported, so CBS is probably most familiar with rectangular exports. However, we are open to using JSON or some alternative format if there is a compelling reason.

Meeting tomorrow

@HanzhangRen @msalganik @jcperdomo Figuring out what to export and how to format it has been more challenging than I expected. When we have our Stork Oracle meeting tomorrow, I'd like to start by discussing WHAT we want to export. Once we finalize that decision, we can return to HOW we want to export it.

emilycantrell commented 2 weeks ago

Discussed 2024-08-29 (Tommy & Emily meeting):

emilycantrell commented 2 weeks ago

Discussed 2024-08-29 (Stork Oracle meeting):

Updated list of things that we want to have in the results:

See: https://www.tidyverse.org/blog/2022/11/model-calibration/

emilycantrell commented 2 weeks ago

During the meeting we talked about exporting calibration plot data as follows: for each decile of probabilities (or whatever percentile/width we choose), create a column that contains the fraction of positives. Separately, we would also have a column with a scalar metric of calibration.
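A minimal sketch of the binning step, written in Python for illustration (the same logic carries over to R). It assumes equal-width bins of width 0.1 on the predicted probability, which is only one of the options mentioned above. The function name `calibration_bins` and the output field names are hypothetical, and the data is fake.

```python
from collections import defaultdict


def calibration_bins(probs, labels, n_bins=10):
    """Return one dict per non-empty bin with the fraction of positives.

    Assumes equal-width bins on [0, 1]; quantile-based bins would need
    a different assignment step.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    counts = defaultdict(int)
    prob_sums = defaultdict(float)
    positives = defaultdict(int)
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        prob_sums[b] += p
        positives[b] += y
    return [
        {
            "bin": b,
            "n": counts[b],
            "mean_predicted": prob_sums[b] / counts[b],
            "fraction_positive": positives[b] / counts[b],
        }
        for b in sorted(counts)
    ]


# Fake predictions and outcomes, for illustration only.
probs = [0.05, 0.08, 0.35, 0.32, 0.38, 0.91, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1]
rows = calibration_bins(probs, labels)
for row in rows:
    print(row)
```

Each returned dict corresponds to one point on a calibration plot (mean predicted probability against fraction of positives). Whether those values end up as columns or as rows depends on the file layout.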

After the meeting, Tommy pointed out that based on the current file structure, we would not be adding additional columns, we would be adding additional rows. This would turn a dataframe that is already very long into one that is much, much longer.

We will think about whether to put the results file in wide format instead of long. Alternatively, the calibration data can be in a separate file rather than in the main results file.

msalganik commented 2 weeks ago

@varunsatish given our conversation this afternoon, you might want to read this before the meeting on Friday. Very related to what we were talking about.

emilycantrell commented 2 weeks ago

Proposal to switch scores storage from long to wide

I propose that rather than storing the scores in long format, we store them in wide format. This will substantially reduce the number of cells in the output file. For example, v1 and v2 (attached) contain identical amounts of information, but v1 has 250 cells, and v2 has 110 cells. The proportional gap between them will get even larger when we include all the scores we want to export, and especially if we add calibration data using the same format as the overall scores.

draft_export_format_v1.xlsx draft_export_format_v2_with_same_data_as_v1_to_compare_cell_counts.xlsx (data in the drafts is fake)
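To illustrate why wide format uses fewer cells, here is a small Python sketch on fake data. It does not reproduce the attached drafts. The column names (`run_id`, `model`, `metric`, `value`), the metric names, and the helper functions are all hypothetical, and the header row is assumed not to count as cells.

```python
def count_cells(rows):
    """Count data cells in a table given as a list of dicts (header excluded)."""
    return sum(len(row) for row in rows)


def long_to_wide(long_rows, id_cols, name_col, value_col):
    """Pivot long rows into one row per unique combination of id_cols."""
    wide = {}
    for row in long_rows:
        key = tuple(row[c] for c in id_cols)
        if key not in wide:
            # First time this run is seen: copy its identifying columns.
            wide[key] = {c: row[c] for c in id_cols}
        # Each metric becomes its own column in the wide row.
        wide[key][row[name_col]] = row[value_col]
    return list(wide.values())


# Fake long-format results: one row per (run, metric).
long_rows = [
    {"run_id": run_id, "model": model, "metric": metric, "value": value}
    for run_id, model, scores in [
        (1, "model_a", {"logloss": 0.41, "auc": 0.78, "f1": 0.52}),
        (2, "model_b", {"logloss": 0.45, "auc": 0.74, "f1": 0.47}),
    ]
    for metric, value in scores.items()
]

wide_rows = long_to_wide(long_rows, ["run_id", "model"], "metric", "value")
print(count_cells(long_rows), count_cells(wide_rows))  # prints: 24 10
```

In long format the identifying columns are repeated once per metric, so the gap grows with the number of metrics per run.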

Thoughts?

These drafts don't contain all the edits discussed today. I'll post a fully updated draft before we meet tomorrow.

msalganik commented 2 weeks ago

Interesting @emilycantrell. At first I had trouble noticing the difference between the two but then I figured it out. It seems like the general principle you are following is trying to reduce the amount of redundancy. V2 is shorter because it does not repeat the same information many times.

I wonder if you could take this even further by making two tables and then joining them together. For example, you could have one table that stores run_ids and data about each run (the stuff from the job file). Then you could have a results table that includes the run_ids and the results. Then you could export them and merge outside the RA.
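A rough Python sketch of the two-table idea, using fake data. Only `run_id` comes from the discussion above. The other column names and the `join_on_run_id` helper are hypothetical, and the real job file fields may look quite different.

```python
# Table 1: one row per run, holding settings that would come from the job file.
runs = [
    {"run_id": 1, "model": "model_a", "train_size": 1000},
    {"run_id": 2, "model": "model_b", "train_size": 1000},
]

# Table 2: results keyed by run_id only, so run settings are not repeated.
results = [
    {"run_id": 1, "metric": "logloss", "value": 0.41},
    {"run_id": 1, "metric": "auc", "value": 0.78},
    {"run_id": 2, "metric": "logloss", "value": 0.45},
    {"run_id": 2, "metric": "auc", "value": 0.74},
]


def join_on_run_id(runs, results):
    """Attach run settings to each result row (a left join from results)."""
    runs_by_id = {run["run_id"]: run for run in runs}
    return [{**runs_by_id[res["run_id"]], **res} for res in results]


# The merge would happen after export, outside the secure environment.
merged = join_on_run_id(runs, results)
for row in merged:
    print(row)
```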

That said, I'm not sure how much reducing the size of the output actually matters.

Here's something from Mark: "We will have to reformat this into an excel and ideally keep it below 1,000 cells across less than 8 tables. Then it will qualify as a light output. Is it possible to reformat this output in such a format and share? Then I can share that with CBS. They asked for an example wrt bulk"
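A quick way to check a candidate export against the thresholds in Mark's message, sketched in Python. This is based only on the numbers quoted above (below 1,000 cells, less than 8 tables). The actual CBS rules may count cells differently, for example by including header rows, and the function name `qualifies_as_light_output` is hypothetical.

```python
def qualifies_as_light_output(tables, max_cells=1000, max_tables=8):
    """Check a set of tables against the quoted thresholds.

    tables maps a table name to a list of rows, each row a list of cell values.
    Assumes header rows are not counted and that both limits are strict.
    """
    total_cells = sum(len(row) for rows in tables.values() for row in rows)
    return len(tables) < max_tables and total_cells < max_cells


# Fake tables, sized only to show both outcomes.
too_big = {"results": [[0] * 10 for _ in range(102)]}  # 1,020 cells
small_enough = {
    "runs": [[0] * 3 for _ in range(10)],      # 30 cells
    "results": [[0] * 10 for _ in range(90)],  # 900 cells
}

print(qualifies_as_light_output(too_big))       # prints: False
print(qualifies_as_light_output(small_enough))  # prints: True
```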

msalganik commented 2 weeks ago

@varunsatish I think this comment from Mark from Slack is important:

"We will have to reformat this into an excel and ideally keep it below 1,000 cells across less than 8 tables. Then it will qualify as a light output. Is it possible to reformat this output in such a format and share? Then I can share that with CBS. They asked for an example wrt bulk"

Do you think we can get the Cruijff results out with fewer than 1,000 cells and fewer than 8 tables?

varunsatish commented 2 weeks ago

@msalganik Yes. The untidy version I showed you yesterday was at about 1020 cells.

emilycantrell commented 2 weeks ago

Discussed 2024-08-30 (EC VS HR MS):

We will store results files on OneDrive, using a descriptive naming scheme that includes the date and the purpose of the run.