LSSTDESC / rail_attic

Redshift Assessment Infrastructure Layers
MIT License

Issue/369/goldenspike pipeline #372

Closed OliviaLynn closed 11 months ago

OliviaLynn commented 1 year ago

For LSSTDESC/rail_hub#15.

(Also: fixed some outdated goldenspike.ipynb paths I'd missed in an earlier PR)

codecov[bot] commented 1 year ago

Codecov Report

Patch and project coverage have no change.

Comparison is base (393d03e) 100.00% compared to head (49c818a) 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##              main   LSSTDESC/RAIL#372   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           38        38
  Lines         2590      2590
=========================================
  Hits          2590      2590
```

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `100.00% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=LSSTDESC#carryforward-flags-in-the-pull-request-comment) to find out more.


eacharles commented 1 year ago

I just saw your comment about the base_catalog.pq file versus the pretrained_flow.pkl file. These are actually very different things. The pretrained_flow file is a model made by the FlowModeler stage, which I think is called something like flow_creator_train.
The base_catalog.pq file is an input file that flow_creator_train uses to make a new version of the model, i.e., a not_so_pretrained_flow.pkl file.

Does that help clarify things?

OliviaLynn commented 1 year ago

Sort of--but the pre-existing pipeline uses

inputs:
  flow: examples/goldenspike/data/pretrained_flow.pkl

while the pipeline output by running goldenspike.ipynb uses

inputs:
  input: <mypathtorail>/examples/goldenspike_examples/data/base_catalog.pq

What should be used as the input?

If I'm following, I'd guess we would want the model file pretrained_flow.pkl, not the new version of the model that I'm assuming is generated by running goldenspike (is that what's happening here?). But that brings us to the pkl problem: changing this PR's pipeline input to point to input: examples/goldenspike/data/pretrained_flow.pkl results in this flow_modeler.out:

Traceback (most recent call last):
  File "/Users/orl/miniconda3/envs/freshrail/lib/python3.10/site-packages/tables_io/types.py", line 206, in fileType
    return FILE_FORMAT_SUFFIXS[fmt]
KeyError: 'pkl'
...
KeyError: 'Unknown file format pkl, supported types are{list(FILE_FORMAT_SUFFIXS.keys())}'
Executing stage: FlowModeler @ 2023-05-09 22:45:47.753948
Inserting handle into data store.  input: examples/goldenspike/data/pretrained_flow.pkl, flow_modeler
Stage failed: FlowModeler @ 2023-05-09 22:45:47.753998 after 0.00 minutes
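
The traceback above comes from a lookup of the file suffix in a table of known tabular formats. A minimal sketch of that kind of check, assuming a made-up suffix table (`TABLE_SUFFIXES` and `table_format` are hypothetical names for illustration, not the tables_io API or its real list of formats):

```python
from pathlib import Path

# Hypothetical suffix table for illustration only; NOT the real tables_io mapping.
TABLE_SUFFIXES = {"pq": "parquet", "hdf5": "hdf5", "fits": "fits"}


def table_format(path: str) -> str:
    """Return the table format implied by a file suffix, or raise KeyError."""
    suffix = Path(path).suffix.lstrip(".")
    try:
        return TABLE_SUFFIXES[suffix]
    except KeyError:
        supported = sorted(TABLE_SUFFIXES)
        raise KeyError(
            f"Unknown file format {suffix}, supported types are {supported}"
        ) from None


# A parquet catalog passes the suffix check.
print(table_format("examples/goldenspike_examples/data/base_catalog.pq"))

# A pickled model does not: it is not a table, so the lookup fails.
try:
    table_format("examples/goldenspike/data/pretrained_flow.pkl")
except KeyError as err:
    print("rejected:", err)
```

Under that assumption, a stage that expects a data table rejects a .pkl path before it reads anything, which would be consistent with the stage failing after 0.00 minutes.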

eacharles commented 1 year ago

Ok, so, the inputs block holds the global inputs for the pipeline. Basically this means any inputs to any stages that aren't connected to the outputs of other stages.

The original version of the pipeline started with a FlowCreator, which takes a model file that it refers to as 'flow', so that the inputs block looks like this:

inputs:
  flow: examples/goldenspike/data/pretrained_flow.pkl

The new version of the pipeline starts with a FlowModeler, which takes a data file that it refers to as "input" and creates the model file that it then passes to the FlowCreator, so in that case the global inputs look like this:

inputs:
  input: <mypathtorail>/examples/goldenspike_examples/data/base_catalog.pq

So, for the new version of the pipeline we will want something like above.
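
The rule for global inputs described above can be sketched in a few lines of Python. This is an illustration only, not how the pipeline code computes it: the `global_inputs` helper and the stage dictionaries are hypothetical, and the tag names other than "flow" and "input" (including the name of the tag linking FlowModeler to FlowCreator) are assumptions.

```python
def global_inputs(stages):
    """Return the input tags that no stage in the pipeline produces."""
    produced = {tag for spec in stages.values() for tag in spec["outputs"]}
    needed = {tag for spec in stages.values() for tag in spec["inputs"]}
    return sorted(needed - produced)


# Original pipeline: starts with a FlowCreator that reads a model file as "flow".
# The "catalog" output tag is an assumed name.
original = {
    "flow_creator": {"inputs": ["flow"], "outputs": ["catalog"]},
}

# New pipeline: a FlowModeler reads a data file as "input" and hands its model
# to the FlowCreator. The linking tag is assumed to be "flow" here.
new = {
    "flow_modeler": {"inputs": ["input"], "outputs": ["flow"]},
    "flow_creator": {"inputs": ["flow"], "outputs": ["catalog"]},
}

print(global_inputs(original))  # ['flow']
print(global_inputs(new))       # ['input']
```

With a FlowModeler in front, the model file is produced inside the pipeline, so it drops out of the inputs block and only the data file remains.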