GraemeWatt closed this issue 1 year ago.
From the hepdata-validator module's side of things: it looks like the `DataFileValidator` already has some mechanism for working with custom data schemas, but it is not complete. It has a method `load_custom_schema(self, type, schema_file_path=None)` which accepts an optional schema path (but would not work with a remote URL). In the `validate` method it also looks for a custom schema, using the `type` field in the data:

```python
if 'type' in data:
    custom_schema = self.load_custom_schema(data['type'])
    json_validate(data, custom_schema)
```

However, the schema file path argument is not used from the `validate` method. I think it would be fairly straightforward to modify the validator to use custom schemas properly, and from remote URLs, without affecting the default schemas.
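A minimal sketch of what remote-capable schema loading could look like (illustrative names only, not the actual hepdata-validator API):

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen

def load_schema(location):
    """Load a JSON schema from a local file path or a remote http(s) URL.

    Hypothetical helper for illustration; not part of hepdata-validator.
    """
    if urlparse(location).scheme in ("http", "https"):
        with urlopen(location) as resp:  # fetch the remote schema
            return json.load(resp)
    with open(location) as f:            # otherwise treat it as a local path
        return json.load(f)
```

The existing `load_custom_schema` method could then delegate to something like this when the schema location is a URL rather than a file path.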
Before starting to work on an implementation, it would be better if the JSON schema were stable on the pyhf side. For example, the URL for `workspace.json` given above is broken, as is the redirection to https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json (see scikit-hep/pyhf#686). The `patchset.json` schema should be developed if needed (see scikit-hep/pyhf#631), noting that subdirectories are not allowed inside a HEPData tarball.

The method of telling HEPData which schema and version is being used by the data file should be decided (see scikit-hep/pyhf#740). The current hepdata-validator code is written assuming that a custom schema `type` is given in the data file itself, but it might be better to specify this information in the `submission.yaml` file, as given in my original comment above.
Cc: @lukasheinrich
adding @kratsg and @matthewfeickert.
Yes, I think we'd be happy to finalize the schema to a good base version. I agree that having the patchset schema would be good, because then the likelihood publication may only include 2 files (instead of hundreds).

Maybe @alisonrclarke can provide some instructions on what you need from us. The basic functionality of adding support for external schemas is probably independent of what those schemas entail in detail?
for visualization there is some ongoing work e.g. here
https://scikit-hep.org/pyhf/examples/notebooks/altair.html
https://www.youtube.com/watch?v=9egt9ZTm7T0
probably something based on vega / altair is sufficient for us
> Before starting to work on an implementation, it would be better if the JSON schema were stable on the pyhf side. For example, the URL for `workspace.json` given above is broken, as is the redirection to https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json (see scikit-hep/pyhf#686). The `patchset.json` schema should be developed if needed (see scikit-hep/pyhf#631), noting that subdirectories are not allowed inside a HEPData tarball.
Stability is something I'm working on.
> The method of telling HEPData which schema and version is being used by the data file should be decided (see scikit-hep/pyhf#740). The current hepdata-validator code is written assuming that a custom schema `type` is given in the data file itself, but it might be better to specify this information in the `submission.yaml` file, as given in my original comment above.
Do you need a `type` option in the schema? We currently don't have it, but it's not a backwards-compatible change for existing things (e.g. we can assume the default type is HistFactory if not specified for v1.0 but...), and it's a little confusing to me that you need an explicit `type` in the JSON when you provide a `data_schema` in the YAML. Can you clarify the roles?
Hi @GraemeWatt and @alisonrclarke can we add @sinclert to the team?
I've sent an invitation to @Sinclert to join the @HEPData organization. Let me know if Write access is needed for individual repositories. You can contribute by forking repositories and making pull requests without having Write access.
The `type` field in the data schema was (partially) implemented by @eamonnmag via https://github.com/HEPData/hepdata-validator/commit/464862ed5e4f58fdafe4e01a210faf3a01dc399a back in 2016. The `data_schema` field in the `submission.yaml` file is not yet implemented, but it was my proposal for a better way of specifying a custom schema; the `type` field in the data schema would then indeed not be needed.
The idea of giving the `data_schema` as a URL was to avoid the need to either package the pyhf JSON schemas with the hepdata-validator or build in a requirement on the whole pyhf package. The hepdata-validator would download the `data_schema` when it was needed for validation. To proceed, it would be good if the URLs to the pyhf JSON schema could be fixed (scikit-hep/pyhf#686) and if the pyhf authors could provide an example HEPData submission. I'm assuming for now that one HEPData table corresponds to one pyhf JSON file specified by the `data_file` field in the `submission.yaml` file, rather than to a pyhf tarball containing many pyhf JSON files, but let me know if this assumption should be revised.
@GraemeWatt, @sinclert -- I have provided the fix and closed the relevant issue scikit-hep/pyhf#686 via scikit-hep/pyhf#791. I would be interested to see if the `data_schema` is useful now that we have the schemas accessible via https (e.g. defs.json).
Example `submission.yml` from the sbottom multi-b HEPData record below:
```yaml
additional_resources:
- {description: Web page with auxiliary material, location: 'https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SUSY-2018-31/'}
- {description: Truth code to compute acceptance for all signal regions using the
    SimpleAnalysis framework, location: Sbottom_MB2018.cxx}
- {description: Archive of full likelihoods in the HistFactory JSON format described
    in ATL-PHYS-PUB-2019-029. Provided are 3 statistical models labeled RegionA RegionB
    and RegionC respectively each in their own sub-directory. For each model the background-only
    model is found in the file named 'BkgOnly.json'. For each model a set of patches
    for various signal points is provided, location: HEPData_workspaces.tar.gz}
- {description: 'slha files for the 3 baseline signal points used in the analysis
    for regions A,B,C', location: SbMB_SLHAs.tar.gz}
comment: ''
data_license: {description: The content can be shared and adapted but you must give
    appropriate credit and cannot restrict access to others., name: cc-by-4.0, url: 'https://creativecommons.org/licenses/by/4.0/'}
hepdata_doi: 10.17182/hepdata.89408.v1
```
@kratsg: The example `submission.yaml` extract you gave above is the current situation where pyhf JSON files are attached as a tarball given as `additional_resources`. When I asked for an "example HEPData submission" above, I meant for the case after HEPData provides native support for individual pyhf JSON files. I imagine that you would still want to include the tarball as `additional_resources`, but then the `submission.yaml` file would contain additional YAML documents, after the normal data tables, specifying `data_file` as the pyhf JSON file and `data_schema` as the URL for the pyhf JSON schema, for example:

```yaml
data_file: RegionA-BkgOnly.json
data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
```
I've just merged @Sinclert's PR (https://github.com/HEPData/hepdata-validator/pull/17) which extends the hepdata-validator to download JSON schema from remote locations. I've provided an example submission.yaml file for use with one of the tests. Please take a look.
You don't provide a JSON schema for the signal patch files like `RegionA/patch.sbottom_1300_850_60.json`, so I'm assuming that you don't want to store these in HEPData directly? Are these files meaningful in isolation (e.g. could they be visualized) or are they only meaningful when combined with the background-only workspace? Is it enough for HEPData to only support the `workspace.json` schema, so the uploader would need to combine each signal patch file with the background-only workspace (using `jsonpatch`) before specifying it as a `data_file` in a YAML document of the `submission.yaml` file?
Would you want to include all signal patch files (combined with the background-only workspace) as separate YAML documents of the `submission.yaml` file, or just a subset of them? Including all of them would give an additional 218 (RegionA) + 219 (RegionB) + 216 (RegionC) = 653 HEPData tables (one for each pyhf JSON file) on top of the current 48 data tables. Last November @lukasheinrich proposed reducing the number of pyhf JSON files to only 6 by defining a patch set (https://github.com/scikit-hep/pyhf/issues/631), but this seems to still be an open issue. Is this still something you're planning to complete? Would you provide a JSON schema for the patch set? Could the patch set be visualized in isolation from the background-only workspace? How would you expect the `submission.yaml` file to look if you adopted this approach?
Please be more precise about exactly how you expect HEPData to natively support pyhf JSON files, to avoid @Sinclert spending time on unnecessary software development. Please look at the recent changes to the hepdata-validator and let us know if further changes are necessary before I make a new release. Please provide an example HEPData submission that we can use for testing. Thank you in advance for your cooperation.
@GraemeWatt , thanks for the comments. I wasn't aware you wanted an example that was more "native". I need to think more about it.
> You don't provide a JSON schema for the signal patch files like `RegionA/patch.sbottom_1300_850_60.json`, so I'm assuming that you don't want to store these in HEPData directly? Are these files meaningful in isolation (e.g. could they be visualized) or are they only meaningful when combined with the background-only workspace? Is it enough for HEPData to only support the `workspace.json` schema, so the uploader would need to combine each signal patch file with the background-only workspace (using `jsonpatch`) before specifying it as a `data_file` in a YAML document of the `submission.yaml` file?
The reason there is no JSON schema is that they follow the jsonpatch format (see RFC 6902). The `BkgOnly.json` should always validate against `workspace.json`. The patch files describe how to modify the `BkgOnly.json` to inject a signal.
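To illustrate the mechanics, here is a hedged sketch of patch application covering only the "add" and "replace" operations (a real workflow would use the `jsonpatch` package, which implements the full RFC 6902 spec; the toy workspace below is illustrative, not a complete pyhf workspace):

```python
import copy

def apply_patch(doc, patch):
    """Apply a minimal subset of RFC 6902 ("add"/"replace") to a JSON document."""
    doc = copy.deepcopy(doc)  # never mutate the background-only workspace
    for op in patch:
        parts = op["path"].lstrip("/").split("/")
        target = doc
        for key in parts[:-1]:  # walk down to the parent of the target node
            target = target[int(key)] if isinstance(target, list) else target[key]
        last = int(parts[-1]) if isinstance(target, list) else parts[-1]
        if op["op"] == "add" and isinstance(target, list):
            target.insert(last, op["value"])  # "add" inserts into arrays
        elif op["op"] in ("add", "replace"):
            target[last] = op["value"]
        else:
            raise NotImplementedError(op["op"])
    return doc

# Toy background-only workspace and a toy signal patch (illustrative names)
workspace = {"channels": [{"name": "SR", "samples": []}]}
patch = [{"op": "add", "path": "/channels/0/samples/0",
          "value": {"name": "signal", "data": [5.0]}}]
patched = apply_patch(workspace, patch)
```

The patched document is what would then be validated against `workspace.json` and specified as a `data_file`.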
> Would you want to include all signal patch files (combined with the background-only workspace) as separate YAML documents of the `submission.yaml` file or just a subset of them?
We want them all to be available for analysis. Is it possible to include all in an upload, but only display "tables" for a subset of them?
> Last November @lukasheinrich proposed reducing the number of pyhf JSON files to only 6 by defining a patch set (scikit-hep/pyhf#631), but this seems to still be an open issue. Is this still something you're planning to complete? Would you provide a JSON schema for the patch set? Could the patch set be visualized in isolation from the background-only workspace? How would you expect the `submission.yaml` file to look if you adopted this approach?
This still won't solve the problem since, as you mentioned, these patch files would just be usurped or merged into a single one containing multiple sets of patches, where you choose which patch to apply to the `BkgOnly.json`. Either way, we would not want users to combine every single signal point before uploading.
> Please be more precise about exactly how you expect HEPData to natively support pyhf JSON files, to avoid @Sinclert spending time on unnecessary software development. Please look at the recent changes to the hepdata-validator and let us know if further changes are necessary before I make a new release. Please provide an example HEPData submission that we can use for testing. Thank you in advance for your cooperation.
Ok, I will get back to you on this.
Hi @GraemeWatt , to answer a lot of questions, I'm copying/pasting a proposal here that we came up with.
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
A few key points:

- `data_file` entries let HEPData resolve them and inline as this is parsed
- `data_schema` will point to the schema that the provided `data_file` should be validated against
- `name` and `description` are generally useful

Question:

- now that we've included `$schema` explicitly, should `type: histfactory` be dropped? is this perhaps a key/field that could be used to search for hepdata entries with a `histfactory` type?

For HEPData, I have a question for you. Is HEPData open to having pyhf installed to parse the background-only / patchsets (apply patches to the background-only as well) in order to get information out? We do have a `pyhf inspect` command that could provide information and, with the development of a patchset schema, will have support for handling patchsets in a user-friendly way.
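For a sense of what such parsing involves, a summary in the spirit of `pyhf inspect` can be computed from the workspace JSON structure alone, without a pyhf dependency. This is a hedged sketch with a toy workspace, not pyhf's actual output format:

```python
def summarize_workspace(ws):
    """Count channels, samples and measurements in a HistFactory-style workspace dict.

    Illustrative sketch only; `pyhf inspect` reports more (parameters, modifiers, etc.).
    """
    channels = ws.get("channels", [])
    sample_names = {s["name"] for ch in channels for s in ch.get("samples", [])}
    return {
        "channels": [ch["name"] for ch in channels],
        "n_samples": len(sample_names),
        "n_measurements": len(ws.get("measurements", [])),
    }

# Toy workspace (illustrative, not a full pyhf workspace)
ws = {"channels": [{"name": "SR", "samples": [{"name": "bkg"}, {"name": "sig"}]}],
      "measurements": [{"name": "meas"}]}
summary = summarize_workspace(ws)
```

Whether HEPData would do this itself or call out to pyhf is exactly the open question here.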
Dear @GraemeWatt , we have developed a patchset schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json .
Dear @GraemeWatt ,
Following Giordon converging on workspace schemas, one of our analyses in the ATLAS SUSY group (EWK 1L, already with v1 on HEPData: https://www.hepdata.net/record/ins1755298. Many thanks to Danika MacDonell for doing so) has prepared all the necessary files for the record upgrade with the new likelihood YAML format exploited.

~~We are now eager to upload and release the new format in HEPData, but at the moment the website crashes when the new format is being used.~~ see comment below

In case it helps, you can find a streamlined version of the submission tarball that we use here: https://cernbox.cern.ch/index.php/s/7lcq4OJpM9I9dz1. This contains a new 'likelihoods' data file that links to the pyhf background-only workspace as well as to the signal patches. The entire submission tarball was validated with the latest hepdata-validator (v0.2.2) and a modified check.py script from HEPData attached here in a zip file: ~~modified_check_py.zip~~ updated_check.py

~~When uploading the tarball, the error `InternalServerError: 500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.` is displayed in the HEPData sandbox.~~ see comment below
Please let us know if we can assist you with anything from our side.
Thanks, Krzysztof (as ATLAS SUSY HEPData coordinator)
Hi @butsuri43 ,
As far as I know, the newest version of `hepdata-validator` (v0.2.2) has not yet been included in the HEPData API.
probably we'd like to use the patchset schema as well for this upload @kratsg @butsuri43 ?
I thought we did use it. One thing that I realised now is that I diverged from the YAML format proposed by Giordon. Will fix that.
@lukasheinrich it uses the patchset:

```yaml
likelihoods:
  type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
Note that `data_file` and `data_schema` are what @GraemeWatt provided, so this is what we wanted. HEPData just needs to support this structure and we would be ok.
@danika -- one thing we need to make sure is to provide the appropriate region name (there's only one region - so i don't know what you would call it...)
@kratsg should this be a list? (note the dash on line 2 I added)
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
Yes it should be, good catch! @butsuri43 - can you check the format? It should be what @lukasheinrich just put above.
Hi all,
Thanks Giordon and Lukas for checking the files. Indeed, I have realized that the format was wrong and on top of that the check.py file modifications I made were rubbish as well. I have now updated the tarball (reachable by the same link, here: https://cernbox.cern.ch/index.php/s/7lcq4OJpM9I9dz1) and the check.py (https://cernbox.cern.ch/index.php/s/mDzEIeQ89IiP1Iv) file.
More importantly, now, when uploading the tarball, the HEPData website complains with the following error message: `u'independent_variables' is a required property in {'likelihoods': (...)`.
Now, this is a more understandable error to me, and it basically points to the fact that what we have in the 'likelihoodDescription.yaml' file is of the type `likelihoods`, which is not currently covered by the hepdata-validator schemas (not even in 0.2.2). At least this is my understanding, so a new schema needs to be cooked up.
Dear Krzysztof (@butsuri43),
Thanks for the messages. I've just made a commit (https://github.com/HEPData/hepdata-validator/commit/d8f6792ec21d05898bd580e751c5989994b8100b) to the validator code to display a more user-friendly error message (`There is no schema defined for the 'histfactory' data type.`) instead of the generic 500 exception. The latest released (and deployed) version of the `hepdata-validator` is 0.2.1. Subsequent changes have been made in the `master` branch of the GitHub repository, but 0.2.2 has not yet been released. In any case, the latest changes don't help solve your problems.
The schema introduced above by @kratsg is just a proposal that I haven't even commented on yet, let alone started to work on a code implementation. It's completely different from what I originally anticipated at the top of this issue and from what @Sinclert already implemented for a potential 0.2.2 release of the `hepdata-validator` package. Having two JSON files for each YAML document (a HEPData "table") of the `submission.yaml` file is completely incompatible with the existing HEPData structure. Modifying the `hepdata-validator` package is only the first step, and the majority of the work would be rewriting this main `hepdata` repository without breaking the existing functionality. I don't see that the amount of software development needed is justified by the benefits (what are they exactly?) of including the pyhf JSON files natively rather than as a tarball attachment. I intended to discuss this issue further with the pyhf developers, but they should expect to substantially contribute to code development if they really want this feature. Unfortunately, you should not expect to be able to upload individual pyhf JSON files to HEPData in the near future.
This issue is a low priority for me and other tasks will take precedence in the next few weeks, such as migrating to new infrastructure at CERN next week, then migrating from Python 2 to Python 3. But perhaps in a month or so, we can discuss if it's worthwhile to proceed further with this issue.
Best regards, Graeme Watt (HEPData Project Manager)
> But perhaps in a month or so, we can discuss if it's worthwhile to proceed further with this issue.
Sounds good. We can just follow up here then.
Hi @GraemeWatt
thanks for still considering this. I think we are happy to assist where we can. As you can see
https://github.com/scikit-hep/pyhf/issues/620#issuecomment-624638428
this is being picked up by theorists, so I think effort put into this will not go in vain (I wholeheartedly support prioritizing py2->py3 first though)
I looked at the JSON record of one of the hepdata entries e.g.
https://www.hepdata.net/record/ins1755298?format=json
and see this section
```json
{
  "data": {
    "csv": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/csv",
    "json": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/json",
    "root": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/root",
    "yaml": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/yaml",
    "yoda": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/yoda"
  },
  "description": "Signal acceptance in SR-LM high $m_{CT}$ for simplified models with $\\tilde\\chi^\\pm_1 \\tilde\\chi^0_2 \\rightarrow Wh\\tilde\\chi^0_1\\tilde\\chi^0_1, W \\rightarrow l\\nu, h \\rightarrow b\\bar{b}$...",
  "doi": "10.17182/hepdata.90607.v1/t21",
  "id": 476075,
  "location": "Tabulated data from the publication's auxiliary material Figure 4",
  "messages": false,
  "name": "Acc_table_SR_LM_High_MCT",
  "processed_name": "Acc_table_SR_LM_High_MCT"
}
```
How does this work internally? If we look at this URL: https://www.hepdata.net/record/data/90607/476075/1, I assume this is the original uploaded YAML from the submission?
I think an ideal situation would be if we could get this to show up as

```json
{
  "data": {
    "json": "https://www.hepdata.net/download/table/ins1755298/MyLikelihood/json"
  },
  "name": "MyLikelihood"
}
```
and https://www.hepdata.net/download/table/ins1755298/MyLikelihood/json would resolve to the likelihood JSON.

My understanding is that for this to happen you need the likelihood as part of the overall JSON of the record. This could be easily achieved using https://json-spec.readthedocs.io/reference.html by slightly modifying the record (see the `{$ref: file.json}` syntax):
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: {$ref: BkgOnly.json}
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: {$ref: BkgOnly.json}
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
If you use this with the hepdata-validator, all these files could be resolved to inline all the JSON:
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: {.. big chunk of JSON ... }
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: {.. big chunk of JSON ... }
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
and you'd never even notice it was originally in two files.
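A sketch of that inlining step, assuming local files only and illustrative names (in practice a JSON-reference resolver library such as json-spec would be used):

```python
import json
from pathlib import Path

def inline_refs(node, base_dir):
    """Replace {"$ref": "file.json"} nodes with the parsed file contents, recursively.

    Illustrative sketch; handles only whole-file local refs, not fragments or URLs.
    """
    if isinstance(node, dict):
        if set(node) == {"$ref"}:  # a pure reference node: load and recurse
            with open(Path(base_dir) / node["$ref"]) as f:
                return inline_refs(json.load(f), base_dir)
        return {k: inline_refs(v, base_dir) for k, v in node.items()}
    if isinstance(node, list):
        return [inline_refs(v, base_dir) for v in node]
    return node  # scalars pass through unchanged
```

After one pass over the parsed submission, the record is a single JSON document with the workspace and patchset inlined.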
We use the same strategy in yadage (part of RECAST)
https://github.com/yadage/yadage/blob/master/tests/testspecs/mapreduce/workflow.yml
where after processing the files steps.yml and workflow.yml get fused into a big JSON doc
When viewing the JSON record again, I think it'd be nice to (like in the tables) then provide a separate URL for that section of the record.
The benefits are the same as with not storing the tables themselves as "additional resources", e.g. you can visualize them better, etc.
Hi Graeme,
Thank you very much for your response. I completely understand and agree with you that this is not a high-priority item for the HEPData team. I wrote merely to 1) express our interest, as there are several analyses actively working on providing pyhf workspaces, and 2) stir up the discussion a bit to understand what is needed from our side / what would be the best format to minimize the effort required from the HEPData team. I am happy to get back to this whenever convenient for you/the team. Cheers, Krzysztof
Dear @GraemeWatt - independent of which solution we end up going with -- is it possible now to get a badge added for the analyses which currently have (and will have) HistFactory likelihoods attached to the entries? I think this is our first concern when trying to get HEPData ready to support likelihoods natively. This would massively help for identifying/searching already. We can reiterate more on the native likelihood support afterwards.
I think there is some prior art from adding Rivet badges. @GraemeWatt does it still need an external JSON file to identify the records w/ Rivet analyses, or is this now handled internally?
Small update:
We just merged the PR that allows HEPData users to easily define which remotely-defined JSON schemas are accepted from a validation point of view.

Now, from a rendering point of view, there is a not-so-clear requirement (explained by Graeme in this comment), where accepted schemas also need to be renderable, by including the following fields in their definition:

- `independent_variables`
- `dependent_variables`

Therefore, the current list of remotely-defined accepted schemas looks like this: https://github.com/HEPData/hepdata/blob/e75194b366faf166d6a94eaf177ebd6faa52dce6/hepdata/modules/records/utils/validators.py#L37-L54
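Conceptually, the acceptance check amounts to an allow-list lookup. The sketch below is illustrative only (the real list and check live in `hepdata/modules/records/utils/validators.py`, linked above):

```python
# Illustrative allow-list of remotely-defined schemas accepted for validation;
# the entries mirror the pyhf schema URLs discussed in this thread.
ACCEPTED_REMOTE_SCHEMAS = {
    "https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json",
    "https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json",
}

def is_accepted_schema(url):
    """Return True if a submission's data_schema URL is on the allow-list."""
    return url in ACCEPTED_REMOTE_SCHEMAS
```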
thanks @Sinclert - I think one would need to encapsulate this then. The `dependent_variables`, `independent_variables` are specific to the table schema, and external data types probably won't be able to guarantee semantically useful mappings to these keys.
So if the rendering could somehow be modularized such that
table -> render with dependent_variables, independent_variables
lhood -> "render using some lhood-specific data"
patchset -> "render using some patchset-specific data"
@Sinclert, to clarify, I wasn't proposing that the pyhf format should be put into `independent_variables` and `dependent_variables`. As @lukasheinrich mentions, these are specific to the default table schema. Recall my comment when I opened this issue.

> The first step would be to extend the `hepdata-validator` package to validate against the custom JSON schema specified by the `data_schema` key instead of the default "table" schema.

This part is done, congratulations! Unfortunately, it's the easiest part.
> Appropriate visualization and conversion tools would need to be developed to render each "pyhf" JSON file directly in a web browser, instead of using the current tools for the "table" schema.
This part is still to do and is more difficult, requiring substantial parts of the HEPData code to be rewritten to support the pyhf format instead of the default "table" format. I think the next step would be for the pyhf authors to define a `likelihoods.json` (or `likelihood.json` if there will only be one likelihood per file?) schema corresponding to the format proposed by @lukasheinrich in his comment above, i.e. the workspace and patchset combined. Then the `submission.yaml` file would specify the likelihoods file as:

```yaml
data_file: histfactory_likelihoods.yaml
data_schema: https://diana-hep.org/pyhf/schemas/1.0.0/likelihoods.json
```
To proceed further, it would be good if the pyhf authors could (i) define `likelihoods.json`, (ii) provide an example submission, and (iii) explain how a file satisfying `likelihoods.json` should render in a browser, providing code if possible.
Oh I misunderstood then.
Aiming for a different visualization type seems to be the way to go.
Hi all... I will be giving a talk "at CERN" for PhyStat / CERN Data Science Seminar on Oct 14 on "PHYSTAT seminar: Likelihood publishing, RECAST, and simulation-based inference" https://indico.cern.ch/event/962997/
It would be great if I can plug this development
@Sinclert @kratsg

could just start w/ something trivial:

```html
<div>
  <h4>pyhf likelihood</h4>
  <div id="lhood"></div>
</div>
```

and

```js
$.getJSON('/pyhf_likelihood.json', function (d) {
  $('#lhood').text(
    `${d.channels.length} channels`
  );
});
```

and make it more complex later?
Let me start by saying that I think this integration is suffering deep communication problems, which have made this issue become the most commented issue (by far) of the whole HEPData ecosystem of packages.

It seems to me that this discussion has turned into some kind of messy drawer where different ideas are thrown around but no real decisions are being taken. I urge future responses to address the comments placed in the previous message so that the conversation can progress.
Let's focus on what Graeme said:
> To proceed further, it would be good if the pyhf authors could (i) define `likelihoods.json`, (ii) provide an example submission, and (iii) explain how a file satisfying `likelihoods.json` should render in a browser, providing code if possible. (reference)
@lukasheinrich @kratsg @matthewfeickert please 🙏🏻

- Define the `likelihood.json` / `likelihoods.json` schema.

@GraemeWatt please 🙏🏻

- Decide on `pyhf` as a direct dependency of the HEPData app?
- Decide on the `likelihoods` schema.

I think the problem with the latest `submission.yml` proposal is that defining a `likelihoods.json` schema, comprising both `workspace.json` and `patchset.json` inside, is way cheaper than rewriting the existing HEPData application to allow two or more JSON files per YAML entry.
As Graeme put it:
> Having two JSON files for each YAML document (a HEPData "table") of the submission.yaml file is completely incompatible with the existing HEPData structure. Modifying the hepdata-validator package is only the first step and the majority of work would be rewriting this main hepdata repository without breaking the existing functionality. (reference)
Yes, I am aware of the hacky proposal to address this. However, if I am not mistaken, this solution will still be incompatible.
Once the `likelihoods.json` schema is defined, the next step would involve creating a dummy submission file, pointing to a dummy likelihood, so that we can use that example as test data. This dummy `submission.yaml` will be used to test:
This is a decision that does not depend on previous steps to be taken. Assuming a `likelihood.json` schema is defined, containing both a workspace and a patchset inside: is there some simpler visualization I could aim for first? My initial thought was to implement a table, as it was referenced:
Finally, please check this document I did back in June, listing the possible alternatives I came up with at that time.
Should HEPData use `pyhf inspect` internally?

If you have checked the draft document I created back in June, you would notice that options 4, 5 and 6 are based on the use of the `pyhf inspect` command to infer a summarized data table out of some pyhf schemas.

If, and only if, a table is the most immediate goal for the visualization question: could HEPData declare `pyhf` as a dependency, to call it internally upon a `pyhf` schema-based submission?
As Giordon put it:
> For HEPData, I have a question for you. Is HEPData open to having pyhf installed to parse the background-only / patchsets (apply patches to the background-only as well) in order to get information out? We do have a pyhf inspect command that could provide information and, with the development of a patchset schema, will have support for handling patchsets in a user-friendly way. (reference)
My initial thought is that it should not, but I am not taking the decision here, @GraemeWatt is.
I hope decisions can be taken (single likelihood VS multiple likelihood, render style, ...) so that progress can be made.
My vision for this integration is for everyone involved to make a conscious decision of what they want, considering the development and maintenance cost of their request.
At this point, I think it is almost clear that no visualization implementation will reach production stage by October 14th @cranmer.
Finally, I would like to rescue my favourite quote from the thread ✨
> Please be more precise about exactly how you expect HEPData to natively support pyhf JSON files to avoid @Sinclert spending time on unnecessary software development. (reference)
Hi @Sinclert ,
sorry for the mess :-/
1 + 2: I did not quite understand what problems my proposal has, maybe I forgot? I would make a small change (allowing more than one patchset for a given base likelihood), but the overall structure seems sane to me?
3: I think a summary (Option A) is best as a first step. Ideally pyhf could provide an external JS package (say `pyhfparser.js`) that parses a likelihood and patchsets (@kratsg what do you think?) and spits out something like `pyhf inspect`, and on the hepdata side it's purely about visualization. I.e. instead of taking on a `pyhf` dependency (which maybe is too heavy), it would be a more limited dependency (either client-side as proposed here, or server-side, e.g. `import pyhfparser`).
I would propose to leave Option C for a later time once we have more experience. What I tried to convey in my last comment is that the summary does not need to replicate the full pyhf inspect
command rirght away, we can start with a subset of the information of inspect
Apologies for my part in the miscommunication. I put this issue on hold back in April and left many questions unanswered due to other priorities in the last few months. Having said this, some of my existing answers above seem to have been ignored by the pyhf authors. @Sinclert, your last comment above is a great summary! I agree with everything you said there. The misunderstanding probably reflects the fact that modifying HEPData to provide native support for pyhf files is actually very complicated. Commenting in a single GitHub issue is not the best forum for discussion. I've just sent an invitation to @lukasheinrich @kratsg @matthewfeickert @cranmer (using your CERN email addresses) to join the "pyhf" stream of a Zulip instance (similar to Slack) that I was previously using to chat to @Sinclert. I'm also happy to join a video call if you want to discuss how to proceed. I'll try to answer the outstanding questions raised above.
now that we've included $schema explicitly, should type: histfactory be dropped? is this perhaps a key/field that could be used to search for HEPData entries with a histfactory type?
Yes, the type can be dropped, since it can be inferred from the data_schema key (already answered in the second paragraph of this comment).
Is HEPData open to having pyhf installed to parse the background-only / patchsets (apply patches to the background-only as well) in order to get information out?
I'd prefer to avoid installing pyhf unless it's really necessary (already answered in the third paragraph of the same comment).
If we look at this URL: https://www.hepdata.net/record/data/90607/476075/1 I assume this is the original uploaded YAML from the submission?
No, the original YAML is processed into a JSON format suitable for rendering by the front-end JavaScript code. This processing is carried out every time a YAML data table is loaded, and it should be switched off for the pyhf format.
Is it possible now to get a badge added for the analyses which currently have (and will have) HistFactory likelihoods attached to the entries?
Yes, it's a separate issue (#163), but I haven't yet found time to work on the implementation.
1 + 2: I did not quite understand what problems my proposal has, maybe I forgot? I would make a small change (allowing more than one patchset for a given base likelihood, but the overall structure seems sane to me?)
Well, I didn't explicitly point out the problem yet, so let me try to explain now. Each HEPData table corresponds to two YAML documents: (i) one in submission.yaml containing metadata validated by submission_schema.json, and (ii) a YAML data table validated by data_schema.json. As explained in the opening comment of this issue, I want to leave submission_schema.json unchanged apart from introducing the new optional data_schema key, so that only the data_file and data_schema are different for a pyhf file compared to a normal HEPData table. In your proposal, @lukasheinrich, it seems you want to introduce a new submission schema for pyhf data files as well as the data schema. This would make things much more complicated; for example, we'd need to modify the definition of the DataSubmission in the database model. So I'd like you to leave the submission schema alone, and all the pyhf-specific parts should be given in the data_file, i.e.
data_file: pyhf_likelihood.yaml
data_schema: https://diana-hep.org/pyhf/schemas/1.0.0/likelihood.json
Here, you can use pyhf_likelihood.json instead of pyhf_likelihood.yaml if you prefer. The code should accept either YAML or JSON.
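Accepting either format can be simple in practice, because a JSON document is (with rare edge cases) also parseable as YAML. A minimal sketch of a hypothetical helper (not the actual hepdata-validator API), assuming PyYAML is available as it is in hepdata-validator:

```python
import yaml  # PyYAML, assumed available as in hepdata-validator

def load_data_document(text):
    """Parse a pyhf data file given as text. Works for both
    pyhf_likelihood.yaml and pyhf_likelihood.json, since JSON
    documents are in practice also valid YAML."""
    return yaml.safe_load(text)
```

Either spelling of the same document then yields the same Python dict, so downstream validation code does not need to care which was uploaded.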
Would HEPData use pyhf inspect internally?
No, I would prefer to avoid installing pyhf unless absolutely necessary. If we needed to run pyhf inspect on-the-fly every time a table is loaded, it could slow down the rendering. Instead, I think the output of pyhf inspect should be included in the submission by the Uploader, then the front-end HEPData code could render it.
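The Uploader-side step could be as light as precomputing a summary from the workspace JSON itself. A stdlib-only sketch (hypothetical helper name; the real pyhf inspect output is richer) of the kind of inspect.json an Uploader might attach:

```python
import json

def summarise_workspace(workspace):
    """Extract an inspect-like summary (channel, sample and measurement
    names) from a HistFactory workspace dict, so the HEPData front end
    could render it without importing pyhf at request time."""
    channels = workspace.get("channels", [])
    return {
        "channels": [c["name"] for c in channels],
        "samples": sorted({s["name"]
                           for c in channels
                           for s in c.get("samples", [])}),
        "measurements": [m["name"] for m in workspace.get("measurements", [])],
    }

# Example Uploader step: write the summary next to the workspace before upload.
# with open("inspect.json", "w") as f:
#     json.dump(summarise_workspace(workspace), f, indent=2)
```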
Putting my comments together, I guess the pyhf_likelihood.yaml file would look something like:
name: RegionA
workspace: {$ref: BkgOnly.json}
patchset: {$ref: patchset.json}
inspect: {$ref: inspect.json}
Here, I've removed likelihoods (assuming only one likelihood per table?), type: histfactory (redundant), name and description (since given in submission.yaml), data_file (redundant), and data_schema (since given in the https://scikit-hep.org/pyhf/schemas/1.0.0/likelihood.json schema that you will provide). This is a minimal example, but feel free to add other fields as you like, then provide the corresponding https://scikit-hep.org/pyhf/schemas/1.0.0/likelihood.json schema. I hope this makes sense! We can iterate in the Zulip chat if needed.
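For the {$ref: ...} entries in a file like the example above, a flat resolver would suffice, since subdirectories are not allowed inside a HEPData tarball. A hypothetical sketch (the actual resolution mechanism for these files is still to be decided):

```python
import json
from pathlib import Path

def resolve_refs(doc, base_dir):
    """Recursively replace {"$ref": "file.json"} mappings with the parsed
    contents of the referenced file, looked up in base_dir only (flat
    layout, matching the HEPData no-subdirectories rule)."""
    if isinstance(doc, dict):
        if set(doc) == {"$ref"}:
            return json.loads((Path(base_dir) / doc["$ref"]).read_text())
        return {key: resolve_refs(value, base_dir) for key, value in doc.items()}
    if isinstance(doc, list):
        return [resolve_refs(item, base_dir) for item in doc]
    return doc
```

After resolution, the workspace, patchset and inspect entries hold the full parsed JSON documents rather than file references.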
@lukasheinrich @kratsg @matthewfeickert @cranmer : I'm closing this issue now, as discussed in May 2022 on the HEPData Zulip chat and in person with Lukas. To summarise, the original idea of "native support" was too complicated to be workable. A simpler approach to provide visualization for HistFactory JSON files would be to have a separate web application, possibly based on Giordon's try-pyhf app, that could import pyhf resource files from HEPData. Later we could add a "Visualize" button on relevant HEPData records linking to the pyhf visualization web app and passing the download URL of the resource file. We can start a new GitHub issue for such an integration if a pyhf visualization web app is developed.
Related to this point, I made a comment in Zulip last year:
The 20 records in the analysis:HistFactory search attach the HistFactory JSON files in different formats, which will all need to be accommodated by the new pyhf visualization web app. It would be good if the pyhf docs could contain guidance for HEPData submitters on how to best format the HistFactory JSON files uploaded to HEPData (e.g. a single .tar.gz file comprising multiple pyhf JSON pallets with an explanatory README file). This information might be already provided internally within ATLAS, but it would be better to make it publicly available for use by other experiments. (There is a Belle II record in preparation with an attached single HistFactory JSON file.)
I've just added a pyhf section to the HEPData submission docs. If you want to edit this text, e.g. by linking to pyhf docs on formatting, please submit a PR to the hepdata-submission repository.
Update: An idea emerging from the FAIROS-HEP Kick-Off Workshop in February 2023 was that if Binder could import a pyhf resource file (including a Jupyter notebook) from HEPData, that could provide a simpler way of visualizing the pyhf statistical models than developing a dedicated pyhf visualization web app.
To provide native support of individual pyhf JSON files, the structure of a HEPData submission would be mostly the same, i.e. tied together by a submission.yaml file containing metadata. But instead of a data file validated by the usual "table" schema, the table metadata would also carry a data_schema key pointing to a custom pyhf JSON schema against which the data file is validated.
The first step would be to extend the hepdata-validator package to validate against the custom JSON schema specified by the data_schema key instead of the default "table" schema. Appropriate visualization and conversion tools would then need to be developed to render each "pyhf" JSON file directly in a web browser, instead of using the current tools for the "table" schema.
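A possible shape for that first step, sketched against the jsonschema and requests libraries rather than the actual DataFileValidator API; the fetch parameter is a hypothetical injection point so the schema can come from a remote URL in production or a stub in tests:

```python
import requests
from jsonschema import validate as json_validate

def validate_with_custom_schema(data, data_schema_url, fetch=None):
    """Validate a data document against the custom JSON schema named by
    the submission's data_schema key, instead of the default "table"
    schema. Raises jsonschema.ValidationError on failure."""
    fetch = fetch or (lambda url: requests.get(url, timeout=10).json())
    schema = fetch(data_schema_url)
    json_validate(instance=data, schema=schema)
    return True
```

This keeps the default "table" validation untouched: the custom path is only taken when a data_schema key is present.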
See also HEPData/hepdata_lib#98.
Cc: @lukasheinrich