GraemeWatt closed this issue 1 year ago.
From the hepdata-validator module's side of things: it looks like the `DataFileValidator` already has some mechanism for working with custom data schemas, but it is not complete. It has a method `load_custom_schema(self, type, schema_file_path=None)` which accepts an optional schema path (but would not work with a remote URL). In the `validate` method it also looks for a custom schema, using the `type` field in the data:

```python
if 'type' in data:
    custom_schema = self.load_custom_schema(data['type'])
    json_validate(data, custom_schema)
```

However, the schema file path argument is not used from the `validate` method. I think it would be fairly straightforward to modify the validator to use custom schemas properly, and from remote URLs, without affecting the default schemas.
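A minimal sketch of what remote-capable schema loading could look like (illustrative names only, not the actual hepdata-validator API):

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen

def load_schema(location):
    """Load a JSON schema from a local file path or a remote http(s) URL.

    Hypothetical helper for illustration; not part of hepdata-validator.
    """
    if urlparse(location).scheme in ("http", "https"):
        with urlopen(location) as resp:  # fetch the remote schema
            return json.load(resp)
    with open(location) as f:            # otherwise treat it as a local path
        return json.load(f)
```

The existing `load_custom_schema` method could then delegate to something like this when the schema location is a URL rather than a file path.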
Before starting to work on an implementation, it would be better if the JSON schema were stable on the pyhf side. For example, the URL for `workspace.json` given above is broken, as is the redirection to https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json (see scikit-hep/pyhf#686). The `patchset.json` schema should be developed if needed (see scikit-hep/pyhf#631), noting that subdirectories are not allowed inside a HEPData tarball.

The method of telling HEPData which schema and version is being used by the data file should be decided (see scikit-hep/pyhf#740). The current hepdata-validator code is written assuming that a custom schema `type` is given in the data file itself, but it might be better to specify this information in the `submission.yaml` file, as given in my original comment above.
Cc: @lukasheinrich
adding @kratsg and @matthewfeickert.
Yes, I think we'd be happy to finalize the schema to a good base version. I agree that having the patchset schema would be good, because then the likelihood publication may only include 2 files (instead of hundreds).

Maybe @alisonrclarke can provide some instructions on what you need from us. The basic functionality of adding support for external schemas is probably independent of what those schemas entail in detail?
for visualization there is some ongoing work e.g. here
https://scikit-hep.org/pyhf/examples/notebooks/altair.html
https://www.youtube.com/watch?v=9egt9ZTm7T0
probably something based on vega / altair is sufficient for us
> Before starting to work on an implementation, it would be better if the JSON schema were stable on the pyhf side. For example, the URL for `workspace.json` given above is broken, as is the redirection to https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json (see scikit-hep/pyhf#686). The `patchset.json` schema should be developed if needed (see scikit-hep/pyhf#631), noting that subdirectories are not allowed inside a HEPData tarball.
Stability is something I'm working on.
> The method of telling HEPData which schema and version is being used by the data file should be decided (see scikit-hep/pyhf#740). The current hepdata-validator code is written assuming that a custom schema `type` is given in the data file itself, but it might be better to specify this information in the `submission.yaml` file, as given in my original comment above.
Do you need a `type` option in the schema? We currently don't have it, but it's not a backwards-compatible change for existing things (e.g. we can assume the default type is HistFactory if not specified for v1.0 but...), and it's a little confusing to me that you need an explicit `type` in the JSON when you provide a `data_schema` in the YAML. Can you clarify the roles?
Hi @GraemeWatt and @alisonrclarke can we add @sinclert to the team?
I've sent an invitation to @Sinclert to join the @HEPData organization. Let me know if Write access is needed for individual repositories. You can contribute by forking repositories and making pull requests without having Write access.
The `type` field in the data schema was (partially) implemented by @eamonnmag via https://github.com/HEPData/hepdata-validator/commit/464862ed5e4f58fdafe4e01a210faf3a01dc399a back in 2016. The `data_schema` field in the `submission.yaml` file is not yet implemented, but it was my proposal for a better way of specifying a custom schema; the `type` field in the data schema would then indeed not be needed.
The idea of giving the `data_schema` as a URL was to avoid the need to either package the pyhf JSON schemas with the hepdata-validator or build in a requirement on the whole pyhf package. The hepdata-validator would download the `data_schema` when it was needed for validation. To proceed, it would be good if the URLs to the pyhf JSON schema could be fixed (scikit-hep/pyhf#686) and if the pyhf authors could provide an example HEPData submission. I'm assuming for now that one HEPData table corresponds to one pyhf JSON file specified by the `data_file` field in the `submission.yaml` file, rather than to a pyhf tarball containing many pyhf JSON files, but let me know if this assumption should be revised.
@GraemeWatt, @sinclert -- I have provided the fix and closed the relevant issue scikit-hep/pyhf#686 via scikit-hep/pyhf#791. I would be interested to see if the `data_schema` is useful now that we have the schemas accessible via https (e.g. defs.json).
Example `submission.yml` from the sbottom multi-b HEPData record below:
```yaml
additional_resources:
- {description: Web page with auxiliary material, location: 'https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SUSY-2018-31/'}
- {description: Truth code to compute acceptance for all signal regions using the
    SimpleAnalysis framework, location: Sbottom_MB2018.cxx}
- {description: Archive of full likelihoods in the HistFactory JSON format described
    in ATL-PHYS-PUB-2019-029. Provided are 3 statistical models labeled RegionA RegionB
    and RegionC respectively each in their own sub-directory. For each model the background-only
    model is found in the file named 'BkgOnly.json'. For each model a set of patches
    for various signal points is provided, location: HEPData_workspaces.tar.gz}
- {description: 'slha files for the 3 baseline signal points used in the analysis
    for regions A,B,C', location: SbMB_SLHAs.tar.gz}
comment: ''
data_license: {description: The content can be shared and adapted but you must give
    appropriate credit and cannot restrict access to others., name: cc-by-4.0, url: 'https://creativecommons.org/licenses/by/4.0/'}
hepdata_doi: 10.17182/hepdata.89408.v1
```
@kratsg: The example `submission.yaml` extract you gave above is the current situation where pyhf JSON files are attached as a tarball given as `additional_resources`. When I asked for an "example HEPData submission" above, I meant for the case after HEPData provides native support for individual pyhf JSON files. I imagine that you would still want to include the tarball as `additional_resources`, but then the `submission.yaml` file would contain additional YAML documents, after the normal data tables, specifying `data_file` as the pyhf JSON file and `data_schema` as the URL for the pyhf JSON schema, for example:

```yaml
data_file: RegionA-BkgOnly.json
data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
```
I've just merged @Sinclert's PR (https://github.com/HEPData/hepdata-validator/pull/17) which extends the hepdata-validator to download JSON schema from remote locations. I've provided an example submission.yaml file for use with one of the tests. Please take a look.
You don't provide a JSON schema for the signal patch files like `RegionA/patch.sbottom_1300_850_60.json`, so I'm assuming that you don't want to store these in HEPData directly? Are these files meaningful in isolation (e.g. could they be visualized) or are they only meaningful when combined with the background-only workspace? Is it enough for HEPData to only support the `workspace.json` schema, so the uploader would need to combine each signal patch file with the background-only workspace (using `jsonpatch`) before specifying it as a `data_file` in a YAML document of the `submission.yaml` file?
Would you want to include all signal patch files (combined with the background-only workspace) as separate YAML documents of the `submission.yaml` file, or just a subset of them? Including all of them would give an additional 218 (RegionA) + 219 (RegionB) + 216 (RegionC) = 653 HEPData tables (one for each pyhf JSON file) on top of the current 48 data tables. Last November @lukasheinrich proposed reducing the number of pyhf JSON files to only 6 by defining a patch set (https://github.com/scikit-hep/pyhf/issues/631), but this seems to still be an open issue. Is this still something you're planning to complete? Would you provide a JSON schema for the patch set? Could the patch set be visualized in isolation from the background-only workspace? How would you expect the `submission.yaml` file to look if you adopted this approach?
Please be more precise about exactly how you expect HEPData to natively support pyhf JSON files, to avoid @Sinclert spending time on unnecessary software development. Please look at the recent changes to the hepdata-validator and let us know if further changes are necessary before I make a new release. Please provide an example HEPData submission that we can use for testing. Thank you in advance for your cooperation.
@GraemeWatt , thanks for the comments. I wasn't aware you wanted an example that was more "native". I need to think more about it.
> You don't provide a JSON schema for the signal patch files like `RegionA/patch.sbottom_1300_850_60.json`, so I'm assuming that you don't want to store these in HEPData directly? Are these files meaningful in isolation (e.g. could they be visualized) or are they only meaningful when combined with the background-only workspace? Is it enough for HEPData to only support the `workspace.json` schema, so the uploader would need to combine each signal patch file with the background-only workspace (using `jsonpatch`) before specifying it as a `data_file` in a YAML document of the `submission.yaml` file?
The reason there is no JSON schema is that they follow the jsonpatch format (see RFC 6902). The `BkgOnly.json` should always validate against `workspace.json`. The patch files describe how to modify the `BkgOnly.json` to inject a signal.
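To illustrate the mechanics, here is a hedged sketch of patch application covering only the "add" and "replace" operations (a real workflow would use the `jsonpatch` package, which implements the full RFC 6902 spec; the toy workspace below is illustrative, not a complete pyhf workspace):

```python
import copy

def apply_patch(doc, patch):
    """Apply a minimal subset of RFC 6902 ("add"/"replace") to a JSON document."""
    doc = copy.deepcopy(doc)  # never mutate the background-only workspace
    for op in patch:
        parts = op["path"].lstrip("/").split("/")
        target = doc
        for key in parts[:-1]:  # walk down to the parent of the target node
            target = target[int(key)] if isinstance(target, list) else target[key]
        last = int(parts[-1]) if isinstance(target, list) else parts[-1]
        if op["op"] == "add" and isinstance(target, list):
            target.insert(last, op["value"])  # "add" inserts into arrays
        elif op["op"] in ("add", "replace"):
            target[last] = op["value"]
        else:
            raise NotImplementedError(op["op"])
    return doc

# Toy background-only workspace and a toy signal patch (illustrative names)
workspace = {"channels": [{"name": "SR", "samples": []}]}
patch = [{"op": "add", "path": "/channels/0/samples/0",
          "value": {"name": "signal", "data": [5.0]}}]
patched = apply_patch(workspace, patch)
```

The patched document is what would then be validated against `workspace.json` and specified as a `data_file`.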
> Would you want to include all signal patch files (combined with the background-only workspace) as separate YAML documents of the `submission.yaml` file or just a subset of them?
We want them all to be available for analysis. Is it possible to include all in an upload, but only display "tables" for a subset of them?
> Last November @lukasheinrich proposed reducing the number of pyhf JSON files to only 6 by defining a patch set (scikit-hep/pyhf#631), but this seems to still be an open issue. Is this still something you're planning to complete? Would you provide a JSON schema for the patch set? Could the patch set be visualized in isolation from the background-only workspace? How would you expect the `submission.yaml` file to look if you adopted this approach?
This still won't solve the problem since, as you mentioned, these patch files would just be usurped or merged into a single one containing multiple sets of patches, where you choose which patch to apply to the `BkgOnly.json`. Either way, we would not want users to combine every single signal point before uploading.
> Please be more precise about exactly how you expect HEPData to natively support pyhf JSON files, to avoid @Sinclert spending time on unnecessary software development. Please look at the recent changes to the hepdata-validator and let us know if further changes are necessary before I make a new release. Please provide an example HEPData submission that we can use for testing. Thank you in advance for your cooperation.
Ok, I will get back to you on this.
Hi @GraemeWatt , to answer a lot of questions, I'm copying/pasting a proposal here that we came up with.
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
A few key points:

- `data_file` entries let HEPData resolve them and inline as this is parsed
- `data_schema` will point to the schema that the provided `data_file` should be validated against
- `name` and `description` are generally useful

Question:

- now that we've included `$schema` explicitly, should `type: histfactory` be dropped? is this perhaps a key/field that could be used to search for hepdata entries with a `histfactory` type?

For HEPData, I have a question for you. Is HEPData open to having pyhf installed to parse the background-only / patchsets (apply patches to the background-only as well) in order to get information out? We do have a `pyhf inspect` command that could provide information and, with the development of a patchset schema, will have support for handling patchsets in a user-friendly way.
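For a sense of what such parsing involves, a summary in the spirit of `pyhf inspect` can be computed from the workspace JSON structure alone, without a pyhf dependency. This is a hedged sketch with a toy workspace, not pyhf's actual output format:

```python
def summarize_workspace(ws):
    """Count channels, samples and measurements in a HistFactory-style workspace dict.

    Illustrative sketch only; `pyhf inspect` reports more (parameters, modifiers, etc.).
    """
    channels = ws.get("channels", [])
    sample_names = {s["name"] for ch in channels for s in ch.get("samples", [])}
    return {
        "channels": [ch["name"] for ch in channels],
        "n_samples": len(sample_names),
        "n_measurements": len(ws.get("measurements", [])),
    }

# Toy workspace (illustrative, not a full pyhf workspace)
ws = {"channels": [{"name": "SR", "samples": [{"name": "bkg"}, {"name": "sig"}]}],
      "measurements": [{"name": "meas"}]}
summary = summarize_workspace(ws)
```

Whether HEPData would do this itself or call out to pyhf is exactly the open question here.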
Dear @GraemeWatt , we have developed a patchset schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json .
Dear @GraemeWatt ,
Following Giordon converging on workspace schemas, one of our analyses in the ATLAS SUSY group (EWK 1L, already with v1 on HEPData: https://www.hepdata.net/record/ins1755298. Many thanks to Danika MacDonell for doing so) has prepared all the necessary files for the record upgrade with the new likelihood YAML format exploited.

~~We are now eager to upload and release the new format in HEPData, but at the moment the website crashes when the new format is being used.~~ see comment below

In case it helps, you can find a streamlined version of the submission tarball that we use here: https://cernbox.cern.ch/index.php/s/7lcq4OJpM9I9dz1. This contains a new 'likelihoods' data file that links to the pyhf background-only workspace as well as to the signal patches. The entire submission tarball was validated with the latest hepdata-validator (v0.2.2) and a modified check.py script from HEPData attached here in a zip file: ~~modified_check_py.zip~~ updated_check.py

~~When uploading the tarball, the error `InternalServerError: 500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.` is displayed in the HEPData sandbox.~~ see comment below
Please let us know if we can assist you with anything from our side.
Thanks, Krzysztof (as ATLAS SUSY HEPData coordinator)
Hi @butsuri43 ,
As far as I know, the newest version of `hepdata-validator` (v0.2.2) has not yet been included in the HEPData API.
probably we'd like to use the patchset schema as well for this upload @kratsg @butsuri43 ?
I thought we did use it. One thing that I realised now is that I diverged from the YAML format proposed by Giordon. Will fix that.
@lukasheinrich it uses the patchset:

```yaml
likelihoods:
  type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
Note that `data_file` and `data_schema` are what @GraemeWatt provided, so this is what we wanted. HEPData just needs to support this structure and we would be ok.
@danika -- one thing we need to make sure is to provide the appropriate region name (there's only one region - so i don't know what you would call it...)
@kratsg should this be a list? (note the dash on line 2 I added)
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: BkgOnly.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: patchset.json
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
Yes it should be, good catch! @butsuri43 - can you check the format? It should be what @lukasheinrich just put above.
Hi all,
Thanks Giordon and Lukas for checking the files. Indeed, I have realized that the format was wrong and on top of that the check.py file modifications I made were rubbish as well. I have now updated the tarball (reachable by the same link, here: https://cernbox.cern.ch/index.php/s/7lcq4OJpM9I9dz1) and the check.py (https://cernbox.cern.ch/index.php/s/mDzEIeQ89IiP1Iv) file.
More importantly, now, when uploading the tarball, the HEPData website complains with the following error message: `u'independent_variables' is a required property in {'likelihoods': (...)`.
Now, this is a more understandable error to me, and it basically points to the fact that what we have in the 'likelihoodDescription.yaml' file is of the type `likelihoods`, which is not currently covered by the hepdata-validator schemas (not even in 0.2.2). At least this is my understanding, so a new schema needs to be cooked up.
Dear Krzysztof (@butsuri43),
Thanks for the messages. I've just made a commit (https://github.com/HEPData/hepdata-validator/commit/d8f6792ec21d05898bd580e751c5989994b8100b) to the validator code to display a more user-friendly error message (`There is no schema defined for the 'histfactory' data type.`) instead of the generic 500 exception. The latest released (and deployed) version of the `hepdata-validator` is 0.2.1. Subsequent changes have been made in the `master` branch of the GitHub repository, but 0.2.2 has not yet been released. In any case, the latest changes don't help solve your problems.
The schema introduced above by @kratsg is just a proposal that I haven't even commented on yet, let alone started to work on a code implementation. It's completely different from what I originally anticipated at the top of this issue and from what @Sinclert already implemented for a potential 0.2.2 release of the `hepdata-validator` package. Having two JSON files for each YAML document (a HEPData "table") of the `submission.yaml` file is completely incompatible with the existing HEPData structure. Modifying the `hepdata-validator` package is only the first step, and the majority of the work would be rewriting this main `hepdata` repository without breaking the existing functionality. I don't see that the amount of software development needed is justified by the benefits (what are they exactly?) of including the pyhf JSON files natively rather than as a tarball attachment. I intended to discuss this issue further with the pyhf developers, but they should expect to substantially contribute to code development if they really want this feature. Unfortunately, you should not expect to be able to upload individual pyhf JSON files to HEPData in the near future.
This issue is a low priority for me and other tasks will take precedence in the next few weeks, such as migrating to new infrastructure at CERN next week, then migrating from Python 2 to Python 3. But perhaps in a month or so, we can discuss if it's worthwhile to proceed further with this issue.
Best regards, Graeme Watt (HEPData Project Manager)
> But perhaps in a month or so, we can discuss if it's worthwhile to proceed further with this issue.
Sounds good. We can just follow up here then.
Hi @GraemeWatt
thanks for still considering this. I think we are happy to assist where we can. As you can see
https://github.com/scikit-hep/pyhf/issues/620#issuecomment-624638428
this is being picked up by theorists, so I think effort put into this will not go in vain (I wholeheartedly support prioritizing py2->py3 first though)
I looked at the JSON record of one of the hepdata entries e.g.
https://www.hepdata.net/record/ins1755298?format=json
and see this section
```json
{
  "data": {
    "csv": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/csv",
    "json": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/json",
    "root": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/root",
    "yaml": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/yaml",
    "yoda": "https://www.hepdata.net/download/table/ins1755298/Acc_table_SR_LM_High_MCT/yoda"
  },
  "description": "Signal acceptance in SR-LM high $m_{CT}$ for simplified models with $\\tilde\\chi^\\pm_1 \\tilde\\chi^0_2 \\rightarrow Wh\\tilde\\chi^0_1\\tilde\\chi^0_1, W \\rightarrow l\\nu, h \\rightarrow b\\bar{b}$...",
  "doi": "10.17182/hepdata.90607.v1/t21",
  "id": 476075,
  "location": "Tabulated data from the publication's auxiliary material Figure 4",
  "messages": false,
  "name": "Acc_table_SR_LM_High_MCT",
  "processed_name": "Acc_table_SR_LM_High_MCT"
}
```
How does this work internally? If we look at this URL: https://www.hepdata.net/record/data/90607/476075/1, I assume this is the original uploaded YAML from the submission?
I think an ideal situation would be if we could get this to show up as

```json
{
  "data": {
    "json": "https://www.hepdata.net/download/table/ins1755298/MyLikelihood/json"
  },
  "name": "MyLikelihood"
}
```
and https://www.hepdata.net/download/table/ins1755298/MyLikelihood/json would resolve to the likelihood JSON.

My understanding is that for this to happen you need the likelihood as part of the overall JSON of the record. This could be easily achieved using https://json-spec.readthedocs.io/reference.html by slightly modifying the record (see the `{$ref: file.json}` syntax):
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: {$ref: BkgOnly.json}
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: {$ref: BkgOnly.json}
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
If you use this with the hepdata-validator, all these files could be resolved to inline all the JSON:
```yaml
likelihoods:
- type: histfactory
  name: RegionA
  workspace:
    name: background-only
    description: background-only workspace
    data_file: {.. big chunk of JSON ... }
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json
  patchset:
    name: patchsets
    description: sets of signal patches for the background-only workspace
    data_file: {.. big chunk of JSON ... }
    data_schema: https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json
```
and you'd never even notice it was originally in two files.
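A sketch of that inlining step, assuming local files only and illustrative names (in practice a JSON-reference resolver library such as json-spec would be used):

```python
import json
from pathlib import Path

def inline_refs(node, base_dir):
    """Replace {"$ref": "file.json"} nodes with the parsed file contents, recursively.

    Illustrative sketch; handles only whole-file local refs, not fragments or URLs.
    """
    if isinstance(node, dict):
        if set(node) == {"$ref"}:  # a pure reference node: load and recurse
            with open(Path(base_dir) / node["$ref"]) as f:
                return inline_refs(json.load(f), base_dir)
        return {k: inline_refs(v, base_dir) for k, v in node.items()}
    if isinstance(node, list):
        return [inline_refs(v, base_dir) for v in node]
    return node  # scalars pass through unchanged
```

After one pass over the parsed submission, the record is a single JSON document with the workspace and patchset inlined.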
We use the same strategy in yadage (part of RECAST)
https://github.com/yadage/yadage/blob/master/tests/testspecs/mapreduce/workflow.yml
where after processing the files steps.yml and workflow.yml get fused into a big JSON doc
When viewing the JSON record again, I think it'd be nice to (like in the tables) then provide a separate URL for that section of the record.
The benefits are the same as with not storing the tables themselves as "additional resources", e.g. you can visualize them better, etc.
Hi Graeme,
Thank you very much for your response. I completely understand and agree with you that this is not a high-priority item for the HEPData team. I wrote merely to 1) express our interest, as there are several analyses actively working on providing pyhf workspaces, and 2) stir up the discussion a bit to understand what is needed from our side / what would be the best format to minimize the effort required from the HEPData team. I am happy to get back to this whenever convenient for you/the team. Cheers, Krzysztof
Dear @GraemeWatt - independent of which solution we end up going with -- is it possible now to get a badge added for the analyses which currently have (and will have) HistFactory likelihoods attached to the entries? I think this is our first concern when trying to get HEPData ready to support likelihoods natively. This would massively help for identifying/searching already. We can reiterate more on the native likelihood support afterwards.
I think there is some prior art from adding Rivet badges. @GraemeWatt does it still need an external JSON file to identify the records w/ Rivet analyses, or is this now handled internally?
Small update:
We just merged the PR that allows HEPData users to easily define which remotely-defined JSON schemas are accepted from a validation point of view.

Now, from a rendering point of view, there is a not-so-clear requirement (explained by Graeme in this comment), where accepted schemas also need to be renderable, by including the following fields in their definition:

- `independent_variables`
- `dependent_variables`

Therefore, the current list of remotely-defined accepted schemas looks like this: https://github.com/HEPData/hepdata/blob/e75194b366faf166d6a94eaf177ebd6faa52dce6/hepdata/modules/records/utils/validators.py#L37-L54
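Conceptually, the acceptance check amounts to an allow-list lookup. The sketch below is illustrative only (the real list and check live in `hepdata/modules/records/utils/validators.py`, linked above):

```python
# Illustrative allow-list of remotely-defined schemas accepted for validation;
# the entries mirror the pyhf schema URLs discussed in this thread.
ACCEPTED_REMOTE_SCHEMAS = {
    "https://scikit-hep.org/pyhf/schemas/1.0.0/workspace.json",
    "https://scikit-hep.org/pyhf/schemas/1.0.0/patchset.json",
}

def is_accepted_schema(url):
    """Return True if a submission's data_schema URL is on the allow-list."""
    return url in ACCEPTED_REMOTE_SCHEMAS
```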
thanks @Sinclert - I think one would need to encapsulate this then. The `dependent_variables`, `independent_variables` are specific to the table schema, and external data types probably won't be able to guarantee semantically useful mappings to these keys.
So if the rendering could somehow be modularized such that
table -> render with dependent_variables, independent_variables
lhood -> "render using some lhood-specific data"
patchset -> "render using some patchset-specific data"
@Sinclert, to clarify, I wasn't proposing that the pyhf format should be put into `independent_variables` and `dependent_variables`. As @lukasheinrich mentions, these are specific to the default table schema. Recall my comment when I opened this issue.

> The first step would be to extend the `hepdata-validator` package to validate against the custom JSON schema specified by the `data_schema` key instead of the default "table" schema.

This part is done, congratulations! Unfortunately, it's the easiest part.
> Appropriate visualization and conversion tools would need to be developed to render each "pyhf" JSON file directly in a web browser, instead of using the current tools for the "table" schema.
This part is still to do and is more difficult, requiring substantial parts of the HEPData code to be rewritten to support the pyhf format instead of the default "table" format. I think the next step would be for the pyhf authors to define a `likelihoods.json` (or `likelihood.json` if there will only be one likelihood per file?) schema corresponding to the format proposed by @lukasheinrich in his comment above, i.e. the workspace and patchset combined. Then the `submission.yaml` file would specify the likelihoods file as:

```yaml
data_file: histfactory_likelihoods.yaml
data_schema: https://diana-hep.org/pyhf/schemas/1.0.0/likelihoods.json
```
To proceed further, it would be good if the pyhf authors could (i) define `likelihoods.json`, (ii) provide an example submission, and (iii) explain how a file satisfying `likelihoods.json` should render in a browser, providing code if possible.
Oh I misunderstood then.
Aiming for a different visualization type seems to be the way to go.
Hi all... I will be giving a talk "at CERN" for PhyStat / CERN Data Science Seminar on Oct 14 on "PHYSTAT seminar: Likelihood publishing, RECAST, and simulation-based inference" https://indico.cern.ch/event/962997/
It would be great if I can plug this development
@Sinclert @kratsg

could just start w/ something trivial:

```html
<div>
  <h4>pyhf likelihood</h4>
  <div id="lhood"></div>
</div>
```

and

```js
$.getJSON('/pyhf_likelihood.json', function (d) {
  $('#lhood').text(
    `${d.channels.length} channels`
  );
});
```

and make it more complex later?
Let me start by saying that I think this integration is suffering deep communication problems, which have made this issue become the most commented issue (by far) of the whole HEPData ecosystem of packages.

It seems to me that this discussion has turned into some kind of messy drawer where different ideas are thrown around but no real decisions are being taken. I urge future responses to address the comments placed in the previous message so that the conversation can progress.
Let's focus on what Graeme said:
> To proceed further, it would be good if the pyhf authors could (i) define `likelihoods.json`, (ii) provide an example submission, and (iii) explain how a file satisfying `likelihoods.json` should render in a browser, providing code if possible. (reference)
@lukasheinrich @kratsg @matthewfeickert please 🙏🏻

- Define the `likelihood.json` / `likelihoods.json` schema.

@GraemeWatt please 🙏🏻

- Decide on `pyhf` as a direct dependency of the HEPData app?
- Decide on the `likelihoods` schema.

I think the problem with the latest `submission.yml` proposal is that defining a `likelihoods.json` schema, comprising both `workspace.json` and `patchset.json` inside, is way cheaper than rewriting the existing HEPData application to allow two or more JSON files per YAML entry.
As Graeme put it:
> Having two JSON files for each YAML document (a HEPData "table") of the submission.yaml file is completely incompatible with the existing HEPData structure. Modifying the hepdata-validator package is only the first step and the majority of work would be rewriting this main hepdata repository without breaking the existing functionality. (reference)
Yes, I am aware of the hacky proposal to address this. However, if I am not mistaken, this solution will still be incompatible.
Once the `likelihoods.json` schema is defined, the next step would involve creating a dummy submission file, pointing to a dummy likelihood, so that we can use that example as test data. This dummy `submission.yaml` will be used to test:
This is a decision that does not depend on previous steps to be taken. Assuming a `likelihood.json` schema is defined, containing both a workspace and a patchset inside: is there some simpler visualization I could aim for first? My initial thought was to implement a table, as it was referenced:
Finally, please check this document I did back in June, listing the possible alternatives I came up with at that time.
Should HEPData use `pyhf inspect` internally?

If you have checked the draft document I created back in June, you would notice that options 4, 5 and 6 are based on the use of the `pyhf inspect` command to infer a summarized data table out of some pyhf schemas.

If, and only if, a table is the most immediate goal for the visualization question: could HEPData declare `pyhf` as a dependency, to call it internally upon a `pyhf` schema-based submission?
As Giordon put it:
> For HEPData, I have a question for you. Is HEPData open to having pyhf installed to parse the background-only / patchsets (apply patches to the background-only as well) in order to get information out? We do have a pyhf inspect command that could provide information and, with the development of a patchset schema, will have support for handling patchsets in a user-friendly way. (reference)
My initial thought is that it should not, but I am not taking the decision here, @GraemeWatt is.
I hope decisions can be taken (single likelihood VS multiple likelihood, render style, ...) so that progress can be made.
My vision for this integration is for everyone involved to make a conscious decision of what they want, considering the development and maintenance cost of their request.
At this point, I think it is almost clear that no visualization implementation will reach production stage by October 14th @cranmer.
Finally, I would like to rescue my favourite quote from the thread ✨
> Please be more precise about exactly how you expect HEPData to natively support pyhf JSON files to avoid @Sinclert spending time on unnecessary software development. (reference)
Hi @Sinclert ,
sorry for the mess :-/
1 + 2: I did not quite understand what problems my proposal has, maybe I forgot? I would make a small change (allowing more than one patchset for a given base likelihood), but the overall structure seems sane to me?
3: I think a summary (Option A) is best as a first step. Ideally pyhf could provide an external JS package (say `pyhfparser.js`) that parses a likelihood and patchsets (@kratsg what do you think?) and spits out something like `pyhf inspect`, and on the hepdata side it's purely about visualization. I.e. instead of taking on a `pyhf` dependency (which maybe is too heavy), it would be a more limited dependency (either client-side as proposed here, or server-side, e.g. `import pyhfparser`).
I would propose to leave Option C for a later time once we have more experience. What I tried to convey in my last comment is that the summary does not need to replicate the full pyhf inspect
command rirght away, we can start with a subset of the information of inspect
Apologies for my part in the miscommunication. I put this issue on hold back in April and left many questions unanswered due to other priorities in the last few months. Having said this, some of my existing answers above seem to have been ignored by the pyhf authors. @Sinclert, your last comment above is a great summary! I agree with everything you said there. The misunderstanding probably reflects the fact that modifying HEPData to provide native support for pyhf files is actually very complicated. Commenting in a single GitHub issue is not the best forum for discussion. I've just sent an invitation to @lukasheinrich @kratsg @matthewfeickert @cranmer (using your CERN email addresses) to join the "pyhf" stream of a Zulip instance (similar to Slack) that I was previously using to chat to @Sinclert. I'm also happy to join a video call if you want to discuss how to proceed. I'll try to answer the outstanding questions raised above.
now that we've included $schema explicitly, should type: histfactory be dropped? is this perhaps a key/field that could be used to search for HEPData entries with a histfactory type?
Yes, the type can be dropped, since it can be inferred from the data_schema key (already answered in the second paragraph of this comment).
Is HEPData open to having pyhf installed to parse the background-only / patchsets (apply patches to the background-only as well) in order to get information out?
I'd prefer to avoid installing pyhf unless it's really necessary (already answered in the third paragraph of the same comment).
If we look at this URL: https://www.hepdata.net/record/data/90607/476075/1 I assume this is the original uploaded YAML from the submission?
No, the original YAML is processed into a JSON format suitable for rendering by the front-end JavaScript code. This processing is carried out every time a YAML data table is loaded, and it should be switched off for the pyhf format.
Is it possible now to get a badge added for the analyses which currently have (and will have) HistFactory likelihoods attached to the entries?
Yes, it's a separate issue (#163), but I haven't yet found time to work on the implementation.
1 + 2: I did not quite understand what problems my proposal has, maybe I forgot? I would make a small change (allowing more than one patchset for a given base likelihood, but the overall structure seems sane to me?)
Well, I didn't explicitly point out the problem yet, so let me try to explain now. Each HEPData table corresponds to two YAML documents: (i) one in submission.yaml containing metadata validated by submission_schema.json, and (ii) a YAML data table validated by data_schema.json. As explained in the opening comment of this issue, I want to leave submission_schema.json unchanged apart from introducing the new optional data_schema key, so that only the data_file and data_schema are different for a pyhf file compared to a normal HEPData table. In your proposal, @lukasheinrich, it seems you want to introduce a new submission schema for pyhf data files as well as the data schema. This would make things much more complicated; for example, we'd need to modify the definition of the DataSubmission in the database model. So I'd like you to leave the submission schema alone, and all the pyhf-specific parts should be given in the data_file, i.e.
data_file: pyhf_likelihood.yaml
data_schema: https://diana-hep.org/pyhf/schemas/1.0.0/likelihood.json
Here, you can use pyhf_likelihood.json instead of pyhf_likelihood.yaml if you prefer. The code should accept either YAML or JSON.
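Accepting either format can be simple in practice, because a JSON document is (with rare edge cases) also parseable as YAML. A minimal sketch of a hypothetical helper (not the actual hepdata-validator API), assuming PyYAML is available as it is in hepdata-validator:

```python
import yaml  # PyYAML, assumed available as in hepdata-validator

def load_data_document(text):
    """Parse a pyhf data file given as text. Works for both
    pyhf_likelihood.yaml and pyhf_likelihood.json, since JSON
    documents are in practice also valid YAML."""
    return yaml.safe_load(text)
```

Either spelling of the same document then yields the same Python dict, so downstream validation code does not need to care which was uploaded.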
Would HEPData use pyhf inspect internally?
No, I would prefer to avoid installing pyhf unless absolutely necessary. If we needed to run pyhf inspect on-the-fly every time a table is loaded, it could slow down the rendering. Instead, I think the output of pyhf inspect should be included in the submission by the Uploader, then the front-end HEPData code could render it.
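The Uploader-side step could be as light as precomputing a summary from the workspace JSON itself. A stdlib-only sketch (hypothetical helper name; the real pyhf inspect output is richer) of the kind of inspect.json an Uploader might attach:

```python
import json

def summarise_workspace(workspace):
    """Extract an inspect-like summary (channel, sample and measurement
    names) from a HistFactory workspace dict, so the HEPData front end
    could render it without importing pyhf at request time."""
    channels = workspace.get("channels", [])
    return {
        "channels": [c["name"] for c in channels],
        "samples": sorted({s["name"]
                           for c in channels
                           for s in c.get("samples", [])}),
        "measurements": [m["name"] for m in workspace.get("measurements", [])],
    }

# Example Uploader step: write the summary next to the workspace before upload.
# with open("inspect.json", "w") as f:
#     json.dump(summarise_workspace(workspace), f, indent=2)
```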
Putting my comments together, I guess the pyhf_likelihood.yaml file would look something like:
name: RegionA
workspace: {$ref: BkgOnly.json}
patchset: {$ref: patchset.json}
inspect: {$ref: inspect.json}
Here, I've removed likelihoods (assuming only one likelihood per table?), type: histfactory (redundant), name and description (since given in submission.yaml), data_file (redundant), and data_schema (since given in the https://scikit-hep.org/pyhf/schemas/1.0.0/likelihood.json schema that you will provide). This is a minimal example, but feel free to add other fields as you like, then provide the corresponding https://scikit-hep.org/pyhf/schemas/1.0.0/likelihood.json schema. I hope this makes sense! We can iterate in the Zulip chat if needed.
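For the {$ref: ...} entries in a file like the example above, a flat resolver would suffice, since subdirectories are not allowed inside a HEPData tarball. A hypothetical sketch (the actual resolution mechanism for these files is still to be decided):

```python
import json
from pathlib import Path

def resolve_refs(doc, base_dir):
    """Recursively replace {"$ref": "file.json"} mappings with the parsed
    contents of the referenced file, looked up in base_dir only (flat
    layout, matching the HEPData no-subdirectories rule)."""
    if isinstance(doc, dict):
        if set(doc) == {"$ref"}:
            return json.loads((Path(base_dir) / doc["$ref"]).read_text())
        return {key: resolve_refs(value, base_dir) for key, value in doc.items()}
    if isinstance(doc, list):
        return [resolve_refs(item, base_dir) for item in doc]
    return doc
```

After resolution, the workspace, patchset and inspect entries hold the full parsed JSON documents rather than file references.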
@lukasheinrich @kratsg @matthewfeickert @cranmer : I'm closing this issue now, as discussed in May 2022 on the HEPData Zulip chat and in person with Lukas. To summarise, the original idea of "native support" was too complicated to be workable. A simpler approach to provide visualization for HistFactory JSON files would be to have a separate web application, possibly based on Giordon's try-pyhf app, that could import pyhf resource files from HEPData. Later we could add a "Visualize" button on relevant HEPData records linking to the pyhf visualization web app and passing the download URL of the resource file. We can start a new GitHub issue for such an integration if a pyhf visualization web app is developed.
Related to this point, I made a comment in Zulip last year:
The 20 records in the analysis:HistFactory search attach the HistFactory JSON files in different formats, which will all need to be accommodated by the new pyhf visualization web app. It would be good if the pyhf docs could contain guidance for HEPData submitters on how to best format the HistFactory JSON files uploaded to HEPData (e.g. a single .tar.gz file comprising multiple pyhf JSON pallets with an explanatory README file). This information might be already provided internally within ATLAS, but it would be better to make it publicly available for use by other experiments. (There is a Belle II record in preparation with an attached single HistFactory JSON file.)
I've just added a pyhf section to the HEPData submission docs. If you want to edit this text, e.g. by linking to pyhf docs on formatting, please submit a PR to the hepdata-submission repository.
Update: An idea emerging from the FAIROS-HEP Kick-Off Workshop in February 2023 was that if Binder could import a pyhf resource file (including a Jupyter notebook) from HEPData, that could provide a simpler way of visualizing the pyhf statistical models than developing a dedicated pyhf visualization web app.
To provide native support of individual pyhf JSON files, the structure of a HEPData submission would be mostly the same, i.e. tied together by a submission.yaml file containing metadata. But instead of a data file validated by the usual "table" schema, the table metadata would also carry a data_schema key pointing to a custom pyhf JSON schema against which the data file is validated.
The first step would be to extend the hepdata-validator package to validate against the custom JSON schema specified by the data_schema key instead of the default "table" schema. Appropriate visualization and conversion tools would then need to be developed to render each "pyhf" JSON file directly in a web browser, instead of using the current tools for the "table" schema.
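A possible shape for that first step, sketched against the jsonschema and requests libraries rather than the actual DataFileValidator API; the fetch parameter is a hypothetical injection point so the schema can come from a remote URL in production or a stub in tests:

```python
import requests
from jsonschema import validate as json_validate

def validate_with_custom_schema(data, data_schema_url, fetch=None):
    """Validate a data document against the custom JSON schema named by
    the submission's data_schema key, instead of the default "table"
    schema. Raises jsonschema.ValidationError on failure."""
    fetch = fetch or (lambda url: requests.get(url, timeout=10).json())
    schema = fetch(data_schema_url)
    json_validate(instance=data, schema=schema)
    return True
```

This keeps the default "table" validation untouched: the custom path is only taken when a data_schema key is present.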
See also HEPData/hepdata_lib#98.
Cc: @lukasheinrich