Closed glevava closed 2 years ago
Hi @glevava,
I've tested this with CMOR v3.6.1 and I'm seeing segfaults if there is more than one entry in the list (I thought I'd tested this before the change to the CVs file was merged). We have tried this with the one of the testing builds on the PCMDI conda channel, cmor=3.6.1.2022.03.14.18.32.ga7f960d with python 3.8 (see here), and it worked fine, but we haven't been able to understand the change that leads to this behaving itself.
CMOR v3.7.0 is fairly close as I understand it -- would updating to use this, once released, be an option? (I appreciate that this will probably require updates across all active ESGF nodes).
I'm not sure this is due to the number of entries in the list as I have now two items in the list and it works like a charm. Anyway, I tried to install the latest CMOR stable release but I gave up because we are at the end of CMIP6, security policy of our HPC don't provide internet access (so I have to manually transfer all CMOR binaries and dependencies) and finally it appeared to me modifying the JSON was a simpler workaround ; )
In the future (CMIP7?), we will change our workflow and the new release of CMOR could be an option yes. Nevertheless, I must admit we didn't use CMOR at IPSL nor at CNRM, our GCMs directly produce standardized CMIP data. So we only need CMOR to run PrePARE which is quite overkill. It would be appreciated to have PrePARE as standalone tool maybe in full Python. This would be also in line with a controlled vocabulary service we are thinking about in the ESGF in order to check and provide metadata at the different steps of the publication process.
@glevava just an FYI, we (@matthew-mizielinski, @sashakames, @taylor13) have started thinking about how to cleanup what we currently have so to move forward in a more modular (and non-duplicated) way, a sketch of the potential path can be found in https://github.com/WCRP-CMIP/CMIP6Plus_CVs/issues/1#issuecomment-1182574956 - happy to hear your views on this (and ask away if the vagueness of the info requires more detail)
@durack1 The link in your post above doesn't work.
@mauzey1 thanks, apologies this is a currently private repo, will add you and @glevava
I've just had a slightly deeper dig into this as I think it is affecting ESGF publication at CEDA too, and I may have a possible mechanism for why this is failing at v3.6.1 (not sure why it is working with the nightly builds, unless there is a change in one of the limits);
I've had a look at the way in which PrePARE is failing and by trial and error found that if I shorten the license strings such that the total number of characters across all four entries in the list is less than 1008 then PrePARE works fine (I'm guessing that there is a 1024 byte limit on the list holding the licenses and there is a ~5 byte cost for the separation of entries in the list -- adding an extra entry to the list seems to segfault at 1003 characters total).
While I think that having the multiple patterns is by far the most elegant solution, but we could squeeze the four patterns we have into one more general one. The pattern below (line breaks inserted for readability) would cover the four cases, but it does leave the license text more open to typos.
"^CMIP6 model data produced by .* is licensed under a Creative Commons .* License (https://creativecommons\\.org/.*)\\.
*Consult https://pcmdi\\.llnl\\.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation
requirements and proper acknowledgment\\. *Further information about this data, including some limitations, can be found via
the further_info_url (recorded as a global attribute in this file).*\\. *The data producers and data providers make no warranty,
either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose\\. *All
liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest
extent permitted by law\\.$"
A quick google suggests that the regex C library being used here is very limited in its capabilities, so I haven't been able to work out how to construct a suitable pattern with the different options rather than using .*
a couple of times. There is a library with a greater regex capability (PCRE2) that we could look at for future work.
@durack1 @mauzey1 @taylor13
Would it be worth making a change to work within (what I assume to be -- I could be wrong) the character limit for the license text in this repository now to ensure that ESGF publication chains can pick this up without having to wait for the next version of CMOR? We can then put the more elegant solution into post CMIP6 work.
@sashakames, if you have a moment could you have a look and see whether the same issue is affecting LLNL publication?
thanks @matthew-mizielinski for your efforts to diagnose and provide a likely explanation for the problem. If there is some urgency to fix this problem, I would do as you say. I wouldn't think treating the license string as a special case that could exceed 1024 characters would be that difficult, but @mauzey1 could advise. He can tell us whether we should wait for the more elegant solution or implement the quicker fix you suggested.
@matthew-mizielinski thanks for this, in an offline discussion with @mauzey1 last night we came to the same conclusion that updating the license check string, from 4 to 1 should solve this problem, and ensure that folks using the latest version of the tables can use these with CMOR 3.6.1 and previous. What you have proposed above https://github.com/PCMDI/cmip6-cmor-tables/issues/376#issuecomment-1199179966 (768 chars) is where I was headed.
The nightly builds have tweaks in place that deal with this license change, @mauzey1 can elaborate if required.
I believe this example is a primary reason why we need to move toward a CMOR4, to solve the string limits, but not break anything with CMIP6 that has been successfully using CMOR3.x
@durack1 @matthew-mizielinski
The issue with using the 4 license templates with older versions of CMOR/PrePARE is that whenever a license is invalid, such as when using older licenses that don't match any of the 4 templates, CMOR will try creating an error message stating that the license doesn't match any of the listed values. The message contains a regex string from all of the values found in the registered data for the license attribute concatenated together. Since the templates are all >800 characters long, combining all 4 created a string too long for the C string arrays that were only allocated to 1024 chars.
Note: Strings should actually be 1023 characters long since the 1024th char needs to be null ('\0'
) since C can only process null-terminated strings.
@mauzey1 can you go ahead and implement the simplified license check - the license check string proposed above https://github.com/PCMDI/cmip6-cmor-tables/issues/376#issuecomment-1199179966 (768 chars) please?
I believe this will solve the two standing PrePARE and E3SM-2-0 issue
@durack1
Okay. So we want to get rid of the current 4 templates and replace them with the more flexible one proposed by @matthew-mizielinski?
I agree with this.
@mauzey1 exactly, hopefully this will resolve both issues at once
And I think a non-conforming license should raise a warning, not an error. I don't think that in general (and I know for CMIP specifically) there is no mandate to force folks to adopt one of our recommended licenses.
Hey @matthew-mizielinski I could take a look, but would need some example data that would break of which we have already replicated. We don't check replicas on the presumption that the original data had already passed prepare and the files are identical. That said I'm aware that the Canadians had an issue recently with PrePARE crashing so I could reach out to them on this issue.
@mauzey1 I second the comment of @taylor13 that a non-conforming license should raise a warning, rather than an error and stopping execution.
Also, to attempt to capture the shorten CC license identifier, can we use the template:
"^CMIP6 model data produced by .* is licensed under a Creative Commons .* License
(.*https://creativecommons\\.org/.*)\\. *Consult https://pcmdi\\.llnl\\.gov/CMIP6/TermsOfUse
for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment\\.
*Further information about this data, including some limitations, can be found via the further_info_url
(recorded as a global attribute in this file).*\\. *The data producers and data providers make no warranty,
either express or implied, including, but not limited to, warranties of merchantability and fitness for a
particular purpose\\. *All liabilities arising from the supply of the information (including any liability
arising in negligence) are excluded to the fullest extent permitted by law\\.$"
The tweak from * License (https://creativecomm
to * License (.*https://creativecomm
above, will allow the CC BY 4.0;
to be added (e.g. , ..License (CC BY 4.0; https://creativecomm..
) which is my preference. This should solve the problem that still appeared in https://github.com/E3SM-Project/CMIP6-Metadata/issues/7
Among the 4 possibles license values in CMIP6_CV.json, 2 lines lead to a "segmentation fault" when used by old PrePARE version (relying on CMOR version < 3).
The corresponding lines are:
https://github.com/PCMDI/cmip6-cmor-tables/blob/35914468d2d4ced42895e94a8630350c22d58a3b/Tables/CMIP6_CV.json#L3530
https://github.com/PCMDI/cmip6-cmor-tables/blob/35914468d2d4ced42895e94a8630350c22d58a3b/Tables/CMIP6_CV.json#L3532
By the way, the JSON syntax seems to be valid. Error appears when PrePARE try to use the loaded CV with cmip6_cv methods like
cmip6_cv.check_requiredattributes(table)
, nevertheless, the CV loading few line code above works fine and returns 0:table = cmip6_cv.load_table(cmip6_table)
So, I guess there is/are some characteres in the two licenses that are mis-parsed by CMOR in some way.Removing the two lines from the license list makes PrePARE working perfectly fine.