PCMDI / cmip6-cmor-tables

JSON Tables for CMOR3 to create CMIP6 dataset
BSD 3-Clause "New" or "Revised" License
31 stars 46 forks source link

License list leads to segmentation fault in old PrePARE versions. #376

Closed glevava closed 2 years ago

glevava commented 2 years ago

Among the 4 possibles license values in CMIP6_CV.json, 2 lines lead to a "segmentation fault" when used by old PrePARE version (relying on CMOR version < 3).

The corresponding lines are:

https://github.com/PCMDI/cmip6-cmor-tables/blob/35914468d2d4ced42895e94a8630350c22d58a3b/Tables/CMIP6_CV.json#L3530

https://github.com/PCMDI/cmip6-cmor-tables/blob/35914468d2d4ced42895e94a8630350c22d58a3b/Tables/CMIP6_CV.json#L3532

By the way, the JSON syntax seems to be valid. Error appears when PrePARE try to use the loaded CV with cmip6_cv methods like cmip6_cv.check_requiredattributes(table), nevertheless, the CV loading few line code above works fine and returns 0: table = cmip6_cv.load_table(cmip6_table) So, I guess there is/are some characteres in the two licenses that are mis-parsed by CMOR in some way.

Removing the two lines from the license list makes PrePARE working perfectly fine.

matthew-mizielinski commented 2 years ago

Hi @glevava,

I've tested this with CMOR v3.6.1 and I'm seeing segfaults if there is more than one entry in the list (I thought I'd tested this before the change to the CVs file was merged). We have tried this with the one of the testing builds on the PCMDI conda channel, cmor=3.6.1.2022.03.14.18.32.ga7f960d with python 3.8 (see here), and it worked fine, but we haven't been able to understand the change that leads to this behaving itself.

CMOR v3.7.0 is fairly close as I understand it -- would updating to use this, once released, be an option? (I appreciate that this will probably require updates across all active ESGF nodes).

glevava commented 2 years ago

I'm not sure this is due to the number of entries in the list as I have now two items in the list and it works like a charm. Anyway, I tried to install the latest CMOR stable release but I gave up because we are at the end of CMIP6, security policy of our HPC don't provide internet access (so I have to manually transfer all CMOR binaries and dependencies) and finally it appeared to me modifying the JSON was a simpler workaround ; )

In the future (CMIP7?), we will change our workflow and the new release of CMOR could be an option yes. Nevertheless, I must admit we didn't use CMOR at IPSL nor at CNRM, our GCMs directly produce standardized CMIP data. So we only need CMOR to run PrePARE which is quite overkill. It would be appreciated to have PrePARE as standalone tool maybe in full Python. This would be also in line with a controlled vocabulary service we are thinking about in the ESGF in order to check and provide metadata at the different steps of the publication process.

durack1 commented 2 years ago

@glevava just an FYI, we (@matthew-mizielinski, @sashakames, @taylor13) have started thinking about how to cleanup what we currently have so to move forward in a more modular (and non-duplicated) way, a sketch of the potential path can be found in https://github.com/WCRP-CMIP/CMIP6Plus_CVs/issues/1#issuecomment-1182574956 - happy to hear your views on this (and ask away if the vagueness of the info requires more detail)

mauzey1 commented 2 years ago

@durack1 The link in your post above doesn't work.

durack1 commented 2 years ago

@mauzey1 thanks, apologies this is a currently private repo, will add you and @glevava

matthew-mizielinski commented 2 years ago

I've just had a slightly deeper dig into this as I think it is affecting ESGF publication at CEDA too, and I may have a possible mechanism for why this is failing at v3.6.1 (not sure why it is working with the nightly builds, unless there is a change in one of the limits);

I've had a look at the way in which PrePARE is failing and by trial and error found that if I shorten the license strings such that the total number of characters across all four entries in the list is less than 1008 then PrePARE works fine (I'm guessing that there is a 1024 byte limit on the list holding the licenses and there is a ~5 byte cost for the separation of entries in the list -- adding an extra entry to the list seems to segfault at 1003 characters total).

While I think that having the multiple patterns is by far the most elegant solution, but we could squeeze the four patterns we have into one more general one. The pattern below (line breaks inserted for readability) would cover the four cases, but it does leave the license text more open to typos.

"^CMIP6 model data produced by .* is licensed under a Creative Commons .* License (https://creativecommons\\.org/.*)\\. 
*Consult https://pcmdi\\.llnl\\.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation 
requirements and proper acknowledgment\\. *Further information about this data, including some limitations, can be found via 
the further_info_url (recorded as a global attribute in this file).*\\. *The data producers and data providers make no warranty, 
either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose\\. *All 
liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest 
extent permitted by law\\.$"

A quick google suggests that the regex C library being used here is very limited in its capabilities, so I haven't been able to work out how to construct a suitable pattern with the different options rather than using .* a couple of times. There is a library with a greater regex capability (PCRE2) that we could look at for future work.

@durack1 @mauzey1 @taylor13

Would it be worth making a change to work within (what I assume to be -- I could be wrong) the character limit for the license text in this repository now to ensure that ESGF publication chains can pick this up without having to wait for the next version of CMOR? We can then put the more elegant solution into post CMIP6 work.

matthew-mizielinski commented 2 years ago

@sashakames, if you have a moment could you have a look and see whether the same issue is affecting LLNL publication?

taylor13 commented 2 years ago

thanks @matthew-mizielinski for your efforts to diagnose and provide a likely explanation for the problem. If there is some urgency to fix this problem, I would do as you say. I wouldn't think treating the license string as a special case that could exceed 1024 characters would be that difficult, but @mauzey1 could advise. He can tell us whether we should wait for the more elegant solution or implement the quicker fix you suggested.

durack1 commented 2 years ago

@matthew-mizielinski thanks for this, in an offline discussion with @mauzey1 last night we came to the same conclusion that updating the license check string, from 4 to 1 should solve this problem, and ensure that folks using the latest version of the tables can use these with CMOR 3.6.1 and previous. What you have proposed above https://github.com/PCMDI/cmip6-cmor-tables/issues/376#issuecomment-1199179966 (768 chars) is where I was headed.

The nightly builds have tweaks in place that deal with this license change, @mauzey1 can elaborate if required.

I believe this example is a primary reason why we need to move toward a CMOR4, to solve the string limits, but not break anything with CMIP6 that has been successfully using CMOR3.x

mauzey1 commented 2 years ago

@durack1 @matthew-mizielinski

The issue with using the 4 license templates with older versions of CMOR/PrePARE is that whenever a license is invalid, such as when using older licenses that don't match any of the 4 templates, CMOR will try creating an error message stating that the license doesn't match any of the listed values. The message contains a regex string from all of the values found in the registered data for the license attribute concatenated together. Since the templates are all >800 characters long, combining all 4 created a string too long for the C string arrays that were only allocated to 1024 chars.

Note: Strings should actually be 1023 characters long since the 1024th char needs to be null ('\0') since C can only process null-terminated strings.

durack1 commented 2 years ago

@mauzey1 can you go ahead and implement the simplified license check - the license check string proposed above https://github.com/PCMDI/cmip6-cmor-tables/issues/376#issuecomment-1199179966 (768 chars) please?

I believe this will solve the two standing PrePARE and E3SM-2-0 issue

mauzey1 commented 2 years ago

@durack1

Okay. So we want to get rid of the current 4 templates and replace them with the more flexible one proposed by @matthew-mizielinski?

I agree with this.

durack1 commented 2 years ago

@mauzey1 exactly, hopefully this will resolve both issues at once

taylor13 commented 2 years ago

And I think a non-conforming license should raise a warning, not an error. I don't think that in general (and I know for CMIP specifically) there is no mandate to force folks to adopt one of our recommended licenses.

sashakames commented 2 years ago

Hey @matthew-mizielinski I could take a look, but would need some example data that would break of which we have already replicated. We don't check replicas on the presumption that the original data had already passed prepare and the files are identical. That said I'm aware that the Canadians had an issue recently with PrePARE crashing so I could reach out to them on this issue.

durack1 commented 2 years ago

@mauzey1 I second the comment of @taylor13 that a non-conforming license should raise a warning, rather than an error and stopping execution.

Also, to attempt to capture the shorten CC license identifier, can we use the template:

"^CMIP6 model data produced by .* is licensed under a Creative Commons .* License
(.*https://creativecommons\\.org/.*)\\. *Consult https://pcmdi\\.llnl\\.gov/CMIP6/TermsOfUse
for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment\\.
*Further information about this data, including some limitations, can be found via the further_info_url
(recorded as a global attribute in this file).*\\. *The data producers and data providers make no warranty,
either express or implied, including, but not limited to, warranties of merchantability and fitness for a
particular purpose\\. *All liabilities arising from the supply of the information (including any liability
arising in negligence) are excluded to the fullest extent permitted by law\\.$"

The tweak from * License (https://creativecomm to * License (.*https://creativecomm above, will allow the CC BY 4.0; to be added (e.g. , ..License (CC BY 4.0; https://creativecomm..) which is my preference. This should solve the problem that still appeared in https://github.com/E3SM-Project/CMIP6-Metadata/issues/7