CMOR_MAX_STRING - Githubissues

wachsylon commented 5 years ago

Hi, is there a reason why CMOR_MAX_STRING is 1024 or could we increase it? https://github.com/PCMDI/cmor/blob/829a78ae2e6dddb0e625807bff49c36de3eb72e3/include/cmor.h#L11 Best regards, Fabi

taylor13 commented 5 years ago

Originally, we had different string lengths for different attributes, but that go simplified, as I recall, so that all have the same length 1024 (internally in CMOR). I think there was some issue with these strings occupying lots of storage, so we didn't make it huge. I don't think there is a fundamental reason for limiting the attributes to 1024 characters, but it would be better to not have a bunch of unused space reserved for such things as source_id and experiment_id. Which attribute needs more space?

wachsylon commented 5 years ago

We discussed to specify a paper for references .

taylor13 commented 5 years ago

Would it be difficult to double the length of the references attribute, but not the others? Would this cause any problems? Would doubling be sufficient? Note that if any single attribute is very long it will make the ncdump -h result less easy to digest.

mauzey1 commented 5 years ago

There have been changes made in CMIP6_CV.json in cmip6-cmor-tables that add strings with lengths that exceed CMOR's current max length of 1024. This has caused tests in that repo to fail.

I think we should consider increasing CMOR_MAX_LENGTH to 2048 or 4096 to allow longer strings for attributes. I'm not sure how much that would impact file size.

@durack1 @doutriaux1 Do you know why 1024 was selected for the maximum string length? Do you think it could be increased now?

taylor13 commented 5 years ago

Trouble is CMOR_MAX_LENGTH is for almost all strings created by CMOR, and the internal table is already huge with wasted empty space. I don't think we want to double it. A better solution would be to define different groups of string variables with different max. length.

durack1 commented 5 years ago

@mauzey1 I tend to agree with @taylor13 that we should think a little about memory usage when the source attribute is the only string that needs changing, other fields (such as activity_participation etc) are controlled, well under the 1024 limit and will not change. Only the institution_id and source in the source_id are likely to need expansion.

doutriaux1 commented 5 years ago

@taylor13 is right the tables size is now huge we should create groups:

CMOR_TINY_LENGTH=16 CMOR_SHORT_LNGTH=32 CMOR_MESSAGE_LENGTH=128 CMOR_PARAGRAPH_LENGTH=1024 CMOR_TEXT_LENGTH=2048 CMOR_BLOG_LENGTH=4096

Or someting like that.

doutriaux1 commented 5 years ago

most can probably fit in the first 3 and it would save memory and speed up things

durack1 commented 5 years ago

@mauzey1 as part of the fix for this, we'll need to implement a test so that the nightly changes to CMIP6_CV.json are tested and not merged in the case that such an error occurs

taylor13 commented 4 months ago

I support the @doutriaux1 suggestion above. As I recall the original FORTRAN version was implemented that way. We would need to document the max length of each string under the control of users (or in the CV's read by CMOR).

durack1 commented 4 months ago

@taylor13 with a pure python implementation string lengths would not be fixed at compilation time, so the technical challenge would be removed (python can dynamically size any string as required), but we'd need to implement some guidance so ridiculous entries are flagged/warning at a minimum, and potentially error is particularly egregious cases

taylor13 commented 4 months ago

Good point. I understood that we were dropping the FORTRAN option, but didn't realize that nothing would be written in C. Will pure python be fast enough? Can it be parallelized, or aren't we going to pursue that?

durack1 commented 4 months ago

@taylor13 @mauzey1 had planned to test the performance by leveraging the latest libraries associated with xarray, so DASK, etc. From my understanding, the current performance bottlenecks relate to the netcdf library, but this will be important to benchmark as we progress. I know that @matthew-mizielinski has been monitoring the CMOR performance on their new systems, so it will be important to keep track of this as we go

PCMDI / cmor

CMOR_MAX_STRING #530