Open wachsylon opened 5 years ago
Originally, we had different string lengths for different attributes, but that go simplified, as I recall, so that all have the same length 1024 (internally in CMOR). I think there was some issue with these strings occupying lots of storage, so we didn't make it huge. I don't think there is a fundamental reason for limiting the attributes to 1024 characters, but it would be better to not have a bunch of unused space reserved for such things as source_id and experiment_id. Which attribute needs more space?
We discussed to specify a paper for references
.
Would it be difficult to double the length of the references
attribute, but not the others? Would this cause any problems? Would doubling be sufficient? Note that if any single attribute is very long it will make the ncdump -h
result less easy to digest.
There have been changes made in CMIP6_CV.json in cmip6-cmor-tables that add strings with lengths that exceed CMOR's current max length of 1024. This has caused tests in that repo to fail.
I think we should consider increasing CMOR_MAX_LENGTH
to 2048 or 4096 to allow longer strings for attributes. I'm not sure how much that would impact file size.
@durack1 @doutriaux1 Do you know why 1024 was selected for the maximum string length? Do you think it could be increased now?
Trouble is CMOR_MAX_LENGTH is for almost all strings created by CMOR, and the internal table is already huge with wasted empty space. I don't think we want to double it. A better solution would be to define different groups of string variables with different max. length.
@mauzey1 I tend to agree with @taylor13 that we should think a little about memory usage when the source
attribute is the only string that needs changing, other fields (such as activity_participation
etc) are controlled, well under the 1024 limit and will not change. Only the institution_id
and source
in the source_id
are likely to need expansion.
@taylor13 is right the tables size is now huge we should create groups:
CMOR_TINY_LENGTH=16 CMOR_SHORT_LNGTH=32 CMOR_MESSAGE_LENGTH=128 CMOR_PARAGRAPH_LENGTH=1024 CMOR_TEXT_LENGTH=2048 CMOR_BLOG_LENGTH=4096
Or someting like that.
most can probably fit in the first 3 and it would save memory and speed up things
@mauzey1 as part of the fix for this, we'll need to implement a test so that the nightly changes to CMIP6_CV.json
are tested and not merged in the case that such an error occurs
I support the @doutriaux1 suggestion above. As I recall the original FORTRAN version was implemented that way. We would need to document the max length of each string under the control of users (or in the CV's read by CMOR).
@taylor13 with a pure python implementation string lengths would not be fixed at compilation time, so the technical challenge would be removed (python can dynamically size any string as required), but we'd need to implement some guidance so ridiculous entries are flagged/warning at a minimum, and potentially error is particularly egregious cases
Good point. I understood that we were dropping the FORTRAN option, but didn't realize that nothing would be written in C. Will pure python be fast enough? Can it be parallelized, or aren't we going to pursue that?
@taylor13 @mauzey1 had planned to test the performance by leveraging the latest libraries associated with xarray, so DASK, etc. From my understanding, the current performance bottlenecks relate to the netcdf library, but this will be important to benchmark as we progress. I know that @matthew-mizielinski has been monitoring the CMOR performance on their new systems, so it will be important to keep track of this as we go
Hi, is there a reason why CMOR_MAX_STRING is 1024 or could we increase it? https://github.com/PCMDI/cmor/blob/829a78ae2e6dddb0e625807bff49c36de3eb72e3/include/cmor.h#L11 Best regards, Fabi