HEASARC / cfitsio

C and Fortran library for reading and writing FITS files
24 stars 9 forks source link

BUG make string column nulls compliant with the FITS standard #7

Open beckermr opened 11 months ago

beckermr commented 11 months ago

The standard states

Character. If the value of the TFORMn keyword specifies Data Type ’A’, Field n shall contain a character string of zero-or more members, composed of the restricted set of ASCII-text characters. This character string may be terminated before the length specified by the repeat count by an ASCII NULL (hexadecimal code 00). Characters after the first ASCII NULL are not defined. A string with the number of characters specified by the repeat count is not NULL terminated. Null strings are defined by the presence of an ASCII NULL as the first character.

So strings are not NULL terminated, but if they end before the fixed width ends, they are filled with NULLs. Currently cfitsio fills with spaces. The patch here leaves spaces working for ascii tables but uses nulls for binary tables.

beckermr commented 11 months ago

cc @esheldon @olebole

xref: https://github.com/esheldon/fitsio/issues/389

c181gordon commented 10 months ago

This behavior originated with CFITSIO's initial implementation in Fortran (and which is now maintained with cfortran macro wrappers), where null-terminations are not used for strings. It is still compliant with the FITS standard, which only says that the string "may" be null-terminated if it ends early.

This would only seem to matter if one has a string where the trailing whitespace is significant. For example if "hello world " needed to be distinguished from "hello world" when reading back the string. Is this limitation causing problems with your specific use cases? There's also a backwards compatibility issue here in that the new behavior would change the file's checksum and perhaps break existing regression tests. For this reason we'd prefer not to make this change unless there's a pressing need.

beckermr commented 10 months ago

The “may” refers to the possibility of the string ending early, not whether or not a NULL is used to terminate it vs some other character.

beckermr commented 10 months ago

Thus filling with spaces is effectively lengthening a user’s input.

I don’t know why we’d be concerned about backwards compatibility when the code is not obeying the standard in the first place.

c181gordon commented 10 months ago

That's right, the "may" doesn't mean there are other ways to terminate the string early. CFITSIO is simply representing the string with padded trailing whitespace rather than terminating early. I just meant that such strings (with the trailing whitespace) are not in violation of the standard -- strings do not need to have trailing whitespace replaced with an early terminating null. (When CFITSIO reads the string back in, the trailing spaces are removed.) This only becomes a problem when/if the trailing whitespace is significant.

beckermr commented 10 months ago

CFITSIO is simply representing the string with padded trailing whitespace rather than terminating early.

Yes, and this is a violation of the standard. The standard says that a string may end early. Then it says, if it does, it should be terminated with a NULL. When a user passes a string with fewer characters than the width of the string field, how is that not ending a string early?

When CFITSIO reads the string back in, the trailing spaces are removed.

Where is this behavior specified in the FITS standard?

This only becomes a problem when/if the trailing whitespace is significant.

Where in the FITS standard is trailing whitespace in strings in binary tables always assumed to be not significant and so can be truncated as currently done in CFITSIO?

beckermr commented 10 months ago

strings do not need to have trailing whitespace replaced with an early terminating null.

Agreed. The patch above does not remove any trailing whitespace. For binary tables only, it terminates strings that end early with a NULL, as required by the standard.

esheldon commented 10 months ago

This patch guarantees that strings can round trip with fidelity, so you can read back what you wrote without loss of information.

beckermr commented 9 months ago

How can we continue this discussion and/or get this merged?

c181gordon commented 9 months ago

You may always contact us directly at the CFITSIO/CCfits help desk at ccfits@heasarc.gsfc.nasa.gov, or we can continue to discuss this here. Just to reiterate, the backwards compatibility issue is still a major sticking point. We've had some local discussions since the last round, and one proposed solution was to introduce a new function (ie. fits_write_col_tstr()) that would implement this behavior specifically. But this would also need an equivalent read function, since CFITSIO's behavior has always been to remove trailing whitespace upon reading the column strings back in.

olebole commented 9 months ago

Could you stay here? It much more transparent if the discussion remains public. This is also the idea of having a public repository and public pull requests.

c181gordon commented 9 months ago

(To both Matthew and Ole) I'm still curious about your own usage and what has led to this request. Were you dealing with current use cases that failed due to CFITSIO's handling of trailing whitespace? Or are you concerned with usage in the future, where you'd require this behavior?

beckermr commented 9 months ago

I suggest we introduce a configuration switch that is something like "--fits-standard-compliant" or similar that would enable all backwards incompatible fixes related to the legacy routines not being compliant with the fits standard.

olebole commented 9 months ago

I'd disagree here. In Debian, we want to have one shared library that serves all applications, and there should be one API and not several that differ in minor issues, even when one could select them at library compilation time.

esheldon commented 9 months ago

My use case is the need to round trip data, get out exactly what is put in in all cases

beckermr commented 9 months ago

I'm with @esheldon but I think the premise of this question misses the bigger issue.

The FITS standard exists so that multiple tools in different languages can read/write data in an interoperable fashion. When tools in that ecosystem do not obey the standards, it creates bugs and unexpected behavior for users.

Specific use cases or the fact that cfitsio can interpret the data it wrote under some assumptions are red herrings in this discussion.

c181gordon commented 9 months ago

I think we are still in disagreement as to whether this is a violation of the FITS standard (specifically section 7.3.3.1), and we can discuss that further if you'd like. But the reason I was asking about use cases is that it's unavoidable that we weigh the pros and cons of this.

I agree with your statement about how not obeying standards "creates bugs and unexpected behavior for users." But if we were to change the behavior of the current read/write_col_str functions, it would almost certainly create more problems than it fixes. Adding new functions to implement significant-trailing-whitespace behavior would avoid the backwards compatibility issue, but is also not without its downside (ie. adding clutter and complexity to the interface).

beckermr commented 9 months ago

The language is quite plain.

This character string may be terminated before the length specified by the repeat count by an ASCII NULL (hexadecimal code 00). Characters after the first ASCII NULL are not defined. A string with the number of characters specified by the repeat count is not NULL terminated. Null strings are defined by the presence of an ASCII NULL as the first character.

Specifically, if I pass an empty string (which is a null string), cfitsio represents it as a string of blanks. The standard clearly says that null strings are defined by the presence of an asci NULL as the first character. It doesn't get much more clear cut that cfitsio is writing strings incorrectly.

beckermr commented 9 months ago

I'll add that two wrongs don't make a right here. The old behavior should be deprecated and a flag enabling the new behavior should be added. Then after sometime, the new flag becomes the default.

c181gordon commented 9 months ago

The language could be more explicit if it used "shall" rather than "may" when referring to the termination of strings that don't fill the column width. Though you make a good point about the NULL string example. In any case, it wasn't a misinterpretation of this sentence which led to the current implementation. It was due to the fact that CFITSIO was originally implemented in Fortran, where unlike C/C++ strings are not terminated with nulls. ((C)FITSIO is of course still widely used by Fortran programs.)

Aside from the compiler flag objection raised by @olebole , deprecation does not deal with the problem of 30+ years of FITS files which only use padding rather than terminal nulls in their string columns. New behavior for the 'read' function would have no way of knowing if/where a terminal null should have been placed, so presumably it would just return ALL of the trailing whitespace.

olebole commented 9 months ago

I start to be a bit confused now: my understanding of the PR is that this covers writing only. So, if this change is implemented, cfitsio is just fixed to write strings in tables in a FITS conform way. And it can read FITS conform files (with null terminated string) already correctly, right? So what would be the impact of this one change? "Round-trip" should work as before. Just other software, that expect space-filled strings (which violate FITS!) and breaks with null-terminated strings would break. But that is acceptable (and a clear bug there). From this, I would just propose to merge the change without any deprecation as a bugfix. And it would be important to force people to use the fixed version ASAP because otherwise cfitsio continues to fill the world with standards violating FITS files.

Or do I understand it completely wrong?

beckermr commented 9 months ago

There is a question of when to remove trailing white space. A simple heuristic would be if the total string size is the field size. In that case, one could remove the white space. If it is shorter, then it would be kept.

olebole commented 9 months ago

Isn't this a question for reading (which is not covered by this PR)? For writing, one just takes the string as it is.

beckermr commented 9 months ago

Yes agreed, but it's clear this pr won't be merged without addressing compatibility.

c181gordon commented 9 months ago

Just to clarify things, the current CFITSIO behavior is: Write: fill all short strings with trailing blanks, no NULL terminator. Read: Ignore all trailing blanks and return a null terminated string (naturally since this is C).

The original PR was only for modifying the Write behavior. But for consistency, I've been assuming (and I may be wrong) that if we were to make trailing blanks significant now in the output, then they should also be significant upon input. For example if we output "hello[sp][sp]\0" and those 2 trailing spaces matter, it doesn't make sense to read it back in as just "hello\0".

As for the FITS violation, perhaps I haven't been clear in articulating my view. Even after parsing 7.3.3.1 many times, I don't see that a short string that is padded with whitespace to the end violates the FITS standard. (I also consulted with Bill Pence about this and he agreed.) So what CFITSIO has been producing all these years IS valid FITS in these cases.

The behavior that I think @beckermr and @esheldon were really objecting to, and they can correct me if I'm wrong, is that CFITSIO is not respecting the user's placement of the trailing null in the output string, and this is true. But the resulting output format is still valid FITS.

olebole commented 9 months ago

Sure - however isn't it already compatible within cfitsio?

So, a round-trip will continue to work without further changes.

The only problem I see is that other (non-cfitsio) software may not remove trailing spaces, but expect them later in the processing. If cfitsio now writes null-terminated strings, this software will fail. This however means that the software depends on standard violating FITS files created by cfitsio. This should be fixed in that software. A workaround coult be to add the spaces before feeding to cfitsio for write.

IMO this is a good compromise, isn't it?

(@c181gordon our posts crossed, so they partially overlap)

beckermr commented 9 months ago

The output format is a readable fits file if that's what you mean, but it represents the users input data incorrectly. That's the issue. Given an abstract piece of data, the cfitsio representation on disk does not match what the standard says it should be. This is very clear for null strings as I pointed out above.

@olebole fitsio will continue to carry this patch until upstream decides to take action to make this library conformant.

c181gordon commented 9 months ago

@olebole , yes our posts overlapped so I'll address your main question that I didn't touch on in my previous post, which is "why not just limit the change to Write-only as in the original PR?" Yes, this won't affect the way CFITSIO currently reads the string back in (because it chops off all whitespace at that stage), and I originally didn't see a problem with doing this.

However it was pointed out to me in our meetings that by merely replacing a single blank space char with a NULL, we would be changing the checksum of the file. There are several decades of mission software using Ftools/CFITSIO in their regression testing pipelines, and just this minor change in checksum could require a very large amount of work to rectify. I think we would have to offer a very compelling reason for putting them through this.

olebole commented 9 months ago

@c181gordon isn't this a very special case that could be addressed with a compile flag for just this one use case? Having a compilation option for rare edge cases would not be a problem for me.

c181gordon commented 9 months ago

Well yes, it applies only to a VERY special case: a binary table with column strings whose lengths doesn't extend the full width AND which have on or more trailing spaces that are significant. If I've appeared obstinate in this discussion, it's only because something this rare didn't seem to justify the downsides involved in the changes. If I'm wrong about it being very rare (now or as expected in the future), please let me know.

If necessary we could go with a compilation flag. But do you still have the objection that you mentioned earlier involving the Debian packaging?

olebole commented 9 months ago

IMO such a special case would not needed to be handled by Debian -- for the software we distribute we can adjust tests locally if really needed; I doubt that.

But I am not the maintainer of cfitsio (this is @aurel32). He should get the chance to veto here.

esheldon commented 9 months ago

This was brought up in issues for the python package fitsio, which "wraps" cfitsio for use in python. We had users complaining they could not round trip strings. I personally had had this issue as well.

I haven't done a survey of people in astronomy, but in my circles this was a common cause of complaint and required ad hoc fixes by end users to work around it.

olebole commented 9 months ago

@aurel32 as a short summary:

This patch changes the writing of strings in FITS tables from paddings spaces to FITS conform zero-terminated strings. The identified drawback here is that for the same input, the checksums of the files change, which may make some regression tests fail. This could be handled by adding a compilation flag for these cases. As Debian will only have one (default) library, some tests would fail and need an adjustment.

I would approve this change as suitable for Debian, however you are the maintainer.

c181gordon commented 9 months ago

@esheldon , Thanks for providing a use case. Could you please provide more details about what your round trip involved. For example did it involve a string with significant trailing whitespace that your were passing from Python to a FITS file and back again, such as:

Python: "hello[sp][sp]" -> FITS file: "hello[sp]....[sp]" ->back to Python: "hello"

aurel32 commented 9 months ago

@olebole: Thanks for the notice and the summary.

From the Debian point of view, whatever solution is chosen is fine, we will fix the affected packages. I did a quick with Astropy, and it seems to already have this behaviour, so I do not expect major breakages.

That said, as pointed by @olebole, I will stress out that the changes should not made configurable, otherwise we will quickly see different versions of cfitsio, and we will be impossible to provide a single shared version of cfitsio. In addition that will lessen the chances that fixes for the affected packages get accepted.

saimn commented 9 months ago

Hi, astropy's io.fits maintainer here :)

So on astropy side, io.fits is already stripping automatically the trailing spaces (relying on Numpy's chararray subclass, which ignores trailing spaces, but its usage is discouraged ... https://numpy.org/doc/stable/reference/arrays.classes.html#character-arrays-numpy-char). But this is an issue for us when using the astropy.table.Table interface, where currently trailing spaces are not removed (e.g. https://github.com/astropy/astropy/issues/15540, https://github.com/astropy/astropy/issues/2608, and more).

It would be better if cfitsio stopped writing files with whitespace padding, but given that there are already many files like that in the wild we would still need to strip spaces anyway.

It was my understanding that trailing spaces where considered as not significant in the Standard, but the formulation is actually not so clear:

Character. If the value of the TFORMn keyword specifies Data Type ’A’, Field n shall contain a character string of zero-or- more members, composed of the restricted set of ASCII-text characters. This character string may be terminated before the length specified by the repeat count by an ASCII NULL (hex- adecimal code 00). Characters after the first ASCII NULL are not defined. A string with the number of characters specified by the repeat count is not NULL terminated. Null strings are defined by the presence of an ASCII NULL as the first character.

beckermr commented 9 months ago

Thanks for chiming in @saimn!

It was my understanding that trailing spaces where considered as not significant in the Standard, but the formulation is actually not so clear:

Yeah I agree but I'd take the argument that "trailing spaces not being significant is not in the standard" a bit further.

I would have thought that if the standard's authors meant trailing spaces to be non-significant, they would simply say so in plain language.

gpdf commented 2 weeks ago

if the standard's authors meant trailing spaces to be non-significant, they would simply say so in plain language.

As they did for header keywords.

gpdf commented 2 weeks ago

A note: the language in the FITS standard concerning the use of NULL in binary-table string-valued columns has been quite stable since it first appeared in draft form in section A.6 of the FITS 1.0 document in 1993:

https://fits.gsfc.nasa.gov/standard10/fits_standard10.pdf

A character string may be terminated before its explicit length by an ASCII NULL character. An ASCII NULL as the first character will indicate an undefined string i.e. a NULL string. Legal characters are printable ASCII characters in the range ' ' (hex 20) to '~' (hex 7E) inclusive and ASCII NULL after the last valid character. Strings the full length of the field are not NULL terminated.

Some superficial searching on FITSBITS didn't come up with any substantive discussion of trailing blanks in binary table columns, over many years of history.

I do read the "may" in the standard slightly differently from the OP, I think: I believe the FITS standard is basically about the file format, not about client software behavior more generally, so I read the "may" as "in a conformant file, a binary-table fixed-length string column 'may' contain NULLs, and if it does, they mean early termination of the string value". Read this way, I think it's clear that there's little or no point to introducing this concept while also intending trailing blanks to be insignificant / truncated on read -- if the latter were the rule, why even bother allowing the NULL? It doesn't provide any semantic capability that wasn't already there.

I'd like to understand more rigorously what the controversial point that's blocking this PR is. Backward compatibility was mentioned, and that has two branches: possible changes in the behavior on read for existing FITS files; and possible changes on write when existing, stable software that uses cfitsio is re-built against a newer version.

Some questions:

The change requested in the PR is only on write, apparently, and it's "write a NULL, instead of padding with trailing blanks, if the in-memory string is shorter than the declared column width".


As an aside, VOTable considers blanks significant in strings, but does pad fixed-length strings with trailing blanks if the <TABLEDATA> representation's data field is shorter than the declared length of the string. Of course, VOTable does also explicitly support variable-length strings. In <TABLEDATA> format, variable length strings of length 0 (which most of us would colloquially call a "null string") are not distinguishable from actual NULL values (in the sense of a database NULL); in <BINARY2> format, they are distinct.

beckermr commented 2 weeks ago

Thanks for the comments!

What changes should happen, if any, on read for binary-table string values that do not contain NULL characters?

None IMHO. This is important for ensuring things are backwards compatible for already written files.

What changes should happen, if any, on read for binary-table string values that do contain NULL characters?

The data should be returned to the user as it is. For someone in C, they will get a null-terminated string from a file with nulls, terminated at the first null.

What changes should happen, if any, on write when the in-memory column value is shorter than the column width (whether via null termination or a counted-string representation)?

IMHO, the code should terminate the string with nulls, as this PR does.