Partial data update to existing YAML file (preserving comments, whitespace)

By principal, and in the current scope of pydantic-yaml, a file saved back from a Pydantic object only contains features stored in the Pydantic object. This affects flexible syntax (quoting style, flow styles with […], {…}, whitespace) and features that have no representation in Pydantic (comments).

For our use case, Pydantic support is essential (so other YAML parsers are not sufficient). However, comments are not only nice to have but can be extremely valuable documentation worth to preserve. Also, proper use of blank lines keeps files readable. When we use pydantic-yaml to read, update and save a YAML file, it is completely regenerated and reformatted, changing all formatting and whitespace to the encoder's defaults and dropping all comments.

I think this is not an uncommon use case.

Are there any recommendable strategies how one can have both, Pydantic support and preserving most/all unchanged tokens?

Like using a comments-preserving parser and merging the updated Pydantic object into the parser's in-memory representation etc.
Storing comments/whitespace surrounding a value in a dataclass's metadata/Pydantic's FieldInfo extra?

Hi, yes, saving comments and existing formatting is a thing that has been requested before, and I've tried to figure out how to support it. That's one of the reasons the backend engine is ruamel.yaml which supports round-tripping most data as well!

However, it's very unclear how to support this specifically with Pydantic's tools.

Ruamel uses custom CommentedMap and similar classes to support saving this comment metadata.

In theory, when parsing, we can save this raw form to a field in the Pydantic model object (custom __commented_data field?), however the trouble occurs when we have to update any of this data. Since @validators and @root_validators can change everything on the fly (e.g. turning lists into dicts, scalars into whatever, whatever into scalars...), there's only an indirect mapping between incoming values and Pydantic fields. There's no way to get the inverse mapping between Pydantic fields and the incoming data, either, unless we hope that the incoming data is of the right type. 😅 I guess when you dump the object, pydantic-yaml can try to reverse-merge it into the Ruamel objects whenever the keys and columns match.

An alternative implementation, as you mentioned, is to keep our comments ourselves. FieldInfo is a class variable, not an instance variable, so we can't use that (unless you want the same comment to be generated for every value... that has it's uses, i.e. documenting what the field should be!). We'd still need to keep some sort of __commented_data field for each object. However, you still probably have to you keep a "parallel" dictionary-list-etc. hierarchy, since nested field will be broken whenever you re-create a child model. Modifying the schema of child classes is not fun - I've tried that hack and it's not worth it...

So, basically, you must keep some parallel representation of the object with comments, but that also includes things possibly breaking due to validators changing the types under our feet. You would also need to either comments on assignment or upon dumping. (Not mentioned, but YAML links are probably out of the question, too - you'd need a YAML parser with a Pydantic backend, no the other way around 😉). Perhaps preserving comments on a "well, we tried" basis - "if your schema changes, you will lose your comments" - would be okay for your "configuration is changing" use case.

Unfortunately, I don't think I have the engineering time to support either way in pydantic-yaml at this time, but I'd be very open to reviewing code; even just working prototypes of this functionality would be very useful!

I just realized that I didn't even get to whitespace preservation... I'm not sure even ruamel.yaml does that consistently, though probably has the best option there.

Some other thoughts (mostly open-ended or opinionated):

pydantic-yaml can possibly use a @root_validator(pre=True) to extract the ruamel.yaml information into the model, and if a comment metadata field is available - add it to the field. Lots of care is needed for extra = "forbid" and making sure not to dump that field... but it seems possible.
Do you preserving existing comments when the field value changes?
What to do with sub-models that also have comments?
How do you support "default comments"? (My guess is an "extra" field or the "description" from FieldInfo).

Reiterating that I'd love to support this, but the custom-yaml-engine tag on this issue is frankly what it seems like would be the cleanest implementation...

NowanIlfideme / pydantic-yaml

Partial data update to existing YAML file (preserving comments, whitespace) #74