MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.29k stars 646 forks source link

XYZWriter can't append a file #3836

Closed jaclark5 closed 11 months ago

jaclark5 commented 1 year ago

Expected behavior

Actual behavior

Code to reproduce the behavior

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import LAMMPSdata

universe = mda.Universe(LAMMPSdata)

group = universe.select_atoms("all")
with mda.Writer("output.xyz", group.n_atoms) as W:
    for ii,ts in enumerate(universe.trajectory):
        W.write(group)

with mda.Writer("output.xyz", group.n_atoms, mode="at") as W:
    for ii,ts in enumerate(universe.trajectory):
        W.write(group)

universe2 = mda.Universe("output.xyz", format="XYZ")
print(len(universe.trajectory), len(universe2.trajectory))

Current version of MDAnalysis

jaclark5 commented 1 year ago

This issues were resolved in this fork

https://github.com/jaclark5/mdanalysis/tree/xyz_writer

hmacdope commented 1 year ago

@jaclark5 a bit of discussion required on this, as I (personally) am unsure about the status of appending to existing files in writers.

@MDAnalysis/coredevs do any other writers allow appending? Is this something we want to support going forward? I imagine this could lead to errors where you append to a file when unintended, but I suppose if you specify the file mode you are already being quite brave and should know what you are doing?

orbeckst commented 1 year ago

These issue are two different problems. To keep the discussion focused, could you please open a second issue for point 2 ("Write out XYZ with atom types when names or elements are not available (applicable when LAMMPS files are imported)"), @jaclark5 ?

Thank you, focused issues really help us to partition time that we can spend on things.

orbeckst commented 1 year ago

Regarding appending: I would like to see a survey of what our writers can do — can any of them append? (@lilyminium did you ever look into this question as part of the User Guide: Formats review?)

IAlibay commented 1 year ago

open a second issue for point 2 ("Write out XYZ with atom types when names or elements are not available (applicable when LAMMPS files are imported)")

Yes please, I have concerns about whether or not this should be allowable under the classical understanding of the XYZ file format and I think it needs further discussion.

IAlibay commented 1 year ago

Regarding appending - I'd like to better understand the use case. Under what case is this necessary? Why is it not feasible to have all cases of writing to a single file happening under the same context manager?

I think understand how it's being used will help clarify whether or not this is a feature that we need or if we need to provide separate tooling.

jaclark5 commented 1 year ago

I don't see the harm in adding a keyword argument as most users won't even know it's there. There's a package called Zeno that our group uses that takes XYZ files and rather than making a few hundred individual files and then listing their names it's convenient to add many frames (even with a different number of atoms) to a single file. I used MDAnalysis to choose the atoms. If it's a keyword, it doesn't change normal user workflow.

IAlibay commented 1 year ago

I don't see the harm in adding a keyword argument as most users won't even know it's there.

Unfortunately the problem comes down to "long term maintenance", small changes to allow for tweaks in without careful examination / planning can very quickly snowball into messes such as our PDB writer or the DCD format and its many flavours.

It's an annoying bureaucracy of long term maintenance 😅

There's a package called Zeno that our group uses that takes XYZ files and rather than making a few hundred individual files and then listing their names it's convenient to add many frames (even with a different number of atoms) to a single file. I used MDAnalysis to choose the atoms.

So if I understand your use case better, you want to create XYZ trajectories but selecting different atomgroups either in a static frame or across a trajectory?

Disregarding the variable number of atoms (there's a slight issue here where we can't roundtrip such a file, but we can discuss that after).

Would it make more sense to have to access pattern not involve several file open/close calls but instead consider a solution that does:

> open file

> write atomgroup 1
> write atomgroup 2
> write atomgroup N

> close file

?

lilyminium commented 1 year ago

@IAlibay not totally sure we can do that without an append mode -- the writers require the number of atoms to initialize and I think XYZWriter might run a check for each group that's written.

I personally don't have a strong opinion either way on whether to support custom modes are not, but I'm leaning against for now because of the additional maintenance burden. If more people need this use case, I'd be more in favour of adding the keyword, but for now it doesn't seem like too many steps to concatenate manually at the end with cat or even Python postprocessing?

orbeckst commented 1 year ago

I feel if we support writing trajectories with variable number of atoms per frame then we should also support reading them. And we don't — topologies with time varying number atoms is a major, major project in MDAnalysis. I am not even sure if anyone is willing to tackle it, even if they were to get paid to do it ;-).

So I am tending towards against supporting it inside core.

I'd recommend a custom writer (MDAKit or similar).

orbeckst commented 1 year ago

This still leaves the question for append mode open (for fixed number of atoms).

(I understand that enabling append mode will then allow the writing of custom XYZ trajectories and there's nothing we can do about it but that's just users doing what works for them and that would be fine with me. EDIT: I am not 100% sold on adding append mode until we better understand how it fits in with everything else — just echoing https://github.com/MDAnalysis/mdanalysis/issues/3836#issuecomment-1255397802 about long-term maintenance.)

IAlibay commented 1 year ago

@IAlibay not totally sure we can do that without an append mode -- the writers require the number of atoms to initialize and I think XYZWriter might run a check for each group that's written.

Sorry my bad yeah I should have been clearer here - any changes here to meet this need would require some large change in the way WriterBase works.

All I'm trying to do here is understand the usage pattern needed. Even if we say no for now, I'd like to have a record here of what @jaclark5 would prefer so that things are clear if/when we revisit this.

richardjgowers commented 1 year ago

I think adding an optional append mode to WriterBase has its uses, then XYZ (and other ascii formats) will likely be able to quickly use this, binary formats might require some thought.

IAlibay commented 1 year ago

So my initial view here, echoing @lilyminium and @orbeckst is that we have to be really careful about maintenance and specifically writers becoming increasingly dissimilar to each other.

  1. I don't really like the idea of adding special cases for one of the writers. If anything, it often just ends up creating a "why did we do this? do we still need it?" style problem a few years after the fact.

  2. As @orbeckst mentioned, if we write it, we should read it. Variable atoms in trajectories is unfortunately not in the plans any time soon.

If we want to have an "append" style approach, then I would like to (as mentioned above) ask for more details on the intended usage pattern.

My view is that in that case (no atom number changes but unlimited append) enabling (documenting mostly) the non context manager approach (i.e. where you manually open the file, call the relevant reader methods for writing, and then close the file) would be much better here.

jaclark5 commented 1 year ago

It seems I caused more problems by bringing up my application with variable atoms. Regardless, I didn't think exposing an inherent option would be an issue.

IAlibay commented 1 year ago

It seems I caused more problems by bringing up my application with variable atoms.

I wouldn't necessarily say it's "problems", it's good to know that these applications exist, and in an ideal world we should support them.

Regardless, I didn't think exposing an inherent option would be an issue.

Could I ask if you could please give more information on your ideal use case here? I.e. as mentioned above, would it be more reasonable if you didn't open & close the file multiple times or is there a specific need to open the file in append mode after the fact?

I'm asking this so that we can see if there's a possibility for a bigger fix. Ideally it'd be good to have a definitive solution to this issue that helps folks beyond just the XYZWriter.

jaclark5 commented 1 year ago

Fair enough, I guess I wouldn't need to open and close it multiple times if I could write a different number of atoms along the trajectory, but it seems that you guys aren't interested in that scenario. I think a plausible scenario that you might be interested in would be an incremental conversion of files. Gromacs and LAMMPS each have strengths with their features and last I checked there isn't a good tool to convert between the two, but MDAnalysis would allow that (probably I haven't tried). So if I ran a long simulation in gromacs (for speed) and then wanted it in the form of a lammps trajectory for further analysis I can do that. Then, if I later realized I needed to run the simulation longer, I could append it to the lammps dump file without reloading and writing the previous set. As another scenario, I could import my trajectory, and then run analysis to calculate per atom or per molecule properties and then (probably I'll have to check and possibly open an issue) write those out in a lammps dump file. Then those properties can be used in VMD or Ovito to color code the visualization state. This procedure it commonly achieved within LAMMPS or in Ovito's python interface during post-processing (now behind a paywall): MDAnalysis is akin to the latter. So again, if I were to decide to extend my trajectory, I would need to reload and reanalyze the previous trajectory. Given, you may recognize that in each of these scenarios I could just have two files and import them sequentially in the next step.

hmacdope commented 11 months ago

As this became a broader API question, in #3861 and #3875 I am closing this particular issue.