MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.29k stars 646 forks source link

Can't open NetCDF trajectories created by cpptraj #4073

Closed DrDomenicoMarson closed 1 year ago

DrDomenicoMarson commented 1 year ago

The mda.coordinates.TRJ.NCDFReader expect an Amber-generated NetCDF trajectory to have time information in the variables object of the NetCDF file. This is true for NetCDF files created by AMBER's MD engine, but that's not the case for files generated by AMBER's CPPTRAJ.

If I try to read a NetCDF file created by CPPTRAJ, I end up with the error message raised in line 532 of mda.coordinates.TRJ.NCDFReader, as the key time doesn't exist in the trajectory variables.

The assumption of the existence of the time variable is made in other parts of the code as well. It would be great to not have this assumption in place.

I edited the code in my fork, and I made it to read successfully NetCDF files generated by CPPTRAJ, but I don't know if how I made it could break something else in the workflow.

Simply, anywhere the code expected to read time information from the time variable, I replaced this information with the current frame index (for dt there is already a default behavior that uses dt=1ps if not read from the trajectory).

There is a problem with this implementation, although I don't know if this is a big problem after all: if I create a mda.Universe starting from a list of NetCDF files, the resulting universe does have not an incrementing time, as each concatenated trajectory file restarts its time from 0

for example, I have 5 trajectories of 100 frames, if I create an Universe from them and then call

 for ts in u.trajectory:
    print(ts.time)

I see values of 0 to 99 repeated 5 times, instead of 0 to 499.

If you want I can create a pull-request with the small modifications I made, if this behaviour is not expected to break some other functionalities of mda, as I just started using it and I'm not an expert at all!

IAlibay commented 1 year ago

This is a complicated one, and I think may come down to an interpretation of how the NetCDF format is defined.

For context the NetCDF trajectory convention is here: https://ambermd.org/netcdf/nctraj.xhtml

As per the convention (when talking about the time variable):

When coordinates on the frame dimension have a temporal sequence (e.g. they form a molecular dynamics trajectory), creators shall define this variable and write a float for each frame coordinate representing the number of picoseconds of simulated time elapsed since the start of the trajectory.

The word shall is quite important here, since it indicates that this is imperative to how the format should work.

However, we also find above:

All data variables are optional. Some data variables have dependencies on other data variables, as described below. 

Whilst the first half of this statement would make it seem that time is optional, my interpretation of the latter (and the statement above), is that time is a requirement when dealing with any type of temporal data.

As such I would have expected cpptraj to write the time entry to any trajectory.

I think this is probably a question for @drroe, as both an author of that format spec and the lead dev of cpptraj - what should be the expected behaviour here?

DrDomenicoMarson commented 1 year ago

I really hope that CPPTRAJ will eventually adhere to AMBER's own conventions :-)

In the meantime, maybe a workaround to be able to use CPPTRAJ-created trajectories could be useful, as if and when CPPTRAJ will be "fixed", the already generated trajectories will still be unusable, what do you think?

IAlibay commented 1 year ago

I'll be honest, I'm not particularly keen to start adding exceptions for things that break format spec (not saying that this is a break in format spec, I would say this ultimately falls to the [edit: spec] author's discretion).

The issue you face here is that this is the type of thing where you start introducing lots of spagetthi in your code to try to adhere to arbitrary problems elsewhere, and then very quickly you have a piece of code that's completely unmaintainble. A prime example of this is our PDB parser, which has had to do a lot of workarounds for things like the CHARMM format variants, and ultimately has become an extremely slow & bloated reader.

Should this be a bug in cpptraj (again, please don't take this as a claim that it is), I would expect cpptraj itself to be able to offer a means of recovering "fixed" trajectories.

DrDomenicoMarson commented 1 year ago

I understand, I'll just use my patched version until CPPTRAJ's authors will "fix" it then!

drroe commented 1 year ago

Whilst the first half of this statement would make it seem that time is optional, my interpretation of the latter (and the statement above), is that time is a requirement when dealing with any type of temporal data. As such I would have expected cpptraj to write the time entry to any trajectory.

"Shall" here means that if there is simulation time data present, the variable should be defined. This is the case with molecular dynamics trajectory data, but may not always be the case; for example the time variable will not be defined if the input trajectories are a series of Mol2 and/or PDB files, which have no concept of "simulation time". Very old versions of cpptraj (old meaning from year 2016 and before) didn't always write the time variable properly, but from that time on if the incoming trajectory has time data, cpptraj should be defining the time variable in an output trajectory. There is also an action called time which can be used to set the time to something user-specified, which may be a workaround for @DrDomenicoMarson .

It shouldn't be too difficult to change the code to make the time variable optional since the NetCDF format is so orthogonal. Just do a nc_inq_varid for "time", and if it isn't found set the time variable ID to -1 and avoid trying to read it.

I really hope that CPPTRAJ will eventually adhere to AMBER's own conventions :-)

As far as I can tell in this instance and elsewhere, CPPTRAJ is adhering to the Amber NetCDF standard as written. If you still feel this is not the case please open an issue on the CPPTRAJ GitHub site so it stays on my radar. Thanks, and hope this helps.

IAlibay commented 1 year ago

Thanks so much for the explanation @drroe, this is super useful!

@DrDomenicoMarson this does mean that we should try to fix for this case. However, whilst the proposed solution you laid out in your initial comment isn't bad, it might not be completely the right answer either. Because these are non-temporal frames, setting the time to the frame id is probably not the ideal solution, instead we should consider setting everything to single value (probably -1) and also make dt handle that case. I think there's a mechanism in Timestep for this, but it would need a bit of digging.

On top of that we probably should just change how we derive the number of frames when frames isn't defined, as defaulting to times isn't safe. Probably the answer here is to just inspect the size of coordinates (which is also meant to be optional, but in the case of MDAnalysis the whole reader would fall apart if there were no coordinates to begin with).