MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.3k stars 647 forks source link

Time aware slicing #800

Open jbarnoud opened 8 years ago

jbarnoud commented 8 years ago

When I run an analysis, I often discard the beginning of the trajectory, or I run it on a portion of the trajectory only. I think of these portions of trajectories in term of simulation time rather than number of frames. I discard the first 100 ns or I run the analysis on the last 5 µs; I do not discard the X first frame or run on the last Y frames. Also, I run my analyses with one frame every ns, not one frame every N ones.

It would be good if AnalysisBase could take times as start, stop, and step. It would be even better if it could be done at the reader level.

We should be careful that dt may not be constant throughout a trajectory.

richardjgowers commented 8 years ago

You can do bool slicing, so maybe mask = [ts.time > 100 for ts in u.trajectory], for ts in u.trajectory[mask]. But that isn't very smart as you're only interested in the first time you see the inequality operator change (assuming time only increases).

Yeah could definitely add it to the Base, probably as a separate set of 3 keywords?

jbarnoud commented 8 years ago

I should try how slow/fast is the mask method. start_time, stop_time, and skip_time would be fine by me.

orbeckst commented 8 years ago

@dotsdl was thinking about some of these things a while back – just pinging him in case he wants to chime in.

Time-aware slicing would be great. Perhaps also related to general "unit awareness" #596 ?

dotsdl commented 8 years ago

I think having a time slicer (something like Universe.trajectory.bytime[2.5:7.6]) would be awesome, and we could take our cues from how pandas implements floating point indexes (not easy, but possible). However, for efficient implementation with our Readers this would require e.g. the XTC reader to also get the time for each frame in addition to the offsets when it builds them.

dotsdl commented 8 years ago

This is more than a convenience, too. Speaking for myself I typically store intermediate datasets as pandas DataFrames with the index being time, not frame. Being able to use this directly on the trajectory to go to frames of interest would be a big deal.

mnmelo commented 8 years ago

@dotsdl, I'm not so sure that storing XTC frame times is the best idea: some trajectories might have all frames with the same time, others might be the result of concatenation and have the same time period repeated.

Can't we just fallback on the same strategy that assigns a dt if none is provided to the Reader?

jbarnoud commented 8 years ago

On 23-03-16 18:28, mnmelo wrote:

@dotsdl https://github.com/dotsdl, I'm not so sure that storing XTC frame times is the best idea: some trajectories might have all frames with the same time, others might be the result of concatenation and have the same time period repeated.

How does Gromacs analysis tolls deal with such trajectories?

Can't we just fallback on the same strategy that assigns a |dt| if none is provided to the |Reader|?

How do we currently deal with non homogeneous dt?

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/800#issuecomment-200453767

dotsdl commented 8 years ago

some trajectories might have all frames with the same time, others might be the result of concatenation and have the same time period repeated.

@mnmelo yeah these are definitely sticky spots that I don't have good solutions for.

Can't we just fallback on the same strategy that assigns a dt if none is provided to the Reader?

Is that even necessary? I'm generally not a fan of guessing schemes, but I'm not even aware at the moment what we currently do. I guess it depends on how the Reader interprets the data?

mnmelo commented 8 years ago

@jbarnoud, @dotsdl, I had a brief contact with this aspect when writing the frame time for the ChainReader. Guessing based on initial frame time separation, and falling back to a default if the frames have the same time, isn't a terrible idea, as long as the behavior is well-documented.

Handling non-homogeneous dt is a lot more work. If we want to also go that way I vote for a Reader kwarg irregular_dt or similar, that'll prevent these routines from guessing dt and instead go through the frames' times. In that case I agree that storing times along with offsets becomes useful.

jbarnoud commented 8 years ago

Jumping from one frame to the other when the dt is homogeneous is just a matter of a division to know what the new frame should be. When the dt is not homogeneous, then things becomes more complicated.

It is rare that the dt changes, it is even rarer that the dt changes more than a few time throughout a trajectory. When going though the trajectory to build the offset, we could save when the dt changes. Then, when we jump to an other frame, if the target frame is still in the region with the same dt, we can do as if the dt was homogeneous, if the dt changes between the current frame and the target frame, we have to do as many division as we have different dt on the way (usually 2, maybe 3?).

It should not be to much of an overhead, and it would avoid assuming that the dt is homogeneous.

On 23-03-16 19:14, mnmelo wrote:

@jbarnoud https://github.com/jbarnoud, @dotsdl https://github.com/dotsdl, I had a brief contact with this aspect when writing the frame time for the |ChainReader|. Guessing based on initial frame time separation, and falling back to a default if the frames have the same time, isn't a terrible idea, as long as the behavior is well-documented.

Handling non-homogeneous |dt| is a lot more work. If we want to also go that way I vote for a |Reader| kwarg |irregular_dt| or similar, that'll prevent these routines from guessing |dt| and instead go through the frames' times. In that case I agree that storing times along with offsets becomes useful.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/800#issuecomment-200473347

mnmelo commented 8 years ago

That approach sounds good. What happens if a trajectory has multiple segments with overlapping times?

Note that this discussion is also valid for what should be returned when asking for trajectory.ts.time.

jbarnoud commented 8 years ago

On 23-03-16 19:29, mnmelo wrote:

That approach sounds good. What happens if a trajectory has multiple segments with overlapping times?

We should be able to identify those (dt becomes negative for one frame), but I do not know what to do in that case.

Note that this discussion is also valid for what should be returned when asking for |trajectory.ts.time|.

I would not try to outsmart the user. If the trajectory gives a time, we return that time. If the trajectory is odd, it is likely in purpose.

Also, having only one dt per reader bugs me. But it may be an other issue.

orbeckst commented 8 years ago

On 23 Mar, 2016, at 11:24, Jonathan Barnoud wrote:

When going though the trajectory to build the offset, we could save when the dt changes.

Why not just build all the times when building the offsets and keep the times with the offsets?

richardjgowers commented 8 years ago

@mnmelo I get that we could do fancy things with putting many trajectories together, but some basic attributes like that just need to report what was found in the data you provided.

As a way of iterating via time like the original question...

u = mda.Universe(top, [trj1, trj2])

actual_time = mda.guesstimate_time(u.trajectory)  # new method to do fancy guessing? or method of CHAIN

# OR

actual_time = u.trajectory.guess_continuous_time()

mask = actual_time > 10000 & actual_time < 250000

for ts in u.trajectory[mask]:
    # do stuff
jbarnoud commented 8 years ago

@orbeckst Keeping all the times would make the search for a given frame slower as it requires to go through all the time saved until we find the one we want (or to bisect until there). This is in O(nframes) or O(log(nframes)).

If you save the changes in dt in something like dt_changes = [(frame, time, dt), (frame, time, dt), ...] a given frame can be found with:

for frame, time, dt in dt_changes:
    if time <= target_time:
        last_dt = dt
        last_time = time
        last_frame = frame
    else:
        break
target_frame = (target_time - time) / last_dt + last_frame

The complexity of this is O(nchanges + 1). As the number of dt is likely to be very small, it is almost constant time.

EDIT: forgot to add the last frame in the snippet