djay0529 / mdanalysis

Automatically exported from code.google.com/p/mdanalysis

Persistent offsets for xdrfiles (XTC and TRR) #208

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Code submission: XTC and TRR files now write offsets to disk.

It is a common problem that if one has large XTC or TRR files, it can take a 
long time to get the number of frames when loading these as coordinates of a 
universe. Getting the number of frames is necessary for the chain reader 
upfront, so when loading multiple XTC or TRR files this may mean a substantial 
wait (perhaps many minutes) before the universe is ready for use.

When the number of frames is obtained, an index of the trajectory file giving 
the offsets of its frames is also stored. This allows for quick random access 
through the trajectory thereafter. Until now, this index was lost when the 
Python session ended. With this change, the offsets are stored as (small) 
hidden pickle files in the same directory as the trajectories, with names 
``.<nameoftrajectory>_offsets.pkl``.

When a universe is loaded with XTC or TRR files, the TrjReader tries to load 
these offsets. If they are present, it then checks the ctime of the trajectory 
file against that stored in the offset file. If they match, the offsets are 
applied. Otherwise, they are ignored and eventually overwritten when numframes 
is called.
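The scheme described above can be sketched roughly as follows. This is a minimal illustration, not the actual TrjReader API: the function names (`save_offsets`, `load_offsets`, `offsets_filename`) and the exact pickle layout are assumptions made for the example; only the hidden-file naming and the ctime check come from the description.

```python
# Illustrative sketch of persistent offsets with a ctime staleness check.
# Names and pickle layout are hypothetical, not the MDAnalysis interface.
import os
import pickle

def offsets_filename(trajectory_path):
    """Hidden pickle next to the trajectory: .<nameoftrajectory>_offsets.pkl"""
    head, tail = os.path.split(trajectory_path)
    return os.path.join(head, ".{0}_offsets.pkl".format(tail))

def save_offsets(trajectory_path, offsets):
    """Store the frame offsets together with the trajectory's ctime."""
    data = {"ctime": os.path.getctime(trajectory_path), "offsets": offsets}
    with open(offsets_filename(trajectory_path), "wb") as f:
        pickle.dump(data, f)

def load_offsets(trajectory_path):
    """Return stored offsets, or None if absent or stale (ctime mismatch)."""
    try:
        with open(offsets_filename(trajectory_path), "rb") as f:
            data = pickle.load(f)
    except (IOError, OSError):
        return None  # no stored offsets; caller regenerates them
    if data["ctime"] != os.path.getctime(trajectory_path):
        return None  # trajectory changed since indexing; ignore offsets
    return data["offsets"]
```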

The commits can be found in my repository here: 
``https://code.google.com/r/dotsdl-mdanalysis/``

The branch is called ``feature-persistentoffsets``

I have also updated the xdrfile tests to reflect the changes.

Original issue reported on code.google.com by dot...@gmail.com on 28 Jan 2015 at 5:50

GoogleCodeExporter commented 9 years ago
This isn't intended as a 'defect'. Are special permissions required to file an 
enhancement?

Original comment by dot...@gmail.com on 28 Jan 2015 at 5:52

GoogleCodeExporter commented 9 years ago
Not sure, I've changed it now.

I'm not an expert on the XTC/TRR formats, but the way I got the number of 
frames in another format was to take the size of a single frame and divide the 
filesize by it. Would this not work for XTC/TRR?
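The filesize-division approach can be sketched like this. It is illustrative only and assumes a fixed-record format where every frame occupies the same number of bytes on disk; the header and frame sizes here are made-up numbers, and (as discussed in this thread) the approach cannot work for compressed XTC files.

```python
# Illustrative only: number-of-frames for a hypothetical fixed-record
# trajectory format. Does NOT apply to XTC/TRR, whose frames vary in size.
import os

def n_frames_fixed(path, header_size, frame_size):
    """Frames = (filesize - header) / bytes-per-frame, for constant frames."""
    payload = os.path.getsize(path) - header_size
    if payload % frame_size != 0:
        raise ValueError("file size inconsistent with fixed frame size")
    return payload // frame_size
```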

Original comment by richardjgowers on 28 Jan 2015 at 10:17

GoogleCodeExporter commented 9 years ago
I don't know that this would work for XTCs at least since they are compressed, 
and so the frames are of variable size on disk. It was for this reason the 
indexing was implemented, AFAIK.

Original comment by dot...@gmail.com on 28 Jan 2015 at 8:05

GoogleCodeExporter commented 9 years ago
Ah ok, that's fine then.  I think TRRs can have frames with/without 
velocities/forces too, so they're variable size as well.

This looks good to me anyway.

Original comment by richardjgowers on 29 Jan 2015 at 9:43

GoogleCodeExporter commented 9 years ago
I think this is a good idea. We had discussed it when the offsets were 
implemented (Issue 127) but I thought then it was maybe too much of a burden. 
One of the reasons for that was how to be sure the index and the trajectory 
matched. Checking the ctime seems a good approach but I can see some cases (a 
copied directory without timestamp preservation, for instance) where it might 
fail.

Also, are we dealing with cases where the user doesn't have write access to the 
current dir?

Finally, I thought of two other ways to further ascertain the correspondence of 
index and trajectory that could be implemented alongside (or instead of) ctime:
1- store and compare the filesize, and
2- load the offsets and proceed to load the last frame. With XTCs/TRRs, if 
anything has changed this is very likely to no longer be valid.

Original comment by manuel.n...@gmail.com on 29 Jan 2015 at 2:45

GoogleCodeExporter commented 9 years ago
In cases where timestamps aren't preserved (so long as something can still be 
obtained with os.path.getctime()), the stored offsets will be ignored, and 
whatever new ones are then generated will overwrite the old ones.

I just added another commit to the branch that includes an exception to handle 
this case. Basically, if writing the offsets fails it just moves on as if it 
never tried to store them.

Since this feature comes entirely as a convenience, it should only work when it 
can guarantee that the trajectory file hasn't changed since the offsets were 
produced. Using ctime seemed the more conservative of the time-based checks one 
could do, since using mtime would still allow the very small possibility of 
another file with the same mtime replacing the trajectory. But we could 
certainly add other criteria as well to make it safer.
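The best-effort write described in the commit above can be sketched as follows. This is an assumption about the shape of the fix, not the actual code: the function name and arguments are hypothetical; only the behavior (swallow the write failure and move on) is taken from the comment.

```python
# Illustrative sketch: persisting offsets is a convenience, so a failed
# write (read-only directory, full disk) is silently ignored.
import pickle

def try_save_offsets(offsets_path, offsets, ctime):
    """Best-effort write; on failure, proceed as if we never tried."""
    try:
        with open(offsets_path, "wb") as f:
            pickle.dump({"ctime": ctime, "offsets": offsets}, f)
    except (IOError, OSError):
        pass  # e.g. no write access to the trajectory's directory
```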

Original comment by dot...@gmail.com on 29 Jan 2015 at 5:19

GoogleCodeExporter commented 9 years ago
"I just added another commit to the branch that includes an exception to handle 
this case. Basically, if writing the offsets fails it just moves on as if it 
never tried to store them."

"This case" meaning those in which the user doesn't have write access to the 
directory of the trajectory.

Original comment by dot...@gmail.com on 29 Jan 2015 at 5:21

GoogleCodeExporter commented 9 years ago

Original comment by orbeckst on 29 Jan 2015 at 5:31

GoogleCodeExporter commented 9 years ago
David,

Generally +1 to the feature.

As you said, we need to make sure that it works conservatively, i.e. that there 
isn't a chance that an unsuspecting user gets a crashing XTC reader under 
normal operations. Therefore, I'd like to see additional checks along the lines 
of what Manel suggests:

1) ctime (probably ok on Unix, might be iffy on Windows... if we ever get MDA 
to work there...)
2) file size
3) try - load last frame - except

These tests should not incur much of a speed penalty.

The UnitTests should check at least four cases (for TRR and XTC):

1) trajectory without index file (as before)
2) trajectory plus appropriate index file
3) trajectory with WRONG index file
4) trajectory in directory without write permission (or disk full or any other 
error related to processing the index)

Finally, there should also be docs on the feature that outline what you can and 
cannot do with the index file (i.e. "How do I move trajectory and index?", "Can 
I rename them?" etc)

Oliver

p.s.: David – you'll now be able to edit issues and the wiki

Original comment by orbeckst on 29 Jan 2015 at 5:44

GoogleCodeExporter commented 9 years ago
Instead of using ctime to tie trajectories to offset arrays, could you instead 
use md5 hashes?  Or maybe an md5 hash of the first kilobyte (or whatever the 
smallest frame is?) if calculating them over huge files is slow.

This would allow you to move the two together (trj and offsets) between 
machines?
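Richard's suggestion (which, per the following comment, was not adopted in the patch) could look something like this. The function name and the 1 KB cutoff are illustrative assumptions; the idea is just to fingerprint a prefix of the trajectory cheaply instead of hashing the whole file.

```python
# Illustrative sketch: fingerprint the start of the trajectory so the
# trajectory and its offsets file could be moved between machines together.
import hashlib

def head_md5(path, nbytes=1024):
    """MD5 of the first `nbytes` of the file (cheap even for huge files)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()
```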

Original comment by richardjgowers on 29 Jan 2015 at 6:47

GoogleCodeExporter commented 9 years ago
I would prefer avoiding the use of hashes, since I don't know how reliable 
hashing any particular part of the trajectory would be in determining whether 
or not the offsets are stale. As far as moving the offsets with the 
trajectories goes, under the current implementation it will mean that the 
offsets will be regenerated.

IMHO this is fine, since the use of persistent offsets is a convenience, so if 
they get regenerated for something as silly as renaming the file so be it. I'd 
rather they do this than suddenly break the TrjReader.

Perhaps as some kind of middle ground one could add a keyword argument to 
``Universe.__init__`` that tells the universe to use any stored offsets it sees 
for those trajectories without checking their ctime? Once loaded, the offsets 
will be saved again with updated ctimes for each trajectory file.

Original comment by dot...@gmail.com on 29 Jan 2015 at 7:00

GoogleCodeExporter commented 9 years ago

Original comment by orbeckst on 3 Feb 2015 at 11:29

GoogleCodeExporter commented 9 years ago
Hi all,

I'm currently rewriting the XTC/TRR modules to replace the current SWIG 
wrapping with a Cython-based solution (the point is that Cython 1. is cleaner 
and 2. works seamlessly with Python 2 and 3... but that's not the point here), 
and I am a bit surprised: according to my tests, getting the offsets from an 
XTC file is less demanding than loading the first frame, and far, far faster 
than loading the coordinates file (e.g. the GRO file).
As an example, on my computer, getting the offsets from a ~1 GB XTC file with 
2500 frames takes 2 ms!
Actually, when running Universe(coords.gro, traj.xtc), loading the coords.gro 
is by far (several orders of magnitude!) slower than anything else!
Here are what I get (out of 5 reps) with one of my trajectories where the gro 
file weighs 8MB (~12K atoms) and the xtc file is the one I mentioned previously:
1. Loading the gro file with "Universe(example.gro)" -> 38.1s
2. Loading the combo with "Universe(example.gro, example.xtc)" -> ~41.0s
3. Loading the offsets and loading the first frame using XTCReader from 
"xdrfile.XTC" -> 15ms
4. Loading just the offsets with "libxdrfile2.read_xtc_numframes" -> 2ms

Am I just being lucky?

Séb

P.S: I attached the script I used to get those results so you can check too 
with your trajectories

Original comment by sebastie...@gmail.com on 4 Feb 2015 at 12:07

GoogleCodeExporter commented 9 years ago
The Python-based PDB and GRO (and CRD) readers are a bottleneck. A major 
improvement in usability would be a rewrite of those.

Original comment by orbeckst on 4 Feb 2015 at 4:35

GoogleCodeExporter commented 9 years ago
Séb and David,

@Séb: Can you open a new issue for the SWIG -> cython rewrite of the XDRReader 
interface?

I think for 0.9.0 we stick with the SWIG one. 

The main thing is then that you and David (dotsdl) should agree on what the 
interface for automated offset loading should look like, i.e. is the current 
python implementation ok or will this be moved to cython and if the latter, 
would there be any changes to the public interface?

@David: Can you please describe the current interface (if there is any) to the 
offset loading in a comment here?

Cheers,
Oliver

Original comment by orbeckst on 4 Feb 2015 at 4:51

GoogleCodeExporter commented 9 years ago
Would definitely be great to see performance improvement on reading GROMACS 
file formats.

Original comment by tyler.je.reddy@gmail.com on 4 Feb 2015 at 4:55

GoogleCodeExporter commented 9 years ago
Séb,

Good idea for the rewrite. Just a couple of notes of warning here:

1- since indexing only looks at frame headers it is very easy for the OS to 
entirely cache that part of the trajectory, and therefore get very short 
indexing times after the first one. Be sure you flush your IO caches before 
timing stuff.

2- The offset array is allocated on the C-side and must be properly exposed to 
Python with correct reference count and garbage-collection destructor to free 
the memory. I have no idea how much of a task it'll be to convert what is there 
now in SWIG.

Manel

Original comment by manuel.n...@gmail.com on 5 Feb 2015 at 9:04

GoogleCodeExporter commented 9 years ago
I answered by email but for some reason, my replies didn't make it to the list...

@Oli
I opened 2 issues: one for the XDRReader and one for the GROReader.

I also think it is wiser (and safer) to wait for 1.0 to implement a 
'Cythonified' solution

@Manel
1. Good point, I didn't think about the IO cache... Yet, it doesn't change the 
overall picture:

After a proper IO cache flush, XTC reading with XTCReader takes ~700 ms instead 
of the 15 ms I reported yesterday. This is still far from the almost 40 s 
needed to load the .gro!

2. Memory management with Cython is really nice; this is one of the reasons why 
I like it so much!

@all
My question is more: is persistent storage of the offsets really needed or not?
My feeling is that it is not, but it's more like the current GRO file reading 
is just painfully slow. A fast (cythonic?) reading routine would be great for 
the 1.0 version! (Hence new issue 212)

Cheers,

Séb

Original comment by sebastie...@gmail.com on 5 Feb 2015 at 4:36

GoogleCodeExporter commented 9 years ago
"According to my tests, getting the offsets from a xtc file is less demanding 
than loading the first frame and far far faster than loading the coordinates 
file (e.g. the gro file)"

Séb, are these tests using your new Cython readers, or with the existing code? 
Also, could you please post your updated testing script with io cache flushing?

I suspect that the fast times you are seeing for XTC reading are due to the 
size of your trajectory. The trajectories I routinely work with have 10^5 to 
10^6 frames (usually 1ps per frame). With loading on the order of ~700ms for a 
trajectory with 2,500 frames, this would give a loading time for 10^6 frames of 
~4.6 minutes. But my trajectories usually have nearly 140,000 atoms (10x your 
test system's size), so this may be a very low estimate on the time it actually 
takes.

Anecdotally, when loading a trajectory from many smaller XTC files, the process 
of building indexes seems to take longer than if the trajectories were combined 
into a single file. Using cached offsets has saved me a lot of time in my own 
work with these large files.

When I have a hot minute I'll run your test script on one of my trajectories to 
get some actual numbers to compare.

Cheers!

David

Original comment by dot...@gmail.com on 6 Feb 2015 at 6:57

GoogleCodeExporter commented 9 years ago
I have updated the offsets produced by MDAnalysis.coordinates.core.TrjReader to 
include additional checks to ensure stale offsets aren't loaded. These now 
include

1) checking the ctime of the trajectory against that stored with the offsets
2) checking the file size of the trajectory against that stored with the offsets
3) upon satisfying conditions 1) and 2), the offsets are loaded, and the last 
frame is accessed; if this throws an exception due to the offsets being wrong, 
the offsets are discarded and regenerated
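The three checks above can be sketched roughly as follows. This is an illustrative shape only: the function name, the `stored` dict layout, and the `seek_last_frame` callable are assumptions for the example; the real logic lives in MDAnalysis.coordinates.core.TrjReader.

```python
# Illustrative sketch of the three staleness checks on stored offsets.
# `stored` is assumed to carry 'ctime', 'size', and 'offsets';
# `seek_last_frame` is any callable that raises if an offset is invalid.
import os

def offsets_are_usable(traj_path, stored, seek_last_frame):
    # 1) ctime of the trajectory must match the stored ctime
    if stored["ctime"] != os.path.getctime(traj_path):
        return False
    # 2) file size of the trajectory must match the stored size
    if stored["size"] != os.path.getsize(traj_path):
        return False
    # 3) the last frame must actually be reachable via the offsets
    try:
        seek_last_frame(stored["offsets"][-1])
    except Exception:
        return False  # wrong offsets: discard and regenerate
    return True
```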

I have added updated unittests to check that these mechanisms are working as 
expected; these tests check that:

1) offsets are automatically written upon generation when they don't already 
exist
2) ctime stored matches that of the trajectory file
3) size stored matches the file size of the trajectory file
4) stored offsets are loaded for a universe when these criteria are satisfied
5) stored offsets are NOT loaded when ctimes mismatch
6) stored offsets are NOT loaded when stored size does not match file size
7) stored offsets are NOT loaded when the offsets are wrong, resulting in a 
failure to seek to last frame of trajectory
8) offsets aren't stored when directory containing trajectory is read-only

I have also added a keyword argument to TrjReader that will ignore stored 
offsets if used. Otherwise, this behavior is intended to be automatic. It 
remains possible to save and load offsets manually, however, using the same 
methods as before (TrjReader.save_offsets(<filename>) and 
TrjReader.load_offsets(<filename>)).

The tests currently pass on my "feature-persistentoffsets" here: 
https://code.google.com/r/dotsdl-mdanalysis/

Original comment by dot...@gmail.com on 28 Feb 2015 at 8:29

GoogleCodeExporter commented 9 years ago
Ok I created a feature branch on the repo for this, and merged David's branch 
onto develop.

https://code.google.com/p/mdanalysis/source/detail?r=09b18ed324670ffe0dc2402011b7826cca009dbe&name=feature-persistentoffsets#

David - The merge was pretty messy (because of the pep8 changes that happened), 
can you check I got everything 100%?

Original comment by richardjgowers on 28 Feb 2015 at 2:43

GoogleCodeExporter commented 9 years ago
I'm pulling the develop branch from https://code.google.com/p/mdanalysis , but 
I'm not seeing the commit. In fact, I don't see anything since Feb. 19. Is this 
the wrong repository?

Original comment by dot...@gmail.com on 28 Feb 2015 at 8:51

GoogleCodeExporter commented 9 years ago
Sorry, I meant that I merged your branch on top of develop in a new feature 
branch called 'feature-persistentoffsets'

I rebased and squashed all the commits into a single commit too, so all the 
changes should be in a single commit.

Original comment by richardjgowers on 28 Feb 2015 at 8:56

GoogleCodeExporter commented 9 years ago
Just checked, and it looks good (tests also work). In fact, I also found an 
error in the documentation for one of the methods (TrjReader.load_offsets) and 
fixed it in a new commit to my updated 'feature-persistentoffsets' branch. If 
it's not too much trouble to merge it in, it's there. Before committing I set 
the branch to match the 'feature-persistentoffsets' branch on the project's 
repo, so it should be painless.

Original comment by dot...@gmail.com on 28 Feb 2015 at 9:41

GoogleCodeExporter commented 9 years ago
This issue was closed by revision c7c417d00645.

Original comment by dot...@gmail.com on 6 Mar 2015 at 12:26