This isn't intended as a 'defect'. Are special permissions required to file an
enhancement?
Original comment by dot...@gmail.com
on 28 Jan 2015 at 5:52
Not sure, I've changed it now.
I'm not an expert on the XTC/TRR formats, but the way I determined the number of frames in another format was to know the size of a single frame and then divide the file size by it to work out the number of frames. Would this not work for XTC/TRR?
Original comment by richardjgowers
on 28 Jan 2015 at 10:17
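For formats with fixed-size frames, the scheme Richard describes is essentially the following (a sketch only; `count_frames` and its arguments are illustrative, not MDAnalysis API):

```python
import os

def count_frames(trajfile, frame_size):
    """Estimate the number of frames in a fixed-frame-size trajectory.

    frame_size is the on-disk size of a single frame in bytes; this
    only works when every frame occupies the same number of bytes,
    which is exactly what fails for compressed XTC files.
    """
    file_size = os.path.getsize(trajfile)
    if file_size % frame_size != 0:
        raise ValueError("file size is not a multiple of the frame size")
    return file_size // frame_size
```

As the next comment points out, this breaks down for variable-size frames, which is why per-frame offsets are needed at all.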
I don't know that this would work, for XTCs at least, since they are compressed and so the frames are of variable size on disk. It was for this reason the indexing was implemented, AFAIK.
Original comment by dot...@gmail.com
on 28 Jan 2015 at 8:05
Ah ok, that's fine then. I think TRRs can have frames with/without velocities/forces, so they're variable size too.
This looks good to me anyway.
Original comment by richardjgowers
on 29 Jan 2015 at 9:43
I think this is a good idea. We had discussed it when the offsets were implemented (Issue 127), but at the time I thought it was maybe too much of a burden. One of the reasons for that was how to be sure the index and the trajectory matched. Checking the ctime seems a good approach, but I can see some cases (a copied directory without timestamp preservation, for instance) where it might fail.
Also, are we dealing with cases where the user doesn't have write access to the
current dir?
Finally, I thought of two other ways to further ascertain the correspondence of
index and trajectory that could be implemented alongside (or instead of) ctime:
1- store and compare the filesize, and
2- load the offsets and proceed to load the last frame. With XTCs/TRRs, if
anything has changed this is very likely to no longer be valid.
Original comment by manuel.n...@gmail.com
on 29 Jan 2015 at 2:45
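Manel's two extra checks could be sketched roughly like this (a minimal illustration, not the actual MDAnalysis code; `read_frame_at` is a hypothetical callable standing in for the reader's seek-and-read logic):

```python
import os

def offsets_look_valid(trajfile, stored_size, offsets, read_frame_at):
    """Heuristic staleness checks for a stored offset index.

    1) compare the stored file size against the current one on disk;
    2) try to seek to and read the last frame via the stored offsets.
    read_frame_at(trajfile, offset) is expected to raise if the offset
    does not point at a valid frame header.
    """
    if os.path.getsize(trajfile) != stored_size:
        return False
    try:
        read_frame_at(trajfile, offsets[-1])
    except Exception:
        return False
    return True
```

If either check fails, the caller simply discards the index and regenerates it, so a stale index can never crash the reader.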
In cases where timestamps aren't preserved (but so long as something can still be obtained with os.path.getctime()), the stored offsets will be ignored. Whatever new ones are then generated will overwrite the old ones.
I just added another commit to the branch that includes an exception to handle
this case. Basically, if writing the offsets fails it just moves on as if it
never tried to store them.
Since this feature comes entirely as a convenience, it should only work when it
can guarantee that the trajectory file hasn't changed since the offsets were
produced. Using ctime seemed the more conservative of the time-based checks one
could do, since using mtime would still allow the very small possibility of
another file with the same mtime replacing the trajectory. But we could
certainly add other criteria as well to make it safer.
Original comment by dot...@gmail.com
on 29 Jan 2015 at 5:19
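The write-failure handling David describes can be sketched like this (a minimal illustration, not the actual MDAnalysis code; pickle stands in here for whatever serialization format the offsets actually use):

```python
import pickle

def try_store_offsets(offset_file, offsets):
    """Best-effort persistence of the frame-offset list.

    If the directory is not writable (or the disk is full), carry on
    as if persistence had never been attempted: the offsets stay in
    memory for this session and are simply regenerated next time.
    """
    try:
        with open(offset_file, "wb") as f:
            pickle.dump(offsets, f)
    except (IOError, OSError):
        # No write access: the feature is a convenience, so failing
        # to store the index must never propagate to the user.
        pass
```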
"I just added another commit to the branch that includes an exception to handle
this case. Basically, if writing the offsets fails it just moves on as if it
never tried to store them."
"This case" meaning those in which the user doesn't have write access to the
directory of the trajectory.
Original comment by dot...@gmail.com
on 29 Jan 2015 at 5:21
Original comment by orbeckst
on 29 Jan 2015 at 5:31
David,
Generally +1 to the feature.
As you said, we need to make sure that it works conservatively, i.e. that there
isn't a chance that an unsuspecting user gets a crashing XTC reader under
normal operations. Therefore, I'd like to see additional checks along the lines
of what Manel suggests:
1) ctime (probably ok on Unix, might be iffy on Windows... if we ever get MDA
to work there...)
2) file size
3) try - load last frame - except
These tests should not incur much of a speed penalty.
The UnitTests should check at least four cases (for TRR and XTC):
1) trajectory without index file (as before)
2) trajectory plus appropriate index file
3) trajectory with WRONG index file
4) trajectory in directory without write permission (or disk full or any other
error related to processing the index)
Finally, there should also be docs on the feature that outline what you can and
cannot do with the index file (i.e. "How do I move trajectory and index?", "Can
I rename them?" etc)
Oliver
p.s.: David – you'll now be able to edit issues and the wiki
Original comment by orbeckst
on 29 Jan 2015 at 5:44
Instead of using ctime to tie trajectories to offset arrays, could you instead use md5 hashes? Or maybe an md5 hash of the first kilobyte (or whatever the smallest frame is?) if calculating them over huge files is slow?
This would allow you to move the two (trajectory and offsets) together between machines?
Original comment by richardjgowers
on 29 Jan 2015 at 6:47
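Richard's suggestion (a content-based fingerprint instead of ctime) might look roughly like this. It is a sketch only; note that hashing just the first kilobyte cannot detect changes past the hashed region, such as appended frames, which relates to the reservations in the next comment:

```python
import hashlib

def partial_md5(trajfile, nbytes=1024):
    """Fingerprint a trajectory by hashing its first nbytes bytes.

    Unlike ctime, this survives renames and copies between machines,
    at the cost of only detecting changes within the hashed region.
    """
    with open(trajfile, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()
```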
I would prefer avoiding the use of hashes, since I don't know how reliable
hashing any particular part of the trajectory would be in determining whether
or not the offsets are stale. As far as moving the offsets with the
trajectories goes, under the current implementation it will mean that the
offsets will be regenerated.
IMHO this is fine, since the use of persistent offsets is a convenience; if they get regenerated for something as silly as renaming the file, so be it. I'd rather they do this than suddenly break the TrjReader.
Perhaps as some kind of middle ground one could add a keyword argument to
``Universe.__init__`` that tells the universe to use any stored offsets it sees
for those trajectories without checking their ctime? Once loaded, the offsets
will be saved again with updated ctimes for each trajectory file.
Original comment by dot...@gmail.com
on 29 Jan 2015 at 7:00
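The middle ground David proposes could be sketched as follows (the keyword name `trust_stored` and the on-disk layout are hypothetical, and pickle stands in for whatever format the offsets are actually stored in):

```python
import os
import pickle

def get_offsets(trajfile, offset_file, trust_stored=False):
    """Return stored offsets, optionally skipping the ctime check.

    trust_stored is the hypothetical keyword described above: when
    True, any stored offsets are used as-is and then re-saved with
    the trajectory's current ctime, re-tying index to trajectory.
    """
    with open(offset_file, "rb") as f:
        data = pickle.load(f)  # {"ctime": float, "offsets": [...]}
    if not trust_stored and data["ctime"] != os.path.getctime(trajfile):
        return None  # stale: caller regenerates the offsets
    # Refresh the stored ctime so subsequent default loads succeed.
    data["ctime"] = os.path.getctime(trajfile)
    with open(offset_file, "wb") as f:
        pickle.dump(data, f)
    return data["offsets"]
```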
Original comment by orbeckst
on 3 Feb 2015 at 11:29
Hi all,
I'm currently rewriting the XTC/TRR modules to replace the current SWIG wrapping with a Cython-based solution (the point is that Cython 1. is cleaner and 2. works seamlessly with Python 2 and 3... but that's not the point here), and I am a bit surprised:
According to my tests, getting the offsets from an XTC file is less demanding than loading the first frame, and far, far faster than loading the coordinate file (e.g. the GRO file).
As an example, on my computer, getting the offsets from a ~1GB xtc file with
2500 frames takes 2 ms!
Actually, when running Universe(coords.gro, traj.xtc), loading the coords.gro
is by far (several orders of magnitude!) slower than anything else!
Here is what I get (out of 5 reps) with one of my trajectories, where the GRO file weighs 8MB (~12K atoms) and the XTC file is the one I mentioned previously:
1. Loading the gro file with "Universe(example.gro)" -> 38.1s
2. Loading the combo with "Universe(example.gro, example.xtc)" -> ~41.0s
3. Loading the offsets and loading the first frame using XTCReader from
"xdrfile.XTC" -> 15ms
4. Loading just the offsets with "libxdrfile2.read_xtc_numframes" -> 2ms
Am I just being lucky?
Séb
P.S.: I attached the script I used to get those results so you can check with your own trajectories too.
Original comment by sebastie...@gmail.com
on 4 Feb 2015 at 12:07
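The attached script itself isn't reproduced here, but a minimal best-of-N timing harness in the same spirit might look like this (what you pass as `func`, e.g. a closure that builds the Universe, is up to you):

```python
import time

def best_of(func, reps=5):
    """Return the best wall-clock time in seconds over reps calls,
    mirroring the 'out of 5 reps' numbers quoted above.

    Remember to flush the IO caches between reps if you are timing
    cold reads (see the caching caveat raised later in this thread).
    """
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return min(times)
```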
The Python-based PDB and GRO (and CRD) readers are a bottleneck. A major usability improvement would be a rewrite of those.
Original comment by orbeckst
on 4 Feb 2015 at 4:35
Séb and David,
@Séb: Can you open a new issue for the SWIG -> cython rewrite of the XDRReader
interface?
I think for 0.9.0 we stick with the SWIG one.
The main thing is then that you and David (dotsdl) should agree on what the
interface for automated offset loading should look like, i.e. is the current
python implementation ok or will this be moved to cython and if the latter,
would there be any changes to the public interface?
@David: Can you please describe the current interface (if there is any) to the
offset loading in a comment here?
Cheers,
Oliver
Original comment by orbeckst
on 4 Feb 2015 at 4:51
Would definitely be great to see performance improvement on reading GROMACS
file formats.
Original comment by tyler.je.reddy@gmail.com
on 4 Feb 2015 at 4:55
Séb,
Good idea for the rewrite. Just a couple of notes of warning here:
1- since indexing only looks at frame headers it is very easy for the OS to
entirely cache that part of the trajectory, and therefore get very short
indexing times after the first one. Be sure you flush your IO caches before
timing stuff.
2- The offset array is allocated on the C-side and must be properly exposed to
Python with correct reference count and garbage-collection destructor to free
the memory. I have no idea how much of a task it'll be to convert what is there
now in SWIG.
Manel
Original comment by manuel.n...@gmail.com
on 5 Feb 2015 at 9:04
I answered by email, but for some reason my replies didn't make it to the list...
@Oli
I opened 2 issues: one for the XDRReader and one for the GROReader.
I also think it is wiser (and safer) to wait for 1.0 to implement a 'Cythonified' solution.
@Manel
1. Good point, I didn't think about the IO cache... Yet, it doesn't change the face of the world: after a proper IO cache flush, XTC reading with XTCReader takes ~700ms instead of the 15ms I reported yesterday. This is still far from the almost 40s needed to load the .gro!
2. Memory management with Cython is really nice; this is one of the reasons why I like it so much!
@all
My question is more: is persistent storage of the offsets really needed or not? My feeling is that it is not, and that the real problem is that the current GRO file reading is just painfully slow. A fast (Cythonic?) reading routine would be great for the 1.0 version! (Hence new issue 212)
Cheers,
Séb
Original comment by sebastie...@gmail.com
on 5 Feb 2015 at 4:36
"According to my tests, getting the offsets from an XTC file is less demanding than loading the first frame, and far, far faster than loading the coordinate file (e.g. the GRO file)."
Séb, are these tests using your new Cython readers, or the existing code? Also, could you please post your updated testing script with IO cache flushing?
I suspect that the fast times you are seeing for XTC reading are due to the size of your trajectory. The trajectories I routinely work with have 10^5 to 10^6 frames (usually 1ps per frame). With loading on the order of ~700ms for a trajectory with 2,500 frames, this would give a loading time for 10^6 frames of ~4.6 minutes. But my trajectories usually have nearly 140,000 atoms (10x the size of your test system), so this may be a very low estimate of the time it actually takes.
Anecdotally, when loading a trajectory from many smaller XTC files, the process
of building indexes seems to take longer than if the trajectories were combined
into a single file. Using cached offsets has saved me a lot of time in my own
work with these large files.
When I have a hot minute I'll run your test script on one of my trajectories to
get some actual numbers to compare.
Cheers!
David
Original comment by dot...@gmail.com
on 6 Feb 2015 at 6:57
I have updated the offsets produced by MDAnalysis.coordinates.core.TrjReader to include additional checks to ensure stale offsets aren't loaded. These now include:
1) checking the ctime of the trajectory against that stored with the offsets
2) checking the file size of the trajectory against that stored with the offsets
3) upon satisfying conditions 1) and 2), the offsets are loaded, and the last
frame is accessed; if this throws an exception due to the offsets being wrong,
the offsets are discarded and regenerated
I have added updated unittests to check that these mechanisms are working as
expected; these tests check that:
1) offsets are automatically written upon generation when they don't already
exist
2) ctime stored matches that of the trajectory file
3) size stored matches the file size of the trajectory file
4) stored offsets are loaded for a universe when these criteria are satisfied
5) stored offsets are NOT loaded when ctimes mismatch
6) stored offsets are NOT loaded when stored size does not match file size
7) stored offsets are NOT loaded when the offsets are wrong, resulting in a
failure to seek to last frame of trajectory
8) offsets aren't stored when directory containing trajectory is read-only
I have also added a keyword argument to TrjReader that will ignore stored
offsets if used. Otherwise, this behavior is intended to be automatic. It
remains possible to save and load offsets manually, however, using the same
methods as before (TrjReader.save_offsets(<filename>) and
TrjReader.load_offsets(<filename>)).
The tests currently pass on my "feature-persistentoffsets" here:
https://code.google.com/r/dotsdl-mdanalysis/
Original comment by dot...@gmail.com
on 28 Feb 2015 at 8:29
Ok I created a feature branch on the repo for this, and merged David's branch
onto develop.
https://code.google.com/p/mdanalysis/source/detail?r=09b18ed324670ffe0dc2402011b7826cca009dbe&name=feature-persistentoffsets#
David - The merge was pretty messy (because of the pep8 changes that happened),
can you check I got everything 100%?
Original comment by richardjgowers
on 28 Feb 2015 at 2:43
I'm pulling the develop branch from https://code.google.com/p/mdanalysis , but
I'm not seeing the commit. In fact, I don't see anything since Feb. 19. Is this
the wrong repository?
Original comment by dot...@gmail.com
on 28 Feb 2015 at 8:51
Sorry, I meant that I merged your branch on top of develop in a new feature branch called 'feature-persistentoffsets'.
I also rebased and squashed everything, so all the changes should be in a single commit.
Original comment by richardjgowers
on 28 Feb 2015 at 8:56
Just checked, and it looks good (the tests also pass). In fact, I also found an error in the documentation for one of the methods (TrjReader.load_offsets) and fixed it in a new commit on my updated 'feature-persistentoffsets' branch. If it's not too much trouble to merge it in, it's there. Before committing I set my branch to match the 'feature-persistentoffsets' branch on the project's repo, so it should be painless.
Original comment by dot...@gmail.com
on 28 Feb 2015 at 9:41
This issue was closed by revision c7c417d00645.
Original comment by dot...@gmail.com
on 6 Mar 2015 at 12:26
Original issue reported on code.google.com by
dot...@gmail.com
on 28 Jan 2015 at 5:50