channotation / chap

CHAP is a tool for the functional annotation of ion channel structures:
http://www.channotation.org

Chap job crashes due to memory issues #8

Open eric-jm-lang opened 5 years ago

eric-jm-lang commented 5 years ago

Hi, I am trying to run chap on a 50,000-frame trajectory. However, the job dies after about 1,000 frames, I believe because it fills up all the RAM available on my computer: I can see the amount of memory used increase until everything is consumed.

I have to run it with -dt 100 for the calculation to finish successfully, which means that I lose a significant portion of my MD data.

Is this due to a memory leak or some other bug? Or does the program need to keep everything in memory? Is there a workaround for this problem? For example, is it possible to force writing to disk instead of keeping everything in memory?

Many thanks in advance.

Inniag commented 4 years ago

First of all, I should mention that analysing frames every 100 ps is likely not a big issue when determining time-averaged quantities over a long trajectory. In fact, you may even want to exclude frames that are only a short time apart in order to decorrelate the data. Still, the high memory demand for long trajectories is a problem I would like to solve, but it turns out to be complicated.

CHAP relies on libgromacs for parsing trajectories, which only reads one frame at a time (this is documented here). Thus, input trajectories are never kept in memory in their entirety. The problem therefore lies with the handling of output data. For this, CHAP uses the analysis data handling module provided by libgromacs. As far as I am aware, this module simply accumulates data over the entire trajectory and serialises it only at the end of the trajectory parsing process. I tried to work around this by manually serialising data after each frame (this is where the temporary output_stream.json file comes from), but I have not yet found a way to flush the AnalysisData container. Any help on this issue would be appreciated.

In terms of a workaround, you could run CHAP on individual trajectory chunks (first 10 ns, second 10 ns, etc.) and write a Python script to combine the output data. I would need to know which quantity you are after in order to judge how feasible this would be, but in principle, CHAP allows its users access to (nearly) all data produced internally.

eric-jm-lang commented 4 years ago

I appreciate your point regarding the decorrelation of data; however, in my case I have a region that is rarely hydrated, so in order to get sufficient data about its hydration I wanted to try using more than 1 frame every 100 ps.

I understand the problem, and I am afraid I wouldn't be able to help... One thing that is not clear to me, however, is that the amount of memory used far exceeds the size of the trajectory itself: the memory consumption is over 60 GB after a few thousand frames analysed, for a trajectory totalling 7 GB. What could be so large as to use so much memory? Could there be a memory leak somewhere?

Indeed, I thought about doing this, but I don't know how to combine the JSON-formatted output files. I am performing what I would describe as a standard analysis (radius profile, solvent number density profiles, minimum solvent density, etc.), so I am interested in getting the pathwayProfile, pathwayScalarTimeSeries and pathwayProfileTimeSeries types of data. How easy would it be to combine the JSON files?

Many thanks!

channotation commented 4 years ago

Combining the JSON files should be very straightforward if you have any prior experience in e.g. Python programming (similar scripting languages like R will work as well). A JSON file maps one-to-one to a nested structure of Python lists and dictionaries. All you'd need to do is load all CHAP output files (each derived from its own trajectory chunk), extract the relevant data (see the documentation for details), and paste them together. Forming an average should then be straightforward with e.g. numpy.
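A minimal sketch of what this could look like (the chunk file names and the "radiusMean" key are placeholders here; check the documentation for the exact field names in your output):

```python
import json
import numpy as np

# Hypothetical output file names, one CHAP output file per trajectory chunk.
chunk_files = ["chunk1_output.json", "chunk2_output.json"]

# Each CHAP output file maps directly onto nested Python dicts and lists.
chunks = []
for fname in chunk_files:
    with open(fname) as f:
        chunks.append(json.load(f))

# Example: pull the time-averaged radius profile out of each chunk
# ("radiusMean" is only an example key; see the documentation for the
# exact field names).
profiles = [np.array(c["pathwayProfile"]["radiusMean"]) for c in chunks]

# If all chunks use the same s-grid, a simple mean over chunks combines them;
# weight by the number of frames per chunk if the chunks differ in length.
combined = np.mean(profiles, axis=0)
```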

eric-jm-lang commented 4 years ago

Thanks. I use Python. So I am trying to slice up my trajectory, run chap on trajectory chunks, and then recombine them. However, I am having a problem: the pathway is not always defined in the same way for each chunk (depending on the first frame, I suppose). Here is an example: I have two chunks analysed with CHAP, which returned two JSON files. After loading the data (as data1 and data2), here are the results for the "s" array:

np.array(data1["pathwayProfileTimeSeries"]["s"])

returns: array([-4.34701633, -4.33732796, -4.32763958, ..., 5.31229877, 5.32198715, 5.33167553])

whereas np.array(data2["pathwayProfileTimeSeries"]["s"])

returns array([-4.38113976, -4.3716259 , -4.36211157, ..., 5.10431004, 5.11382389, 5.12333775])

How can this discrepancy in the pathway definition be resolved? Is there a way to tell chap to use e.g. a pdb file to define the pathway? Or to specify a file containing the array of (unique) values of np.array(data1["pathwayProfileTimeSeries"]["s"]) as an input to CHAP?

Many thanks

Inniag commented 4 years ago

TLDR: You need to use interpolation to ensure that the s-coordinate is the same for all trajectory chunks. NumPy already provides a function for this: https://docs.scipy.org/doc/numpy/reference/generated/numpy.interp.html

Here's why: CHAP always writes 1000 data points for each profile (reducing this number to something like 10 points with the -out-num-points flag may help you develop your script), spanning the entire range between the two openings of the pore. Since the pore may have slightly different lengths in different frames, this means that the spacing of these points along the centre line (Delta s) is not equal across frames. The alternative would be to always use the same Delta s, but that would mean a different number of points in each frame, which would make post-processing of the data even more complex (as it would lead to different array dimensions from a Python point of view).
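As a rough sketch, here is how the resampling could look for the time-averaged radius profile (the "pathwayProfile" and "radiusMean" keys are used here only as examples; the same per-frame interpolation applies to the pathwayProfileTimeSeries data):

```python
import json
import numpy as np

# Load the two chunk outputs (file names are placeholders).
with open("chunk1_output.json") as f:
    data1 = json.load(f)
with open("chunk2_output.json") as f:
    data2 = json.load(f)

# Time-averaged profiles from each chunk; "radiusMean" is just an example
# key, the same approach works for any per-profile quantity.
s1 = np.array(data1["pathwayProfile"]["s"])
r1 = np.array(data1["pathwayProfile"]["radiusMean"])
s2 = np.array(data2["pathwayProfile"]["s"])
r2 = np.array(data2["pathwayProfile"]["radiusMean"])

# Common s-grid restricted to the range where both profiles are defined.
s_common = np.linspace(max(s1.min(), s2.min()),
                       min(s1.max(), s2.max()),
                       1000)

# Resample both profiles onto the common grid (np.interp expects the
# x-coordinates to be increasing, which the s values are) and average.
r1_common = np.interp(s_common, s1, r1)
r2_common = np.interp(s_common, s2, r2)
r_combined = 0.5 * (r1_common + r2_common)
```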

One more comment: There is no need to create trajectory chunks. CHAP can selectively analyse only a specific time range using the flags -b and -e (both in picoseconds). That way you don't need to create trajectory chunk files that might be quite storage intensive.
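If it helps, a rough sketch of driving such a loop over time windows from Python could look like the following; the -f/-s input flags and the file names are placeholders for whatever your actual CHAP command line looks like:

```python
import os
import subprocess

# Analyse a long trajectory in 10 ns (10,000 ps) windows via the -b/-e flags
# instead of writing separate chunk trajectory files.
chunk_length = 10_000   # window length in ps
total_length = 100_000  # total trajectory length in ps

for i, start in enumerate(range(0, total_length, chunk_length)):
    end = start + chunk_length
    # Run each window in its own directory so that the output files of the
    # individual runs do not overwrite each other.
    workdir = f"chunk_{i:02d}"
    os.makedirs(workdir, exist_ok=True)
    subprocess.run(
        ["chap", "-f", "../traj.xtc", "-s", "../topol.tpr",  # placeholder inputs
         "-b", str(start), "-e", str(end)],
        cwd=workdir,
        check=True,
    )
```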

eric-jm-lang commented 4 years ago

Thanks for the suggestion and for the tip about specifying a time range. I am afraid, however, that I fail to understand how to use numpy.interp in this case. Have you used this kind of script for this purpose before? Do you have any examples? Alternatively, I was thinking of adding the same first frame at the beginning of each chunk of the trajectory, so the profile would always be the same, and then removing those frames based on their indexes in the numpy arrays... It sounds convoluted, but at this stage it looks more straightforward to me than using interpolation.

eric-jm-lang commented 4 years ago

Also, the amount of RAM currently required to process my trajectory every 100 ps seems to increase linearly with the number of frames. I expect that if I wanted to process my full 19 GB trajectory (corresponding to a 1.5 µs trajectory with frames saved every 10 ps), I would need more than 1 TB of RAM! I am pretty sure that I have used programs that rely on libgromacs before, but I have never encountered one that requires such a large amount of RAM. Do you think that something during the analysis is saved in memory and not cleared after being used? Or that something is saved multiple times in memory?