Closed mmfarrugia closed 1 year ago
It looks to me like what you're doing is pretty reasonable.
I have a few disparate thoughts:
I'm pretty sure (@tromo should chime in to confirm or tell me I'm an idiot) that svd doesn't have a size limit per se, other than what your machine's memory can hold. Moreover, I don't see the message where it's estimating how much memory it's going to need, which makes me think the failure point is in the call to extractCoords.
That error message looks like what you'd get if the prmtop and trajectory files weren't the same size, rather than an out of memory error.
The selection '!hydrogen' is going to grab all of the water oxygens, unless you've already removed them, which would make it untenably huge really fast. If you've stripped the waters out of the prmtops but haven't made new trajectories, that would do it. Setting that aside, doing PCA on all heavy atoms (including the side chains) is often not that helpful, because when there are multiple conformations for the side chains the averages (and thus the modes) can look really weird. Obviously, you know what you're trying to do better than I do, but you could consider using 'backbone'
or 'name == "CA" || name == "CB"'
Assuming I'm wrong and it is using too much memory (in which case we need to make the error messages MUCH clearer), you might be able to fix the problem by using fewer frames, eg by changing the range specification to '-r 0:100:10000'
. Similarly, using a smaller selection (like backbone, or 'backbone || name == "CB"'
) would help, since IIRC the memory cost is cubic in the number of atoms.
One last thing: I note from the output you showed that you're not requesting an alignment first. That means the code assumes you've already aligned the trajectories. If that's NOT the case, svd will produce nonsense. If you've already pre-aligned the trajectories, great, but if not you'll need to add something like -A 'name=="CA"'
to the command line. That has nothing to do with what's going wrong for now, but I figured it couldn't hurt to mention it.
Hi, Alan's right...there is no limit other than the amount of memory you have. The svd
tool will spit out an estimate of the memory that it will require. You can check that to see if you need to consider switching to big-svd
.
You are not reaching the part where svd
can estimate the memory usage because it is failing while trying to build the source matrix. The error is a bad atom index which means that it's trying to find an atom that doesn't exist. This could be because of a mismatch between the model and trajectory files. I'd double check the files and make sure they all have the same number of atoms and in the same positions. It's also possible that something went wrong with the -S
option. If you don't specify it, does it work? If it does, then try -S backbone
and see if that works.
Unfortunately, there is no way to get big-svd
to process multiple trajectories. big-svd
was really mean to be a specialized tool for cases where the memory demands of svd
was too much, and it requires you to do all of the pre-processing by hand (i.e. aligning and concatenating trajectories).
Hi Mikaela, another couple of suggestions for diagnosing the issue... Try running svd
on a single trajectory with verbose output, i.e. -v 2
and your selection, -S '!hydrogen'
, and don't worry about the auto-split options. If that works, then try running it on two trajectories. If that works, then I'm inclined to think it's a problem with one (or more) of the trajectories. If it doesn't work, try repeating the same trajectory twice on the command line, i.e. svd -p test -S '!hydrogen' foo.pdb foo.dcd foo.dcd
.
You can use subsetter
to combine the trajectories and extract the svd selection. Again, add -v 2
to get verbose output. See if that works and, if it does, inspect the trajectory to make sure everything looks ok. You can also try running svd
on this combined trajectory to see if it works (or doesn't).
There is an assumption by these tools that all trajectories have exactly the same model. Each trajectory must have the exact same number of atoms and they must be in the same order.
If everything checks out, but svd
is still failing, then we'll need to investigate further. At that point, a minimal test system that demonstrates the issue that you'd be willing to share would be very helpful. We'll see if we can figure out what's going on...
Thank you for an incredibly quick response.
My problem is resolved by using the backbone or CB selection, but if this error is of concern then I'd be happy to send along further info or files from the troubleshooting you mentioned to clarify if it is a code issue. However, I do notice upon getting it to work for multiple trajectories that it seems to only outputting the V files for only the first 2/20 of my trajectories (assuming globbed in the typical order) so I will be using the verbosity and the aforementioned strategies to troubleshoot this and make sure that my SVD is being done on all 20 files.
Usually with github issues I don't expect to hear back for at least a week but this was lightning fast. Thank you again for such a speedy reply!
Glad we were helpful, though I'm not going to close this until we resolve the problem only outputting some _V files for you -- that should work correctly.
If you could figure out a minimal example that triggers one or both problems, that would let us try to figure out what's going wrong for you. It might be make sense to make downsampled versions of 5 of the trajectories with like 5 frames each, and send that to us if it triggers the problem. If that's still too big for GitHub to let you upload, let us know and I'll set up some other mechanism of transfer.
--Alan
One more question/comment. In your reply, you mention you have different topologies for the enzyme from different parts of the functional cycle. LOOS assumes that every frame in the trajectory (and the multiple trajectories are handled as if they were one) has the same number of atoms. If you have trajectories with differing numbers of atoms, weird things can happen. If there's a deletion, you could end up with a frame shift causing the calculation to be erroneous. If there's an insertion (so the trajectory has more atoms than the model), we think you could get the problem you're seeing (though we're having trouble tracking down exactly where it would happen).
If everything is working for you, cool. If you can find a subset of files sufficient to trigger the problem, we'll finish diagnosing it.
Cheers,
Alan
In this case, I'm only using trajectories from one part of the reaction. I only run into that issue when I am doing reference-based calculations because my reference structure is the ground state structure. I'm pretty certain that there are no topology issues because the same trajectories and topology pair works fine for my rmsd calculations via cpptraj.
I haven't had the chance to pare it down to a useful test case, but I will do that tomorrow morning and upload the files. However, I did do the same commands with the verbosity set to 2, and it certainly seems to be loading in the trajectories correctly:
svd: SKIPPING ALIGNMENT svd: Extracting coordinates... svd: Allocating estimated 2.177 GB for 14034 x 5001 SVD svd: SVD requests 25464090 extra space for a grand total of 2.36643 GB svd: Calculating SVD... svd: Done! Calculation took 45m54s svd: Writing results... svd: done!
I will also triple check the atom ordering as you suggested because I do know that amber sometimes reorders my atoms when it tries to be a little too helpful since I have custom atom types and residues.
Thank you again for your help and I will send over those files ASAP via github.
I appreciate it, Mikaela -- at this point, you're helping us more than we're helping you.
Cheers,
Alan
Oh no worries, it shouldn't take me long and I do like how quick the loos library is to use so far.
I double checked the atom order of the topology vs trajs and according to cpptraj no atom reordering is needed.
When reducing it to only 4 full-length trajectories (10,000 frames, only looking at every other, using the same range as before) it still only output the first two trajectories' V matrices. However, reducing those same 4 trajectories to 5 frames for the test case seems to have resulted in svd outputting all 4 V matrices as opposed to just the first two in the glob. There seems to be no difference in the verbose output between these different cases (other than what you would expect with the initial inventory of input since the input is of course different). I tried using the range argument to reduce the number of frames to equal that of the clipped trajectories which worked ^, but it seems that this results in the same issue of only the first two trajectories in the glob outputting V matrices. Thus, I can't really upload a test case for it failing. I'd need to test a bunch of different trajectories in order to find a reproducible case to upload and it seems like, by nature of it not working, I likely won't be able to upload it.
I do think that if you increase the length of the trajectories input then you should see this same behavior though as I think I've isolated it to an issue with input trajectory length. I hope this helps! If it does get identified and resolved, please do let me know as I will likely go back to using loos for my svd as opposed to MDAnalysis.
Thank you for your time!
Thank you: this summary of the behavior should help us at least analyze the code-paths to see if we can figure out why. I know this feature does at least in principle work (I’ve used it on data sets with 4-5 trajectories, each with lots of frames), so it’s really important we figure this out
One last question: have you tried permuting the order of the trajectories? Meaning, if “traj_1.nc traj_2.nc traj_3.nc traj_4.nc “ only produces _V files for the _1 and _2 trajectories and nothing for the others, what happens if you give it “traj_3.nc traj_4.nc traj_1.nc traj_2.nc”? It still feels to me like it’s failing to read some of the nc files correctly.
-Alan
Hi Mikaela, the -r
option is used to pick frames from the composite (globbed) trajectory. If each trajectory has 10k frames in it and you specify -r 0:10:10000
, then you are only taking frames from the first trajectory (and using a stride of 10). If you want to subsample everything with a stride of 10, then you'd need to say -r 0:10:200000
, or set a stride for each of the globbed trajectories with -k 10
and not use a range. I'm pretty sure this is why the --splitv
option doesn't seem to work.
I also just realized something. To select non-hydrogen atoms, you should use -s 'not hydrogen'
or -s '!hydrogen'
. I think problem is that you weren't actually selecting any atoms. We totally should have thrown an error when selection doesn't select any atoms (if this is the case).
I think we've now resolved where the remaining errors came from. Unless I hear otherwise, I'll close this issue in a couple of days.
Hello, sorry for the delay, I still haven't had the chance to further test this so I will go ahead and close it. Thank you!
How big is too big for a normal svd? Also, is there a way to have big-svd process multiple trajectories for an SVD like svd does?
svd issue:
Warning: prmtop specifies non-rectangular box: ignoring
svd: SKIPPING ALIGNMENT svd: Extracting coordinates... terminate called after throwing an instance of 'std::out_of_range' what(): Bad index for an atom /opt/sge/crc/spool/d12chas537/job_scripts/296198: line 14: 3940703 Aborted (core dumped) svd -r 0:10:10000 -p svd_notH_enzyme_ts2_small_skip10 -N1 --splitv 1 --autoname 1 -S 'name=="!hydrogen"' ../universal-files/apo.ts2-strip.prmtop ts2/imaged/*.nc
I've tried a variety of things, seems that my system (802 residues) is too large because it runs fine on CA.
big-svd issue: does not accept multiple trajectories
For context: I am trying to do a PCA on 20 500ns trajectories of this 802 residue enzyme.
If an example file would be helpful for reproducing, I can provide that as well as more context.