GrossfieldLab / loos

LOOS: a lightweight object-oriented structure analysis library
http://grossfieldlab.github.io/loos/
GNU General Public License v3.0
122 stars 26 forks source link

svd vs big-svd, core dumping over atomic indices #106

Closed mmfarrugia closed 1 year ago

mmfarrugia commented 1 year ago

How big is too big for a normal svd? Also, is there a way to have big-svd process multiple trajectories for an SVD like svd does?

svd issue:

Warning: prmtop specifies non-rectangular box: ignoring

svd: SKIPPING ALIGNMENT svd: Extracting coordinates... terminate called after throwing an instance of 'std::out_of_range' what(): Bad index for an atom /opt/sge/crc/spool/d12chas537/job_scripts/296198: line 14: 3940703 Aborted (core dumped) svd -r 0:10:10000 -p svd_notH_enzyme_ts2_small_skip10 -N1 --splitv 1 --autoname 1 -S 'name=="!hydrogen"' ../universal-files/apo.ts2-strip.prmtop ts2/imaged/*.nc

I've tried a variety of things, seems that my system (802 residues) is too large because it runs fine on CA.

big-svd issue: does not accept multiple trajectories

For context: I am trying to do a PCA on 20 500ns trajectories of this 802 residue enzyme.
If an example file would be helpful for reproducing, I can provide that as well as more context.

agrossfield commented 1 year ago

It looks to me like what you're doing is pretty reasonable.

I have a few disparate thoughts:

One last thing: I note from the output you showed that you're not requesting an alignment first. That means the code assumes you've already aligned the trajectories. If that's NOT the case, svd will produce nonsense. If you've already pre-aligned the trajectories, great, but if not you'll need to add something like -A 'name=="CA"' to the command line. That has nothing to do with what's going wrong for now, but I figured it couldn't hurt to mention it.

tromo commented 1 year ago

Hi, Alan's right...there is no limit other than the amount of memory you have. The svd tool will spit out an estimate of the memory that it will require. You can check that to see if you need to consider switching to big-svd.

You are not reaching the part where svd can estimate the memory usage because it is failing while trying to build the source matrix. The error is a bad atom index which means that it's trying to find an atom that doesn't exist. This could be because of a mismatch between the model and trajectory files. I'd double check the files and make sure they all have the same number of atoms and in the same positions. It's also possible that something went wrong with the -S option. If you don't specify it, does it work? If it does, then try -S backbone and see if that works.

Unfortunately, there is no way to get big-svd to process multiple trajectories. big-svd was really mean to be a specialized tool for cases where the memory demands of svd was too much, and it requires you to do all of the pre-processing by hand (i.e. aligning and concatenating trajectories).

tromo commented 1 year ago

Hi Mikaela, another couple of suggestions for diagnosing the issue... Try running svd on a single trajectory with verbose output, i.e. -v 2 and your selection, -S '!hydrogen', and don't worry about the auto-split options. If that works, then try running it on two trajectories. If that works, then I'm inclined to think it's a problem with one (or more) of the trajectories. If it doesn't work, try repeating the same trajectory twice on the command line, i.e. svd -p test -S '!hydrogen' foo.pdb foo.dcd foo.dcd.

You can use subsetter to combine the trajectories and extract the svd selection. Again, add -v 2 to get verbose output. See if that works and, if it does, inspect the trajectory to make sure everything looks ok. You can also try running svd on this combined trajectory to see if it works (or doesn't).

There is an assumption by these tools that all trajectories have exactly the same model. Each trajectory must have the exact same number of atoms and they must be in the same order.

If everything checks out, but svd is still failing, then we'll need to investigate further. At that point, a minimal test system that demonstrates the issue that you'd be willing to share would be very helpful. We'll see if we can figure out what's going on...

mmfarrugia commented 1 year ago

Thank you for an incredibly quick response.

My problem is resolved by using the backbone or CB selection, but if this error is of concern then I'd be happy to send along further info or files from the troubleshooting you mentioned to clarify if it is a code issue. However, I do notice upon getting it to work for multiple trajectories that it seems to only outputting the V files for only the first 2/20 of my trajectories (assuming globbed in the typical order) so I will be using the verbosity and the aforementioned strategies to troubleshoot this and make sure that my SVD is being done on all 20 files.

Usually with github issues I don't expect to hear back for at least a week but this was lightning fast. Thank you again for such a speedy reply!

agrossfield commented 1 year ago

Glad we were helpful, though I'm not going to close this until we resolve the problem only outputting some _V files for you -- that should work correctly.

If you could figure out a minimal example that triggers one or both problems, that would let us try to figure out what's going wrong for you. It might be make sense to make downsampled versions of 5 of the trajectories with like 5 frames each, and send that to us if it triggers the problem. If that's still too big for GitHub to let you upload, let us know and I'll set up some other mechanism of transfer.

--Alan

agrossfield commented 1 year ago

One more question/comment. In your reply, you mention you have different topologies for the enzyme from different parts of the functional cycle. LOOS assumes that every frame in the trajectory (and the multiple trajectories are handled as if they were one) has the same number of atoms. If you have trajectories with differing numbers of atoms, weird things can happen. If there's a deletion, you could end up with a frame shift causing the calculation to be erroneous. If there's an insertion (so the trajectory has more atoms than the model), we think you could get the problem you're seeing (though we're having trouble tracking down exactly where it would happen).

If everything is working for you, cool. If you can find a subset of files sufficient to trigger the problem, we'll finish diagnosing it.

Cheers,

Alan

mmfarrugia commented 1 year ago

In this case, I'm only using trajectories from one part of the reaction. I only run into that issue when I am doing reference-based calculations because my reference structure is the ground state structure. I'm pretty certain that there are no topology issues because the same trajectories and topology pair works fine for my rmsd calculations via cpptraj.

I haven't had the chance to pare it down to a useful test case, but I will do that tomorrow morning and upload the files. However, I did do the same commands with the verbosity set to 2, and it certainly seems to be loading in the trajectories correctly:

Warning: prmtop specifies non-rectangular box: ignoring

Note- composite frame range used was '0:2:10000'

traj start end filename

0 0 9999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/60-3575-imaged.nc

1 10000 19999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/60-4689-imaged.nc

2 20000 29999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/60-7170-imaged.nc

3 30000 39999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/62-3180-imaged.nc

4 40000 49999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/6-3572-imaged.nc

5 50000 59999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/6-3726-imaged.nc

6 60000 69999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/6-6261-imaged.nc

7 70000 79999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/67-6112-imaged.nc

8 80000 89999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/67-6258-imaged.nc

9 90000 99999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/67-6514-imaged.nc

10 100000 109999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/67-7911-imaged.nc

11 110000 119999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/6-8180-imaged.nc

12 120000 129999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/82-6570-imaged.nc

13 130000 139999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/87-7758-imaged.nc

14 140000 149999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/91-6107-imaged.nc

15 150000 159999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/91-7366-imaged.nc

16 160000 169999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/95-5571-imaged.nc

17 170000 179999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/95-7572-imaged.nc

18 180000 189999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/97-1256-imaged.nc

19 190000 199999 /scratch365/mfarrugi/HMGR/500ns/ts2/imaged/97-5517-imaged.nc

svd: SKIPPING ALIGNMENT svd: Extracting coordinates... svd: Allocating estimated 2.177 GB for 14034 x 5001 SVD svd: SVD requests 25464090 extra space for a grand total of 2.36643 GB svd: Calculating SVD... svd: Done! Calculation took 45m54s svd: Writing results... svd: done!

I will also triple check the atom ordering as you suggested because I do know that amber sometimes reorders my atoms when it tries to be a little too helpful since I have custom atom types and residues.

Thank you again for your help and I will send over those files ASAP via github.

agrossfield commented 1 year ago

I appreciate it, Mikaela -- at this point, you're helping us more than we're helping you.

Cheers,

Alan

mmfarrugia commented 1 year ago

Oh no worries, it shouldn't take me long and I do like how quick the loos library is to use so far.

I double checked the atom order of the topology vs trajs and according to cpptraj no atom reordering is needed.

When reducing it to only 4 full-length trajectories (10,000 frames, only looking at every other, using the same range as before) it still only output the first two trajectories' V matrices. However, reducing those same 4 trajectories to 5 frames for the test case seems to have resulted in svd outputting all 4 V matrices as opposed to just the first two in the glob. There seems to be no difference in the verbose output between these different cases (other than what you would expect with the initial inventory of input since the input is of course different). I tried using the range argument to reduce the number of frames to equal that of the clipped trajectories which worked ^, but it seems that this results in the same issue of only the first two trajectories in the glob outputting V matrices. Thus, I can't really upload a test case for it failing. I'd need to test a bunch of different trajectories in order to find a reproducible case to upload and it seems like, by nature of it not working, I likely won't be able to upload it.

I do think that if you increase the length of the trajectories input then you should see this same behavior though as I think I've isolated it to an issue with input trajectory length. I hope this helps! If it does get identified and resolved, please do let me know as I will likely go back to using loos for my svd as opposed to MDAnalysis.

Thank you for your time!

agrossfield commented 1 year ago

Thank you: this summary of the behavior should help us at least analyze the code-paths to see if we can figure out why. I know this feature does at least in principle work (I’ve used it on data sets with 4-5 trajectories, each with lots of frames), so it’s really important we figure this out

One last question: have you tried permuting the order of the trajectories? Meaning, if “traj_1.nc traj_2.nc traj_3.nc traj_4.nc “ only produces _V files for the _1 and _2 trajectories and nothing for the others, what happens if you give it “traj_3.nc traj_4.nc traj_1.nc traj_2.nc”? It still feels to me like it’s failing to read some of the nc files correctly.

-Alan

tromo commented 1 year ago

Hi Mikaela, the -r option is used to pick frames from the composite (globbed) trajectory. If each trajectory has 10k frames in it and you specify -r 0:10:10000, then you are only taking frames from the first trajectory (and using a stride of 10). If you want to subsample everything with a stride of 10, then you'd need to say -r 0:10:200000, or set a stride for each of the globbed trajectories with -k 10 and not use a range. I'm pretty sure this is why the --splitv option doesn't seem to work.

I also just realized something. To select non-hydrogen atoms, you should use -s 'not hydrogen' or -s '!hydrogen'. I think problem is that you weren't actually selecting any atoms. We totally should have thrown an error when selection doesn't select any atoms (if this is the case).

agrossfield commented 1 year ago

I think we've now resolved where the remaining errors came from. Unless I hear otherwise, I'll close this issue in a couple of days.

mmfarrugia commented 1 year ago

Hello, sorry for the delay, I still haven't had the chance to further test this so I will go ahead and close it. Thank you!