UCLOrengoGroup / cath-tools

Protein structure comparison tools such as SSAP and SNAP
http://cath-tools.readthedocs.io
GNU General Public License v3.0

cath.superpose ssaps files optimization ? #76

Open tubiana opened 3 years ago

tubiana commented 3 years ago

Dear all,

I'm facing challenging alignments (several times 1000+ structures). Since cath.superpose checks whether the ssaps files already exist, I found a way to speed up alignments by re-executing the cath.superpose command with the files in a random order in the arguments (code example below, in case it is useful to someone).

But here's my question: I realised that ssaps files are computed for both orderings of each pair:

(base) thibault@XXX [XXX]/ssaps $ ls -l | grep A1A4S6 | grep B1AVH7
-rw-r--r-- 1 thibault ansatt     3080 Aug 21 10:20 A1A4S6.pdbB1AVH7.pdb.list
-rw-r--r-- 1 thibault ansatt       62 Aug 21 10:20 A1A4S6.pdbB1AVH7.pdb.scores
-rw-r--r-- 1 thibault ansatt     3080 Aug 21 15:37 B1AVH7.pdbA1A4S6.pdb.list
-rw-r--r-- 1 thibault ansatt       62 Aug 21 15:37 B1AVH7.pdbA1A4S6.pdb.scores
(base) thibault@XXX [XXX]/ssaps $ cat A1A4S6.pdbB1AVH7.pdb.scores
A1A4S6.pdb  B1AVH7.pdb  108   99  85.49   97   89   15   3.34
(base) thibault@XXX [XXX]/ssaps $ cat B1AVH7.pdbA1A4S6.pdb.scores
B1AVH7.pdb  A1A4S6.pdb   99  108  85.49   97   89   15   3.34

In some cases, I can have more than 10 million files in the same folder... I was wondering whether there is a particular reason to generate both orderings of each pair? Maybe cath.superpose could gain in efficiency and storage if only one ordering of each pair were generated?
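For anyone who wants to check how many pairs are stored twice in their own ssaps directory, here is a rough sketch. It assumes that every `*.scores` file holds a single line whose first two fields are the two structure names, as in the listing above; the function name `list_duplicated_pairs` is only illustrative and is not part of cath-tools.

```shell
#!/usr/bin/env bash
# Print every pair of structures that has score files in both orderings.
# Assumption: the first two whitespace-separated fields of each *.scores
# line are the two structure names (as in the listing above).
list_duplicated_pairs() {
  local ssaps_dir=$1
  # find ... -exec ... {} + batches the files, so a directory with millions
  # of entries does not overflow the shell's argument list.
  find "$ssaps_dir" -maxdepth 1 -name '*.scores' \
       -exec awk '{ print $1, $2 }' {} + |
    awk '{
           # Order the two names alphabetically so that A/B and B/A share a key
           key = ($1 < $2) ? ($1 " " $2) : ($2 " " $1)
           seen[key]++
         }
         END { for (k in seen) if (seen[k] > 1) print k }' |
    sort
}
```

Calling `list_duplicated_pairs ssaps | wc -l` would then give the number of pairs that exist in both orderings.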

Wishing you a nice day 🙂 Best regards, Thibault.


Code example for running cath.superpose with random file order

export CATH_TOOLS_PDB_PATH=$WORKDIR
# Collect one --pdb-infile argument per structure, in random order
pdbinfile=()
while IFS= read -r pdb
do
  pdbinfile+=(--pdb-infile "$pdb")
done < <(printf '%s\n' "$WORKDIR"/*.pdb | sort -R)
#echo "${pdbinfile[@]}"
cath-superpose --do-the-ssaps ssaps --sup-to-pdb-files-dir output "${pdbinfile[@]}"

tonyelewis commented 3 years ago

Thank you for using cath-superpose and for giving us some of your feedback - much appreciated.

I'm not 100% clear about your point about things being sped up by randomising the order of the inputs. Is the point that you're using the --do-the-ssaps option of cath-superpose and you're running several of these at the same time? So you're using the randomisation as a way to parallelise the SSAPs that generate the alignments? In which case, it sounds like it would be valuable to you if there was an option to tell --do-the-ssaps to run n SSAP jobs in parallel. Is that correct?

In general, I think you're right that this area feels like it could be improved. We did enough work in this area to start generating good multiple structural alignments and to build something usable but we think we could do much better on the current trade-off between quality and computation time and on figuring out which SSAPs don't need to be performed.

However, for the issue you're talking about, I think we've already exploited the symmetry of only needing one alignment for each pair of structures: the code only runs SSAP on, and uses, each pair in one order, with whichever structure was specified first on the command line coming first. So I suspect what's happening is that your randomisation also randomises the ordering it requires for each pair.

Does that sound right? Does this reinforce the idea that you'd benefit from an in-built way to parallelise the --do-the-ssaps?
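
To illustrate the point about ordering, here is a minimal sketch of how the orientation of each pair follows from the order of the inputs. This is not cath-superpose's actual code: the function name `expected_score_files` is hypothetical, the `<first><second>.scores` naming is inferred from the listing earlier in this thread, and it enumerates every pair purely for illustration.

```shell
#!/usr/bin/env bash
# Illustration only, not cath-superpose's real logic.
# For inputs given in command-line order, print the score file name that
# would be looked up for each pair, with the earlier input always first.
expected_score_files() {
  local -a inputs=("$@")
  local i j
  for ((i = 0; i < ${#inputs[@]}; i++)); do
    for ((j = i + 1; j < ${#inputs[@]}; j++)); do
      # File name pattern inferred from the ls output in this thread
      printf '%s%s.scores\n' "${inputs[i]}" "${inputs[j]}"
    done
  done
}

expected_score_files A1A4S6.pdb B1AVH7.pdb   # prints A1A4S6.pdbB1AVH7.pdb.scores
expected_score_files B1AVH7.pdb A1A4S6.pdb   # prints B1AVH7.pdbA1A4S6.pdb.scores
```

Under these assumptions, two runs with differently shuffled inputs ask for opposite orientations of the same pair, which would explain both sets of files appearing. Keeping the input order stable between runs (for example a plain `sort` instead of `sort -R`) would keep the requested orientation stable as well.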