Rfam / rfam-production

Rfam production pipeline
Apache License 2.0

Use --top-only for cmscan on 3D structures #146

blakesweeney closed this issue 6 months ago

blakesweeney commented 1 year ago

There is no need to search both strands of a 3D structure. It doesn't make sense to ask, 'does the reverse complement of this structure match a model?', as we only want to know whether the structure itself matches a model. Thus the cmscan step of the 3D alignment pipeline should use --toponly.
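As a rough illustration of the proposed change, here is a minimal Python sketch that assembles a cmscan command line restricted to the top strand. Infernal spells the option `--toponly`; the function name `build_cmscan_command` and the file names are hypothetical and are not taken from the rfam-production code, and the real pipeline almost certainly passes further options that are omitted here.

```python
from typing import List


def build_cmscan_command(
    cm_db: str, seq_file: str, tblout: str, top_only: bool = True
) -> List[str]:
    """Build an argument list for Infernal's cmscan.

    Hypothetical helper for illustration only. cmscan takes its options
    first, then the CM database, then the sequence file.
    """
    cmd = ["cmscan", "--tblout", tblout]
    if top_only:
        # Search only the top (given) strand, skipping the reverse complement.
        cmd.append("--toponly")
    cmd += [cm_db, seq_file]
    return cmd
```

Calling `build_cmscan_command("Rfam.cm", "pdb_chains.fa", "pdb_hits.tbl")` yields the list for `cmscan --tblout pdb_hits.tbl --toponly Rfam.cm pdb_chains.fa`, which could then be handed to `subprocess.run`.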

emmaco commented 1 year ago

Using --toponly, I see the following differences in the results for the families and PDB IDs that were reporting “reverse order” hits.

For RF00254, we see the start and end values reversed:

RF00254 6v5b D 92 12 54.1 5.00E-13 1 81 ebeb30 1
RF00254 6v5c D 92 12 54.1 5.00E-13 1 81 ebeb30 1
RF00254(top-only) 6v5b D 12 92 93.9 2.60E-25 1 81 ff87a4 1
RF00254(top-only) 6v5c D 12 92 93.9 2.60E-25 1 81 ff87a4 1

However, we don’t see this for the other families, in this case RF00005, RF00106, RF01357 and RF01330. For these families, the PDB IDs with reverse-order hits have no hits at all when using --toponly. For example:

This is the original result:

RF01330 7mib J 33 1 51.5 6.60E-13 1 33 8e2511 1

When using --toponly, RF01330 with PDB ID 7mib chain J is not in the output file.

These are the families and IDs with differences: --toponly comparison - Sheet1.csv
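The comparison described in this comment could be automated along the following lines. This is only a sketch: it assumes whitespace-separated rows shaped like the ones quoted here, with the family, PDB ID and chain in the first three columns, start and end in the next two, and what looks like a bit score in the sixth. The names `Hit`, `parse_hit` and `compare_runs` are made up for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    family: str
    pdb_id: str
    chain: str
    start: int
    end: int
    score: float  # assumed to be the bit score


def parse_hit(line: str) -> Hit:
    """Parse one whitespace-separated result row (column layout is assumed)."""
    f = line.split()
    family = f[0].split("(")[0]  # drop a "(top-only)" label if present
    return Hit(family, f[1], f[2], int(f[3]), int(f[4]), float(f[5]))


def compare_runs(
    before: Iterable[Hit], after: Iterable[Hit]
) -> Dict[Tuple[str, str, str], str]:
    """Classify each original hit by what happened to it with --toponly."""
    after_by_key = {(h.family, h.pdb_id, h.chain): h for h in after}
    report = {}
    for hit in before:
        key = (hit.family, hit.pdb_id, hit.chain)
        new = after_by_key.get(key)
        if new is None:
            report[key] = "lost"  # no hit at all in the --toponly run
        elif hit.start > hit.end and new.start < new.end:
            report[key] = "flipped"  # reverse-order hit now in forward order
        else:
            report[key] = "unchanged"
    return report


# Rows copied from this comment.
before = [
    parse_hit("RF00254 6v5b D 92 12 54.1 5.00E-13 1 81 ebeb30 1"),
    parse_hit("RF01330 7mib J 33 1 51.5 6.60E-13 1 33 8e2511 1"),
]
after = [
    parse_hit("RF00254(top-only) 6v5b D 12 92 93.9 2.60E-25 1 81 ff87a4 1"),
]
report = compare_runs(before, after)
```

With these rows, `report` marks RF00254 / 6v5b / D as "flipped" and RF01330 / 7mib / J as "lost", matching the two behaviours described above.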

blakesweeney commented 1 year ago

I've looked at the changes and here is what I think.

Overall, I'd say merging it should be fine.