Open Feriolet opened 6 months ago
nvm I was able to get it by using the --add_sql 'docking_score IS NOT NULL ORDER BY "docking_score" asc LIMIT 50'
Yes, that was a trick :) --add_sql is a very powerful option.
It would be a very good addition, for users not familiar with SQL, to provide an option --ntop where one can specify the number of top-scored compounds to return. This will require adding your query to the whole query generated by the script. The drawback is that it may limit the applicability of --add_sql. However, in this case that may not be very important. An ideal implementation could add docking_score IS NOT NULL first, then append the add_sql value, and after that ORDER BY "docking_score" asc LIMIT ${ntop}.
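The ordering described above (NULL filter first, then the add_sql value, then ORDER BY and LIMIT) could be sketched roughly as follows. This is only an illustration: build_query is a hypothetical helper, and the mols table here is a cut-down stand-in with just the id and docking_score columns, not the real schema.

```python
import sqlite3

def build_query(add_sql=None, ntop=None):
    """Hypothetical helper: compose the SELECT in the order discussed above."""
    conditions = ["docking_score IS NOT NULL"]
    if add_sql:
        # The user-supplied condition is wrapped in parentheses so an OR inside it
        # cannot swallow the NULL filter.
        conditions.append(f"({add_sql})")
    sql = "SELECT id, docking_score FROM mols WHERE " + " AND ".join(conditions)
    if ntop is not None:
        # Assumes lower scores are better.
        sql += f" ORDER BY docking_score ASC LIMIT {int(ntop)}"
    return sql

# Stand-in database with a simplified mols table (assumption, not the real schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mols (id TEXT PRIMARY KEY, docking_score REAL)")
con.executemany("INSERT INTO mols VALUES (?, ?)",
                [("a", -7.2), ("b", None), ("c", -9.1), ("d", -5.0)])

top2 = con.execute(build_query(ntop=2)).fetchall()
filtered = con.execute(build_query(add_sql="docking_score > -8", ntop=1)).fetchall()
print(top2)
print(filtered)
```

This also shows the drawback mentioned above: once the script appends its own ORDER BY, an add_sql value that already contains ORDER BY or LIMIT would produce invalid SQL.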
I just realised that get_sdf_from_dock_db has WHERE mol_block IS NOT NULL by default. Since docking_score and mol_block will always go together (either both filled or both NULL), should we still include docking_score IS NOT NULL then? If it is redundant, we can directly use ORDER BY "docking_score" asc LIMIT ${ntop} for the ntop option.
I would say yes, docking_score IS NOT NULL is redundant. Thus, it should be easy to implement this feature by appending the query above if the --ntop option is specified.
There's also an ORDER BY when args.id exists, so does this work? If I'm right, I can't use ORDER BY twice in an SQLite query, and we need a comma to order by multiple conditions. Or is it fine to separate them because they serve different purposes? I assume the purpose of args.id is to get specific ids, which is unrelated to docking score, and that args.ntop does not care which specific ids are extracted?
if args.ids and args.ntop:
    case_str = ' '.join(f'WHEN "{mol_id}" THEN {i}' for i, mol_id in enumerate(ids, 1))
    # one ORDER BY with comma-separated terms; LIMIT has to come after it
    sql += f' ORDER BY docking_score ASC, CASE mols.id {case_str} END LIMIT {args.ntop}'
elif args.ntop:
    # docking_score is left unquoted: in single quotes it would be a string literal, not the column
    sql += f' ORDER BY docking_score ASC LIMIT {args.ntop}'
elif args.ids:
    # https://dba.stackexchange.com/questions/302006/sqlite-return-rows-in-select-in-order
    case_str = ' '.join(f'WHEN "{mol_id}" THEN {i}' for i, mol_id in enumerate(ids, 1))
    sql += f' ORDER BY CASE mols.id {case_str} END'
So, if a user sets both ids and ntop, the script will select the top compounds from the given list. This looks reasonable.
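To illustrate the combined case, here is a minimal sketch of selecting the top compounds from a given list of ids with a single ORDER BY. The table is a simplified stand-in, and the WHERE id IN (...) filter is an assumption about how the ids are restricted; the values are bound as parameters instead of being formatted into the string.

```python
import sqlite3

# Simplified stand-in for the mols table (assumption, not the real schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mols (id TEXT PRIMARY KEY, docking_score REAL)")
con.executemany("INSERT INTO mols VALUES (?, ?)",
                [("m1", -6.0), ("m2", -9.5), ("m3", -8.0), ("m4", -10.0)])

ids = ["m3", "m1", "m2"]  # user-supplied list; m4 is deliberately left out
ntop = 2

placeholders = ",".join("?" for _ in ids)
case_str = " ".join(f"WHEN ? THEN {i}" for i, _ in enumerate(ids, 1))
sql = (
    f"SELECT id, docking_score FROM mols WHERE id IN ({placeholders}) "
    # Score is the primary key; the position in the ids list only breaks ties.
    f"ORDER BY docking_score ASC, CASE id {case_str} END LIMIT ?"
)
# Parameters are bound in the order the placeholders appear: IN list, CASE, LIMIT.
rows = con.execute(sql, (*ids, *ids, ntop)).fetchall()
print(rows)
```

m4 has the best score overall but is not returned, because it is not in the requested list.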
However, there is another complication. For a docking score, lower is not always better. For gnina this is reversed: the larger the better ;) And there is no way to reliably guess the direction.
Solution 1:
write the direction of the docking score to the DB setup table. The corresponding value should be provided by the docking script. In this case we have to add to each docking script a variable DOCKING_SCORE_DIRECTION, which can take two values: -1 if lower is better and 1 if larger is better. This value will be read during initialization(?) (or upon the first call of a docking function) and stored in the DB.
Solution 2:
instead of ntop, implement two arguments, ntop_min and ntop_max, which will select the top compounds with the lowest and the highest scores, respectively. This looks easier but less user-friendly, because it requires the user to understand which scoring function they are working with.
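For comparison, Solution 2 could look roughly like this on the command-line side. This is only a sketch: the flag names follow the proposal above, order_clause is a hypothetical helper, and making the two flags mutually exclusive is my assumption.

```python
import argparse

def order_clause(args):
    """Hypothetical helper: turn the two proposed flags into an ORDER BY clause."""
    if args.ntop_min is not None:
        return f" ORDER BY docking_score ASC LIMIT {args.ntop_min}"
    if args.ntop_max is not None:
        return f" ORDER BY docking_score DESC LIMIT {args.ntop_max}"
    return ""

parser = argparse.ArgumentParser()
# Only one direction makes sense per call, so the flags exclude each other.
group = parser.add_mutually_exclusive_group()
group.add_argument("--ntop_min", type=int, default=None,
                   help="return N compounds with the lowest docking scores")
group.add_argument("--ntop_max", type=int, default=None,
                   help="return N compounds with the highest docking scores")

args = parser.parse_args(["--ntop_max", "10"])
clause = order_clause(args)
print(clause)
```

The user still has to know which flag matches their scoring function, which is the usability cost noted above.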
For solution 2, is it possible to ask the user to specify which docking program they used (args.program) and then sort ascending or descending accordingly? However, this will require the user to remember which program they used to dock (not sure if this is a bad thing per se).
For solution 1, it may be better to store the sorting direction in the database, but I guess only databases created after the feature is implemented can use the ntop option.
Unfortunately, even the same program may return several scores with opposite directions, e.g. gnina, which contains smina and vina inside, whose scores run opposite to the gnina score.
Solution 1 looks more user-friendly, as a user does not need to know the direction of the score. However, its implementation will require some time to think through how best to do it, so that adding new programs in future stays convenient.
I see, then I agree the first solution is the better choice. Maybe we can insert DOCKING_DIRECTION along with the other variables in the setup table? Since the setup table probably stores the information we need from the config file, we can also put it there when calling the create_db function.
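A rough sketch of this idea, with the direction written once at database creation and read back when selecting the top compounds. Everything here is hypothetical: init_db and top_ids are illustrative names, and the one-column setup table is a stand-in, not the actual layout of the setup table.

```python
import sqlite3

def init_db(con, score_direction):
    # Hypothetical schema: the real tables hold more columns than this.
    con.execute("CREATE TABLE setup (score_direction INTEGER)")
    con.execute("CREATE TABLE mols (id TEXT PRIMARY KEY, docking_score REAL)")
    con.execute("INSERT INTO setup VALUES (?)", (score_direction,))

def top_ids(con, ntop):
    # -1 means lower is better, 1 means larger is better.
    direction = con.execute("SELECT score_direction FROM setup").fetchone()[0]
    order = "ASC" if direction == -1 else "DESC"
    sql = ("SELECT id FROM mols WHERE docking_score IS NOT NULL "
           f"ORDER BY docking_score {order} LIMIT ?")
    return [row[0] for row in con.execute(sql, (ntop,))]

scores = [("a", 3.1), ("b", 7.4), ("c", 5.0)]

lower_is_better = sqlite3.connect(":memory:")
init_db(lower_is_better, -1)
lower_is_better.executemany("INSERT INTO mols VALUES (?, ?)", scores)

larger_is_better = sqlite3.connect(":memory:")
init_db(larger_is_better, 1)
larger_is_better.executemany("INSERT INTO mols VALUES (?, ?)", scores)

print(top_ids(lower_is_better, 2))
print(top_ids(larger_is_better, 2))
```

With the direction stored in the database, the extraction script needs no extra argument from the user; databases created before such a change would have no stored value and would need a fallback.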
We have to take into account the possibility of separating the initialization and docking stages, as we discussed here: https://github.com/ci-lab-cz/easydock/issues/38. If a database can be initialized without supplying docking settings, information about the docking program and scoring will not be available when create_db is called; it will become available only once docking starts. To avoid this complication, we may strictly require the user to specify all settings during the init stage, even if docking will not be run immediately.
So, should we try to accommodate the possibility that users separate the initialization and docking stages? My only concern about storing DOCKING_SCORE_DIRECTION during the docking stage is that it will be accessed on every iteration of the docking loop, even though we only need to store/access it once. I am assuming that if we put it in the docking stage, it will be in the mol_dock script or the docking(mols, ...) function in run_dock.
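One way to avoid touching the value on every iteration might be a single guarded write before the docking loop starts. The sketch below assumes a setup table whose score_direction column is left NULL at the init stage; store_direction_once is a hypothetical helper, not an existing function.

```python
import sqlite3

def store_direction_once(con, direction):
    """Hypothetical helper: write the direction only if none is stored yet."""
    cur = con.execute(
        "UPDATE setup SET score_direction = ? WHERE score_direction IS NULL",
        (direction,),
    )
    con.commit()
    # rowcount is 1 when the value was written, 0 when it was already set.
    return cur.rowcount == 1

# Init stage (assumed): the row exists but the direction is not known yet.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE setup (score_direction INTEGER)")
con.execute("INSERT INTO setup VALUES (NULL)")

# Docking stage: called once before the loop over molecules, not inside it.
first = store_direction_once(con, -1)
second = store_direction_once(con, 1)  # a later call does not overwrite
stored = con.execute("SELECT score_direction FROM setup").fetchone()[0]
print(first, second, stored)
```

Because the UPDATE is a no-op once a value exists, restarting an interrupted docking run would leave the stored direction unchanged.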
Can I know if it is possible to use get_sdf_from_dock_db.py to get the top n compounds based on the lowest docking score? My initial plan is to use an SQLite editor to sort by docking_score and save the .db to a new file, and then use the script to get the sdf, but I'm curious if I can use the script directly. I assume that with an SQLite command I would have to use SELECT TOP(n) ..... ORDER BY 'docking_score', but that may break the compatibility of the script. Or maybe using SELECT ..... ORDER BY 'docking_score' LIMIT n is also okay?