Pick Top n compound with best docking score with `get_sdf_from_dock_db.py`

Feriolet commented 6 months ago

Is it possible to use get_sdf_from_dock_db.py to get the top n compounds with the lowest docking scores? My initial plan was to use an SQLite editor to sort by docking_score, save the .db to a new file, and then use the script to get the SDF, but I'm curious whether I can use the script directly.

I assume that with an SQLite command I would have to use SELECT TOP(n) ..... ORDER BY docking_score, but that may break the compatibility of the script. Or maybe SELECT ..... ORDER BY docking_score LIMIT n is also okay?

Feriolet commented 6 months ago

nvm I was able to get it by using the --add_sql 'docking_score IS NOT NULL ORDER BY "docking_score" asc LIMIT 50'
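For readers who want to see what that clause does, here is a minimal sketch on an in-memory database. The three-column `mols` table is a simplified stand-in (an assumption; the real database has more columns), and the toy scores are made up for illustration.

```python
import sqlite3

# In-memory stand-in for a docking database (simplified, assumed schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mols (id TEXT PRIMARY KEY, docking_score REAL, mol_block TEXT)")
con.executemany(
    "INSERT INTO mols VALUES (?, ?, ?)",
    [("a", -7.1, "..."), ("b", -9.3, "..."), ("c", None, None), ("d", -8.0, "...")],
)

n = 2
# Same condition and ordering as the --add_sql value above, with LIMIT bound as a parameter.
rows = con.execute(
    "SELECT id, docking_score FROM mols "
    "WHERE docking_score IS NOT NULL "
    'ORDER BY "docking_score" ASC LIMIT ?',
    (n,),
).fetchall()
print(rows)  # the two lowest-scoring molecules: b, then d
```

Note that in SQLite a single-quoted 'docking_score' is a string literal rather than a column name, so the column has to be left unquoted or double-quoted for the sort to take effect.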

DrrDom commented 6 months ago

Yes, that was a trick :) --add_sql is a very powerful option.

This would be a very good addition for users not familiar with SQL: an option --ntop, where one can specify the number of top-scored compounds to return. This will require adding your query to the whole query generated by the script. The drawback is that it may limit --add_sql applicability. However, in this case that may not be that important. An ideal implementation could add docking_score IS NOT NULL first, then append the add_sql value, and after that ORDER BY "docking_score" asc LIMIT ${ntop}.
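The assembly order described above could look roughly like this. It is only a sketch: `build_query` is a hypothetical helper, not a function from the script, and the simplified `mols` table and toy data are assumptions.

```python
import sqlite3


def build_query(add_sql=None, ntop=None):
    """Assemble the SELECT in the proposed order (hypothetical helper)."""
    sql = "SELECT mol_block FROM mols WHERE docking_score IS NOT NULL"
    if add_sql:
        # The user's condition is ANDed in before any ordering is applied.
        sql += f" AND ({add_sql})"
    if ntop is not None:
        # int() guards against anything but a number ending up in the query.
        sql += f' ORDER BY "docking_score" ASC LIMIT {int(ntop)}'
    return sql


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mols (id TEXT, docking_score REAL, mol_block TEXT)")
con.executemany(
    "INSERT INTO mols VALUES (?, ?, ?)",
    [("a", -7.1, "A"), ("b", -9.3, "B"), ("c", None, None), ("d", -8.0, "D")],
)

top = [row[0] for row in con.execute(build_query(add_sql="docking_score < -7.5", ntop=1))]
```

Wrapping add_sql in parentheses keeps an OR inside it from changing the meaning of the base condition, but it also means add_sql can no longer carry its own ORDER BY or LIMIT, which is the loss of applicability mentioned above.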

Feriolet commented 6 months ago

I just realised that get_sdf_from_dock_db has WHERE mol_block IS NOT NULL by default. Since docking_score and mol_block will always be together (either filled or NULL), should we still include docking_score IS NOT NULL then? If it is redundant, then we can immediately use the ORDER BY "docking_score" asc LIMIT ${ntop} for the ntop option

DrrDom commented 6 months ago

I would say yes, docking_score IS NOT NULL is redundant. Thus, it should be easy to implement this feature by appending the query above if the --ntop option is specified.

Feriolet commented 6 months ago

There's also an ORDER BY when args.ids exists, so does this work? If I'm right, I can't use ORDER BY twice in an SQLite query, and we need a comma to order by multiple conditions. Or is it fine to separate them because they serve different purposes? I assume the purpose of args.ids is to get specific ids regardless of docking score, and args.ntop does not care which specific ids are extracted?

    if args.ids and args.ntop:
        # one ORDER BY with comma-separated terms; LIMIT must come after all of them
        case_str = ' '.join(f'WHEN "{mol_id}" THEN {i}' for i, mol_id in enumerate(ids, 1))
        sql += f" ORDER BY docking_score ASC, CASE mols.id {case_str} END LIMIT {args.ntop}"
    elif args.ntop:
        # no quotes around the column: 'docking_score' would be a string literal;
        # no "$" before the braces: this is a Python f-string, not a shell variable
        sql += f" ORDER BY docking_score ASC LIMIT {args.ntop}"
    elif args.ids:
        # https://dba.stackexchange.com/questions/302006/sqlite-return-rows-in-select-in-order
        case_str = ' '.join(f'WHEN "{mol_id}" THEN {i}' for i, mol_id in enumerate(ids, 1))
        sql += f" ORDER BY CASE mols.id {case_str} END"

DrrDom commented 6 months ago

So, if a user sets both ids and ntop, the script will select the top compounds from the given list. This looks reasonable.

However, there is another complication. A lower docking score is not always better. For gnina this is reversed: the larger the better ;) And there is no way to reliably guess the direction.
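A small sketch of that behaviour, assuming the ids are applied as a WHERE ... IN filter (how the script actually restricts ids is not shown in this thread) and using a simplified `mols` table with made-up scores:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mols (id TEXT PRIMARY KEY, docking_score REAL, mol_block TEXT)")
con.executemany(
    "INSERT INTO mols VALUES (?, ?, ?)",
    [("a", -7.1, "A"), ("b", -9.3, "B"), ("c", None, None), ("d", -8.0, "D"), ("e", -10.2, "E")],
)

ids = ["d", "a", "b"]  # the user's list; "e" scores best overall but is not in it
ntop = 2

placeholders = ",".join("?" * len(ids))  # "?,?,?"
sql = (
    "SELECT id, docking_score FROM mols "
    f"WHERE mol_block IS NOT NULL AND id IN ({placeholders}) "
    "ORDER BY docking_score ASC LIMIT ?"
)
rows = con.execute(sql, (*ids, ntop)).fetchall()
print(rows)  # the best two among d, a, b
```

Binding the ids and the limit as parameters avoids having to quote the ids by hand.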

Solution 1: write the direction of the docking score to the DB setup table. The corresponding value should be provided by the docking script. In this case we have to add to each docking script a variable DOCKING_SCORE_DIRECTION which can take two values: -1 if lower is better and 1 if larger is better. This value will be read during initialization(?) (or upon the first call of a docking function) and stored in the DB.
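A rough sketch of how Solution 1 might be used on the extraction side. The one-column `setup` table and the column name `score_direction` are hypothetical (the real setup table layout is not shown here), and `top_ids` is an illustrative helper, not part of the script.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mols (id TEXT PRIMARY KEY, docking_score REAL, mol_block TEXT)")
con.execute("CREATE TABLE setup (score_direction INTEGER)")  # hypothetical column
con.execute("INSERT INTO setup VALUES (?)", (1,))  # 1: larger is better
con.executemany(
    "INSERT INTO mols VALUES (?, ?, ?)",
    [("a", 0.61, "A"), ("b", 0.93, "B"), ("c", None, None), ("d", 0.80, "D")],
)


def top_ids(con, ntop):
    """Return the ids of the ntop best molecules according to the stored direction."""
    (direction,) = con.execute("SELECT score_direction FROM setup").fetchone()
    order = "DESC" if direction == 1 else "ASC"
    sql = (
        "SELECT id FROM mols WHERE mol_block IS NOT NULL "
        f"ORDER BY docking_score {order} LIMIT ?"
    )
    return [row[0] for row in con.execute(sql, (ntop,))]


larger_is_better = top_ids(con, 2)

con.execute("UPDATE setup SET score_direction = -1")  # pretend lower is better
lower_is_better = top_ids(con, 2)
```

With this approach the user only passes the number of compounds; the sort direction comes from the database.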

Solution 2: instead of ntop, implement two arguments, ntop_min and ntop_max, which will select the top compounds with the lowest and highest scores, respectively. This looks easier but less user-friendly, because it requires the user to understand which scoring function they are working with.
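Solution 2 could be sketched with argparse as follows. The option names follow the proposal above, and `order_clause` is a hypothetical helper; neither exists in the script yet.

```python
import argparse

parser = argparse.ArgumentParser()
# The two options contradict each other, so argparse is asked to reject using both.
group = parser.add_mutually_exclusive_group()
group.add_argument("--ntop_min", type=int, metavar="N",
                   help="return N compounds with the lowest docking scores")
group.add_argument("--ntop_max", type=int, metavar="N",
                   help="return N compounds with the highest docking scores")


def order_clause(args):
    """Translate the chosen option into an ORDER BY ... LIMIT suffix."""
    if args.ntop_min is not None:
        return f" ORDER BY docking_score ASC LIMIT {args.ntop_min}"
    if args.ntop_max is not None:
        return f" ORDER BY docking_score DESC LIMIT {args.ntop_max}"
    return ""


args = parser.parse_args(["--ntop_max", "10"])
clause = order_clause(args)
print(clause)
```

The user still has to know which of the two options matches their scoring function, which is the usability cost noted above.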

Feriolet commented 6 months ago

For solution 2, is it possible to ask the user to specify which docking program they used (args.program) and then sort in ascending or descending order accordingly? However, this will require the user to remember which program they used to dock (not sure if this is a bad thing per se).

For solution 1, it may be better to store the sorting direction in the database, but I guess only databases created after the feature is implemented can use the ntop option.

DrrDom commented 6 months ago

Unfortunately, even the same program may return several scores with opposite directions, e.g. gnina, which contains smina and vina inside, whose scores run opposite to the gnina score.

Solution 1 looks more user-friendly, as a user does not need to know the direction of the score. However, its implementation will require some time to think through how to do this in the best way, so that it is convenient to add new programs in the future.

Feriolet commented 6 months ago

I see, then I agree the first solution is the better choice. Maybe we can insert DOCKING_SCORE_DIRECTION along with the other variables in the setup table? The setup table probably already holds the information from the config file, so we can also put it there when calling the create_db function.

DrrDom commented 6 months ago

We have to take into account the possibility of separating the initialization and docking stages, as we discussed here https://github.com/ci-lab-cz/easydock/issues/38. If a database may be initialized without supplying docking settings, information about the docking program and scoring will not be available when create_db is called; it will become available only once docking starts. To avoid this complication we may strictly require the user to specify all settings at the init stage, even if docking will not be run immediately.

Feriolet commented 6 months ago

So, should we try to accommodate the possibility that users separate the initialization and docking stages? My only concern with storing DOCKING_SCORE_DIRECTION during the docking stage is that it will be accessed on every iteration of the docking loop, even though we only need to store/access it once. I'm assuming that if we put it in the docking stage, it will be in the mol_dock script or the docking(mols, ...) function in run_dock.
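One way to address that concern might be to write the direction a single time before the docking loop starts, and only if it has not been set yet. This is just a sketch: the `setup` table layout, the `score_direction` column, and `store_direction_once` are all hypothetical.

```python
import sqlite3


def store_direction_once(con, direction):
    """Write the direction only if none is stored yet; return the stored value."""
    con.execute(
        "UPDATE setup SET score_direction = ? WHERE score_direction IS NULL",
        (direction,),
    )
    con.commit()
    return con.execute("SELECT score_direction FROM setup").fetchone()[0]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE setup (score_direction INTEGER)")  # hypothetical column
con.execute("INSERT INTO setup VALUES (NULL)")  # DB initialised without docking settings

first = store_direction_once(con, 1)    # called once, before the docking loop
second = store_direction_once(con, -1)  # a later call does not overwrite the value
```

Calling this once before iterating over molecules would keep the per-molecule docking function free of any database bookkeeping.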

Pick Top n compound with best docking score with `get_sdf_from_dock_db.py` #37