MolecularAI / aizynthfinder

A tool for retrosynthetic planning
https://molecularai.github.io/aizynthfinder/
MIT License
562 stars 128 forks

A question regarding how to get the template_hash column in the training data #123

Closed philipyang1 closed 1 year ago

philipyang1 commented 1 year ago

I noticed that the comment for template_hash is "a unique hash for each unique template. Will be used to filter uncommon templates.".
I have two questions here. Many thanks! 1) If the template_hash is unique, why do we need to apply "group by" on template_hash here? In which case would the template_group.size() value be greater than 1 instead of exactly 1?

template_group = full_data.groupby(template_hash_col)
template_group = template_group.size().sort_values(ascending=False)
min_index = template_group[template_group >= config["template_occurrence"]].index

2) If I have a retro-template for a reaction, how can I get the template_hash for this template? Can I use a hash method like the one below?

retro_canonical = products_string + '>>' + reactants_string
hexhash = retro_canonical.apply(lambda x: hashlib.sha256(x.encode('utf-8')).hexdigest())

Many thanks! Philip Yang

SGenheden commented 1 year ago
  1. The code you are seeing groups the reactions by template hash. The hash is unique per template, not per reaction, and many reactions can be represented by the same template. The grouping is therefore used to count how many reactions there are per unique template.

  2. The template hash is calculated by the rxnutils package (see https://molecularai.github.io/reaction_utils/templates.html). The hash_from_bits method is the one used in the latest publicly available models for AiZynthFinder.
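
To make the two points above concrete, here is a minimal sketch in plain Python. It makes several assumptions: the reaction records and template strings are made-up placeholders, `toy_template_hash` is a hypothetical stand-in that uses SHA-256 of the template string (it is not the rxnutils hash_from_bits method and will not reproduce the hashes in the published training data), and `Counter` is used as a rough equivalent of the pandas `groupby(...).size()` step.

```python
import hashlib
from collections import Counter

# Hypothetical toy data: (reaction id, retro-template string) pairs.
# The template strings are placeholders, not real reaction SMARTS.
reactions = [
    ("rxn1", "template_A"),
    ("rxn2", "template_A"),
    ("rxn3", "template_B"),
    ("rxn4", "template_A"),
]


def toy_template_hash(template: str) -> str:
    """Stand-in hash for illustration only (NOT rxnutils' hash_from_bits)."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


def count_per_template(rows):
    """Count reactions per template hash, like groupby(hash_col).size()."""
    return Counter(toy_template_hash(template) for _, template in rows)


def frequent_hashes(counts, min_occurrence):
    """Keep hashes seen at least min_occurrence times, most common first."""
    return [h for h, n in counts.most_common() if n >= min_occurrence]


counts = count_per_template(reactions)
kept = frequent_hashes(counts, min_occurrence=2)
```

Three of the four toy reactions share one template, so that hash has a group size of 3, while the other has a group size of 1 and is dropped when the threshold is 2. This is the sense in which the hash is unique per template while the group size can exceed 1.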

philipyang1 commented 1 year ago

@SGenheden, thanks for your answers.

  1. If the same template can represent many reactions, this grouping does make sense ;-). I also asked ChatGPT, which explained this case and gave me an example of it ;-)
  2. Thanks again. I will check this package.

Philip Yang