Kohulan / Smiles-TO-iUpac-Translator

Transformer-based SMILES to IUPAC Translator
MIT License

Possibility to change the input of models to accept a batch as a collection of SMILES/names? #26

Open alexey-krasnov opened 3 months ago

alexey-krasnov commented 3 months ago

Dear @Kohulan,

I'm wondering if it is possible to adjust the model so that it can accept multiple inputs at once. For example, the input might be a batch of SMILES strings or names, which would increase the model's throughput.

Right now the only way to handle multiple inputs is to pass them one by one in a for-loop:

# SMILES to IUPAC name translation
smiles_list = [
    'CC(=O)OC(CC(=O)O)C[N+](C)(C)C',
    'CC(CN)O',
    'C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl',
    'CCN1C=NC2=C(N=CN=C21)N',
]

for smiles in smiles_list:
    iupac_name = translate_forward(smiles)
    print(f"IUPAC name of {smiles} is: {iupac_name}")

Do you think it would be possible to accept a batch (a collection of SMILES/names) as input, so that one could write something like this:

# SMILES to IUPAC name translation
smiles_list = [
    'CC(=O)OC(CC(=O)O)C[N+](C)(C)C',
    'CC(CN)O',
    'C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl',
    'CCN1C=NC2=C(N=CN=C21)N',
]

iupac_names = translate_forward(smiles_list)

Is this feasible by changing the input shapes and doing the necessary preprocessing of the input data, or is re-training/fine-tuning the model the only way?

Kohulan commented 3 months ago

Hi @alexey-krasnov,

I kept the translate_forward function as simple as possible for ease of handling within the package. This design allows anyone interested to parallelize the function if needed. Additionally, the model currently accepts only single-line inputs, so handling arrays directly is not feasible.
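For illustration, one way to parallelize the single-input function from the caller's side is a thread-pool wrapper. The sketch below is an assumption, not part of this package: the `translate_forward` body here is a hypothetical placeholder standing in for the real STOUT function, and whether threads actually give a speed-up depends on how much of the TensorFlow inference releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

def translate_forward(smiles: str) -> str:
    """Hypothetical stand-in for STOUT's single-SMILES translate_forward."""
    return f"<IUPAC name for {smiles}>"

def translate_many(smiles_list, max_workers=4):
    """Fan the single-input translator out over a list of SMILES."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(translate_forward, smiles_list))

names = translate_many(['CC(CN)O', 'CCN1C=NC2=C(N=CN=C21)N'])
```

Since the wrapper lives outside the package, it needs no changes to the model or its input shapes.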

Kind regards,
Kohulan

parkyjmit commented 3 months ago

Are there any plans to add a batch feature? This would be a significant improvement, and I would really appreciate having this feature as well.

alexey-krasnov commented 3 months ago

Dear @Kohulan ,

thanks for your explanation. I'll look further into the possibility of parallelization. I ran some quick tests after disabling TensorFlow's native parallelism:

# Set the number of inter-op parallelism threads
tf.config.threading.set_inter_op_parallelism_threads(1)

# Set the number of intra-op parallelism threads
tf.config.threading.set_intra_op_parallelism_threads(1)

and then ran the translation through a ProcessPoolExecutor from concurrent.futures; however, it brought no performance gain. It was actually even slower than the for-loop, probably because of Python's overhead for spawning processes and serializing data.

Best regards, Aleksei

Kohulan commented 3 months ago

Dear @alexey-krasnov ,

Thanks for investigating this. I am happy to take pull requests.

Kind regards, Kohulan