deepchem / deepchem

Democratizing Deep-Learning for Drug Discovery, Quantum Chemistry, Materials Science and Biology
https://deepchem.io/
MIT License
5.5k stars 1.68k forks source link

Training a Normalizing Flow: “module 'selfies' has no attribute 'multiple_selfies_to_hot'” #2308

Open alat-rights opened 3 years ago

alat-rights commented 3 years ago

Hey everyone,

I’m modifying @ncfrey ’s Normalizing Flow tutorial to work on the MolNet BBBP dataset, and I’m facing an issue.

The code block that throws the error is: onehots = sf.multiple_selfies_to_hot(selfies_list, largest_selfie_len, selfies_alphabet)

The error thrown is: module 'selfies' has no attribute 'multiple_selfies_to_hot'

Was wondering if someone knew why that could be the case?

rbharath commented 3 years ago

@alat-rights I'll defer to @ncfrey but I think that you might possibly have an old version of selfies? I think multiple_selfies_to_hot was a new utility method @ncfrey added in the most recent version of selfies

alat-rights commented 3 years ago

Interesting. I ran the import code given by Nathan’s tutorial in Colab. I’ll take a closer look and report back.

EDIT Hey! So I looked over the code and made sure that I was importing the latest version of SELFIES, selfies-1.0.2. I think the issue is something else.

EDIT Hey, update: my friend thinks that the issue might be that 'multiple_selfies_to_hot' isn't actually a documented fn. (I found it in a buried file called data_loader.py in SELFIES). Could it be that the function can't be called because it's not exported? In that case, how did the tutorial run it?

alat-rights commented 3 years ago

Hey @seyonechithrananda could you take a look here? I could only find multiple_selfies_to_hot in two data_loader.py files, one in examples and one in code from original code. Outside, there is a batch_selfies_to_flat_hot and I tried to call that instead but it behaves differently, apparently.

I was wondering if you could help give me some insight? How should I go about calling multiple_selfies_to_hot? Have there been changes to it in SELFIES since Sept. when @ncfrey wrote the tutorial?

Thank you so much in advance!

EDIT I tried rolling back to v1.0.1 from v1.0.2 and it did not resolve or change the issue.

ncfrey commented 3 years ago

Hi @alat-rights, thanks for this catch! It looks like they have renamed this function to batch_selfies_to_flat_hot in selfies.utils, so the tutorial needs to be updated. If you're running the latest version of selfies you should be able to modify the function name to get it to work. https://github.com/aspuru-guzik-group/selfies/blob/64b144e1fc864c1246b8c0c2c7c36ff4696057c2/selfies/utils.py#L198-L202

alat-rights commented 3 years ago

Hi @ncfrey ! Thanks for the timely reply. I tried to call the new function with the same argvs but I'm getting a TypeError?

image

EDIT I suspect this to be an issue with batch_selfies_to_flat_hot taking different params.

Here are the params taken by batch_selfies_to_flat_hot:

    :param selfies_batch: a list of SELFIES to be converted.
    :param vocab_stoi: a dictionary that maps SELFIES symbols (the keys)
        to a non-negative index. The indices of the dictionary
        must contiguous, starting from 0.
    :param pad_to_len: the length that each SELFIES is be padded to.
        If ``pad_to_len`` is less than or equal to the symbol
        length of the SELFIES, then no padding is added. Defaults to ``-1``.
    :return: the flattened one-hot encoded representations of the SELFIES
        from the batch. This is a 2D list of size
        ``(batch_size, N * len(vocab_stoi))``.

The params passed in right now are selfies_list, largest_selfie_len, selfies_alphabet.

Do you have any ideas for an easy fix? If I'm write the code before it needs to be basically rewritten to fit those new params?

ncfrey commented 3 years ago

It seems the order of the params switched around, so selfies_list -> selfies_batch, largest_selfie_len -> pad_to_len, and selfies_alphabet -> vocab_stoi. I think the types are the same so hopefully you can reorder the inputs and it should work without actually changing them.

alat-rights commented 3 years ago

Hi @ncfrey !

It seems like selfies_alphabet and vocab_stoi behave differently; it appears that vocab_stoi takes in single character representations of atoms while selfies_alphabet is a little more complicated (and is also a list as-is), I was wondering if you had any ideas as to how to approach?

Thanks.

alat-rights commented 3 years ago

image

alat-rights commented 3 years ago

Hi! Sorry for not clarifying the above image.

That was the error I receieved after passing dict(zip(selfies_alphabet, range(len(selfies_alphabet)))) as the vocab_stoi parameter to fix the type mismatch issue.

I was wondering if you anyone has insight here? @ncfrey

ncfrey commented 3 years ago

There is a new example in the SELFIES repo here: https://github.com/aspuru-guzik-group/selfies#label-integer-encoding-selfies that hopefully will be of some help.

It seems that there's an extra step now, after constructing the alphabet you have to do vocab_stoi = {s: i for i, s in enumerate(alphabet)} to get vocab_stoi. Then I think you'll have to specify enc_type='one_hot' to get one hot encodings used in the rest of the tutorial, rather than integer labels.

My apologies I haven't had a chance to test this out myself, but thanks for taking this on! If you can implement a fix and submit a PR, that would be great. It's probably a good idea to pin the selfies version with pip install selfies==1.0.2. I think that the version of selfies I was using at the time wasn't a major release, so that's what's causing the dependency issues.

alat-rights commented 3 years ago

Ok! Should the final call look something like onehots = sf.batch_selfies_to_flat_hot(selfies_list, symbol_to_idx, largest_selfie_len, enc_type='one_hot') or sholud it actually be a call to selfies_to_encoding?

enc_type='one_hot' seems superfluous in a call to batch_selfies_to_flat_hot, right?

alat-rights commented 3 years ago

It looks like calling selfies_to_encoding was the fix! Will update once I get this running and hopefully put it into a PR soon.

alat-rights commented 3 years ago

I don't feel the need to make a new issue for this, but I've ran into another bug down the line on the tutorial:

There was a list index out of range error here:

image

I'm having a bit of trouble reading the code. Do you have any idea as to what's happening?

@ncfrey

alat-rights commented 3 years ago

Hm, fixed that. WIll be in the PR :))

alat-rights commented 3 years ago

Hey @ncfrey I've been able to resolve all of the errors, but there are still some sticking points and weirder problems that I'm trying to iron out.

Would it be possible for us to meet online sometime soon? My availability looks quite similar, and afternoons/evenings eastern time generally work great for me.

ncfrey commented 3 years ago

Yeah for sure! I'll send a calendar invite and feel free to adjust the time if needed.

alat-rights commented 3 years ago

That PR failed horribly, accidentally checked in a Conda installer, but I WILL MAKE ANOTHER ONE, I already have all the necessary changes made on a Colab doc, this issue will be closed soon y'all!