EnsemblGSOC / Ensembl-Repeat-Identification

A Deep Learning repository for predicting the location and type of repeat sequence in genome.

d_model parameter #39

Open williamstark01 opened 2 years ago

williamstark01 commented 2 years ago

I think that the d_model parameter (the embedding dimension) should take a significantly larger value than the one currently used. It is usually a multiple of num_heads, which usually takes the value 8, so maybe an initial value of 32 would make sense here? Or is there a specific reason behind using a smaller value for it?
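
For reference, a minimal sketch (not the project code) of the divisibility constraint between d_model and num_heads that PyTorch's nn.MultiheadAttention enforces; the sequence length and batch size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# embed_dim (d_model) must be divisible by num_heads, so d_model = 32 with
# num_heads = 8 works, while d_model = 6 with num_heads = 8 would raise an error
d_model, num_heads, seq_len, batch_size = 32, 8, 2000, 4

attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
x = torch.randn(batch_size, seq_len, d_model)
output, _ = attention(x, x, x)
print(output.shape)  # torch.Size([4, 2000, 32])
```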

williamstark01 commented 2 years ago

This will probably help tackle overfitting (#34)

yangtcai commented 2 years ago

Cool, I will try it. Could I run it on our cluster now? :D

williamstark01 commented 2 years ago

Yes, of course, after you update the dependencies you should be able to submit a training job. Adding some more details on Slack.

yangtcai commented 2 years ago

Hi @williamstark01, the reason why I set d_model = 6 is that we use one-hot encoding for the DNA sequences, so every sequence has the shape [2000, 6]. Is there a way to change it to a larger value than the one currently used?
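
For context, a minimal sketch of how a one-hot encoded sequence ends up with shape [2000, 6]; the exact 6-symbol alphabet used here (A, C, G, T, N, padding) is an assumption for illustration and may differ from the project's encoding:

```python
import torch
import torch.nn.functional as F

alphabet = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "<pad>": 5}
sequence = "ACGTN" * 400  # 2000 bases

# convert each base to its index, then one-hot encode
indices = torch.tensor([alphabet[base] for base in sequence])
one_hot = F.one_hot(indices, num_classes=len(alphabet)).float()
print(one_hot.shape)  # torch.Size([2000, 6])
```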

williamstark01 commented 2 years ago

That's a good question. With label encoding, each single value is converted to a tensor of shape (1, embed_dim) (simply a vector of length embed_dim, a better name for d_model). I haven't used one-hot encoded DNA sequences with transformers before and I'm not sure how they can be converted to embeddings. Maybe label encoding is the best option, but it's probably worth researching a bit to see whether similar projects use another approach with one-hot encoding.
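
To make the comparison concrete, a minimal sketch of label encoding followed by a learnable embedding layer; the vocabulary and embed_dim values are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocabulary = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "<pad>": 5}
embed_dim = 32

# one learnable vector of length embed_dim per token in the vocabulary
token_embedding = nn.Embedding(num_embeddings=len(vocabulary), embedding_dim=embed_dim)

sequence = "ACGT"
label_encoded = torch.tensor([vocabulary[base] for base in sequence])  # shape: (4,)
embeddings = token_embedding(label_encoded)  # shape: (4, 32), one vector per base
print(embeddings.shape)
```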

yangtcai commented 2 years ago

Currently, I just pass the one-hot encoded sequence directly to the transformer model: https://github.com/yangtcai/Ensembl-Repeat-Identification/blob/9a5b7bb21555ae07cbb6e267d3f3c3ba1f6c98da/transformer.py#L64 I also found that a token_embedding was used in your previous project:

def forward(self, x):
    # generate token embeddings
    token_embeddings = self.token_embedding(x)

Should I add this token_embedding to our project?

williamstark01 commented 2 years ago

I thought some more about this, and I'm not sure there is a single correct answer to how we should process the base characters.

Using the bases as tokens with their one-hot encodings directly may work, but we lose the learnable embeddings, which could map the bases to a higher-dimensional space that represents meaningful features. Then again, since we have so few distinct tokens, this may not be consequential.
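
As an aside (an observation, not the project's current approach): passing one-hot vectors through a bias-free linear layer is equivalent to an embedding lookup, so a learnable higher-dimensional representation can still be recovered from one-hot inputs if we want it. The dimensions below are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_bases, d_model, seq_len = 6, 32, 2000

# a linear layer without bias: each one-hot input selects one learned vector
projection = nn.Linear(num_bases, d_model, bias=False)
one_hot_sequence = torch.eye(num_bases)[torch.randint(0, num_bases, (seq_len,))]  # (2000, 6)

projected = projection(one_hot_sequence)  # (2000, 32)
print(projected.shape)
```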

For generating embeddings we also have the option of using n-grams as tokens instead of single bases.
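
A minimal sketch of the n-gram (k-mer) tokenization mentioned above; the choice of k = 3 and the overlapping stride are assumptions for illustration (the vocabulary would then have up to 4^k tokens plus any special symbols):

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into overlapping k-mers."""
    return [sequence[i : i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ACGTAC"))  # ['ACG', 'CGT', 'GTA', 'TAC']
```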

Maybe we can continue using one-hot encodings without embeddings for now, but at some point it's probably worth taking a look at similar projects to get additional insights on this:

- https://github.com/jerryji1993/DNABERT
- https://github.com/jdcla/DNA-transformer
- https://github.com/lucidrains/enformer-pytorch
- https://github.com/lucidrains/tf-bind-transformer
- https://github.com/Rvbens/non-coding-DNA-classifier

yangtcai commented 2 years ago

Ok, I will check it.