bhoov / exbert

A Visual Analysis Tool to Explore Learned Representations in Transformer Models
http://exbert.net
Apache License 2.0

Compatibility with transformers trained on non-language sequential data #7

Closed violetguos closed 4 years ago

violetguos commented 4 years ago

Hi all,

I am training a transformer model to predict chemical reactions from chemical molecules in string representation.

Does your project support plug-n-chug for models not trained on languages?

If not, any pointers on how I should proceed?

bhoov commented 4 years ago

Unfortunately this codebase only supports language data right now, and it will certainly not be plug-n-chug for chemical reactions. This application area has been brought to our attention, though, and we are hoping to make the interface more generalizable.

If you would like to plow ahead and implement this yourself, there are several moving parts that will need to be modified:

  1. Work from the hoo/transformers branch, as it contains the most recent code.
  2. The aligner module will need to support your custom tokenizer for chemical reaction strings.
  3. References to spacy, which is the library used to annotate language data (e.g., with part of speech/entity information), will need to be replaced with your own annotator for chemical strings, or you can choose to strip this feature out of the system entirely.
  4. The HDF5 data structure currently has attributes for the annotations mentioned above. This data structure will need to be modified for your use case.
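For step 2, here is a minimal sketch of what a chemistry-aware tokenizer might look like, assuming your string representation is SMILES. The regex is a commonly used SMILES token pattern, and `tokenize_smiles` is a hypothetical helper for illustration, not part of exbert's API:

```python
import re

# Regex covering the main SMILES token classes: bracket atoms like [OH-],
# two-letter elements (Br, Cl), organic-subset atoms, aromatic atoms,
# bonds, branches, ring-closure digits, and %-escaped ring numbers.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokens should reassemble into the original string,
    # otherwise the input contained characters the pattern doesn't cover.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens
```

For example, `tokenize_smiles("CC(=O)O")` (acetic acid) yields `['C', 'C', '(', '=', 'O', ')', 'O']`, keeping multi-character atoms like `Br` as single tokens. Something of this shape would slot in where the aligner currently calls out to spacy, with the part-of-speech/entity annotations either stubbed out or replaced by chemical metadata of your own.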

These are all backend changes. You should only need to change the frontend if:

Hope this helps!

violetguos commented 4 years ago

Thank you for the quick reply! I'll fork this and see what I can do!