IndicoDataSolutions / finetune

Scikit-learn style model finetuning for NLP
https://finetune.indico.io
Mozilla Public License 2.0
700 stars, 81 forks

Numerical features #63

Closed mikkelam closed 5 years ago

mikkelam commented 6 years ago

Sorry if this is a stupid question.

I'm curious whether it's possible to use numerical features with this model. The documentation says that X should be an array of text.

Thanks for your time, and really nice project!

benleetownsend commented 6 years ago

The model has to take text as input, because it uses pre-trained weights that are highly specific to the vocabulary and tokenization method. If you meant numerical feature output instead, the .featurize method returns a sequence embedding.
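For illustration, here is a minimal sketch of the featurize route. It assumes only that .featurize takes a list of texts and returns one fixed-length numeric vector per text. StandInModel and embed are hypothetical names, and the toy "embedding" exists purely so the snippet runs without the library or its pre-trained weights; it is not what finetune produces.

```python
from typing import List, Sequence


class StandInModel:
    """Hypothetical placeholder for a fitted model.

    Only the shape of the interface is assumed: featurize(texts) returns
    one fixed-length numeric vector per input text.
    """

    def featurize(self, texts: Sequence[str]) -> List[List[float]]:
        # Toy embedding: [character count, word count, digit count].
        return [
            [
                float(len(t)),
                float(len(t.split())),
                float(sum(c.isdigit() for c in t)),
            ]
            for t in texts
        ]


def embed(model, texts: Sequence[str]) -> List[List[float]]:
    """Turn raw text into numeric feature vectors via model.featurize."""
    vectors = [[float(x) for x in v] for v in model.featurize(list(texts))]
    if len(vectors) != len(texts):
        raise ValueError("expected one vector per input text")
    return vectors


texts = ["I would like to offer $5000 for your item", "No thanks"]
vectors = embed(StandInModel(), texts)
```

With a real fitted model in place of StandInModel, the resulting vectors could be passed to any downstream estimator that accepts numeric input.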

If you'd like to outline a bit more about how you want to use this system, I'm happy to discuss whether it would be a good fit and how you might go about achieving your goal.

mikkelam commented 6 years ago

Right, I'm not sure I understand the featurize function correctly.

I'm working with negotiations happening over text, and I'm trying to classify that text into various classes.

example:

"I would like to offer $5000 for your item", with the label: offer

However, there are features other than the text that I'd like to include, such as currency=USD and offer_amount=5000 (which I extract using named entity recognition). There could also be other features, such as the time the message was sent, as a timestamp.

I'm currently using basic NLP techniques to solve it, but I'd like to try out a state-of-the-art ULM such as your library.

benleetownsend commented 6 years ago

You have a few options for this. First, I'd say your best bet is to start by classifying end-to-end without the additional features, especially if these features were originally extracted from the raw text. See the Stanford sentiment example in finetune/datasets for how to do this.

If you believe it is absolutely necessary to include these features, you have a few options. They are potentially quite involved, as this is something we do not currently support.

One option would be to route the raw additional features through to the classifier module. If this is something you are interested in, we can discuss how you would do it.
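To make the idea concrete, here is one possible shape for it, independent of the codebase and not something the library supports: encode the extra fields of each message as a numeric row and append that row to the text vector before the classifier sees it. The field names currency and offer_amount come from the example earlier in the thread. The helper names, the currency list, and the log scaling are assumptions made for illustration.

```python
import math
from typing import Any, Dict, List, Sequence

# Assumed currency vocabulary, for illustration only.
CURRENCIES = ("USD", "EUR", "GBP")


def encode_extras(extras: Dict[str, Any]) -> List[float]:
    """Encode the side features of one message as a numeric row."""
    amount = float(extras.get("offer_amount", 0.0))
    # log1p keeps large offers from dominating the other inputs.
    row = [math.log1p(max(amount, 0.0))]
    # One-hot encode the currency; unknown or missing maps to all zeros.
    row += [1.0 if extras.get("currency") == c else 0.0 for c in CURRENCIES]
    return row


def join_features(
    text_vectors: Sequence[Sequence[float]],
    extras_list: Sequence[Dict[str, Any]],
) -> List[List[float]]:
    """Append each message's encoded side features to its text vector."""
    if len(text_vectors) != len(extras_list):
        raise ValueError("need one extras dict per text vector")
    return [
        list(vec) + encode_extras(extras)
        for vec, extras in zip(text_vectors, extras_list)
    ]


# Stand-in text vectors; in practice these would come from the model.
text_vectors = [[0.1, 0.2], [0.3, 0.4]]
extras_list = [{"currency": "USD", "offer_amount": 5000}, {}]
combined = join_features(text_vectors, extras_list)
```

Each combined row is the text vector followed by the scaled amount and the currency indicators, so the classifier input width grows by 1 + len(CURRENCIES).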

Another option that is less involved in our codebase but requires more work in general would be as follows:

Thank you for the interest in this project.

benleetownsend commented 6 years ago

@mikkelam Wondering how you're getting on with this?

mikkelam commented 6 years ago

Hi benleetownsend,

I'm actually on vacation, but thank you for that long reply. I like your first approach and believe that is what I'll be doing. As you said, though, I can start by testing without these additional features; they might not be necessary, though I suspect otherwise.

madisonmay commented 6 years ago

@mikkelam did you ever get a chance to try this out?

madisonmay commented 5 years ago

Closing for now. Feel free to re-open @mikkelam if you have further questions.