UKPLab / framenet-tools

Annotate text with FrameNet frames and arguments.
Apache License 2.0

Add sklearn example #7

Open jcklie opened 5 years ago

jcklie commented 5 years ago

Is your feature request related to a problem? Please describe.

The API of this tool should be compatible with sklearn. It would be nice to document how to use the two together.

Describe the solution you'd like

Add an example using e.g. cross validation, parameter grid search or pipelining.

AMarkard commented 5 years ago

Implementing the sklearn API for the neural networks turned out to be more difficult than expected, because "fit" requires X and y separately rather than the torchtext.data.Iterator that is currently in use. skorch provides a nice solution: it wraps the PyTorch network, and its SliceDataset solves the Iterator issue. So I wrote a complete wrapper class which uses skorch to wrap the neural networks and adapts them to the sklearn API as well as to the project structure.

After that, the next issue came up. The network averages each sentence internally, which requires the data not to be padded, and the unpadded data causes problems with sklearn. The data types that are used are also not supported by sklearn. As a collaborator of the skorch project states, the problem lies in our data structure and the way sklearn handles the data:

"Getting pytorch Datasets to work with GridSearchCV is not trivially possible. The problem is that eventually, the Dataset leaves the skorch domain and is handled directly by sklearn. sklearn only works with a couple of data types (ndarray, scipy sparse, pandas DataFrame), so you will encounter an error sooner or later." (https://github.com/skorch-dev/skorch/issues/212)

In conclusion, in order to use sklearn, the data handling needs to be completely restructured.
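For illustration only, here is a minimal sketch of one possible direction for the padding conflict described above. It was not tried in this thread and is not the project's code. The idea is to pad every sentence to a common length, so the batch becomes rectangular and could be stored in an ndarray, and to carry the true lengths along so that the average ignores the padding. The function names (pad_batch, masked_mean, plain_mean) and the toy vectors are hypothetical. Plain Python lists stand in for tensors, and every sentence is assumed to contain at least one token.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pad_batch(
    sentences: Sequence[Sequence[Vector]], dim: int
) -> Tuple[List[List[Vector]], List[int]]:
    """Pad each sentence with zero vectors up to the longest sentence.

    Returns the rectangular batch and the true (unpadded) lengths.
    """
    lengths = [len(s) for s in sentences]
    max_len = max(lengths)
    padded = [
        [list(tok) for tok in s] + [[0.0] * dim for _ in range(max_len - len(s))]
        for s in sentences
    ]
    return padded, lengths


def masked_mean(padded: Sequence[Sequence[Vector]], lengths: Sequence[int]) -> List[Vector]:
    """Average token vectors per sentence, using only the first `n` real tokens."""
    means = []
    for sent, n in zip(padded, lengths):
        dim = len(sent[0])
        means.append([sum(tok[d] for tok in sent[:n]) / n for d in range(dim)])
    return means


def plain_mean(sentence: Sequence[Vector]) -> Vector:
    """Reference: average over an unpadded sentence."""
    dim = len(sentence[0])
    return [sum(tok[d] for tok in sentence) / len(sentence) for d in range(dim)]


# Toy batch of 2-dimensional "embeddings" (made-up numbers).
batch = [
    [[1.0, 2.0], [3.0, 4.0]],              # 2 tokens
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # 3 tokens
]
padded, lengths = pad_batch(batch, dim=2)

# The masked average over the padded batch matches the unpadded average.
print(masked_mean(padded, lengths))
print([plain_mean(s) for s in batch])
```

This addresses only the averaging side of the problem. Whether such a padded batch can actually be passed through skorch and GridSearchCV in this project would still depend on the data handling being restructured as described above.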

jcklie commented 5 years ago

What happens if you replace torchtext with flair for the embeddings and just use pytorch datasets?

AMarkard commented 5 years ago

Flair sadly comes with other drawbacks for our system. It also seems to be very slow, at least for such large amounts of data, and Nvidia Apex does NOT yield the improvement needed. After some investigation, the bottleneck seems to lie in the structure itself: our system needs pairs of (embedded) sentences and frames, but Flair requires each sentence to be wrapped in a "Sentence" object. The usage therefore has to be as follows: sentence -> Flair's Sentence object -> embedding -> taking the embeddings out of the Sentence object -> dropping the object. (Compare old repo #25)