colincsl / LCTM


Use for text sequence labeling #1

Open · bratao opened this issue 8 years ago

bratao commented 8 years ago

@colincsl, this is an awesome project! Very state-of-the-art research, congratulations!

I'm wondering if this project could be used for text sequence labeling. I'm very interested in testing your SC-CRF on text. Do you think this is feasible?

colincsl commented 8 years ago

Glad you're interested! Yes, it should be usable for text. The only requirement is that the input is a matrix of floating point numbers.

How are you planning to quantize the text? Using word2vec as input could work well.
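For reference, a minimal sketch of turning a tokenized sentence into that kind of float matrix with a pretrained word2vec model (the gensim loading call and the embedding file are my own assumptions, not something this repo provides). Whether LCTM wants features x time or the transpose is worth checking against the Suture example:

```python
import numpy as np
from gensim.models import KeyedVectors

# Pretrained embeddings; the path and format are placeholders.
wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def sentence_to_features(tokens, dim=300):
    """Stack one embedding per token into a (dim x T) float matrix."""
    cols = []
    for tok in tokens:
        # Out-of-vocabulary words fall back to a zero vector.
        cols.append(wv[tok] if tok in wv else np.zeros(dim))
    return np.asarray(cols, dtype=np.float64).T

X = sentence_to_features(["John", "lives", "in", "Berlin", "."])
```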

bratao commented 8 years ago

Great, I will try over the weekend and will report the results.

What is the range of the input? It seems to be from -7.0 to 7.0 in the Suture example. I was planning to give a unique int id to each token. My task is entity recognition, and in my case word2vec embeddings are useless because surrounding features, such as punctuation and font, are what discriminate these entities.

Do you have an idea of how I should transform my dataset to use it with LCTM, where each token has a unique id? And what about handcrafted features such as `ends_with_comma=true`? How would you suggest representing them?

Thank you again! I'm using this for a master's project at my university. How would you like me to cite you?

colincsl commented 8 years ago

Yeah, I normalize my input using mean/standard deviation, hence the -7 to 7 range.
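For what it's worth, a sketch of that per-feature z-score normalization (assuming a features x time matrix; flip the axis if yours is transposed). With roughly Gaussian features this keeps values within a few standard deviations of zero, which is where the -7 to 7 range comes from:

```python
import numpy as np

def zscore(X, eps=1e-8):
    """Zero-mean, unit-variance normalization of each feature row."""
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / (sigma + eps)  # eps guards against constant features
```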

To use this you would need to create a large input vector the size of your vocabulary and use a one-hot encoding. I'm not really sure how people in NLP deal with punctuation.
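A sketch of that one-hot idea, with a handcrafted flag like `ends_with_comma` appended as one extra binary row (the vocabulary, feature name, and helper are purely illustrative):

```python
import numpy as np

def encode_tokens(tokens, vocab):
    """One-hot encode tokens and append one extra binary feature row.

    Returns a ((|vocab| + 1) x T) float matrix: one row per vocabulary
    entry, plus a final row for the ends_with_comma flag.
    """
    V, T = len(vocab), len(tokens)
    X = np.zeros((V + 1, T))
    for t, tok in enumerate(tokens):
        idx = vocab.get(tok.rstrip(","))  # out-of-vocabulary tokens stay all-zero
        if idx is not None:
            X[idx, t] = 1.0
        X[V, t] = 1.0 if tok.endswith(",") else 0.0
    return X

vocab = {"John": 0, "lives": 1, "in": 2, "Berlin": 3}
X = encode_tokens(["John", "lives", "in", "Berlin,"], vocab)
```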