Text Part Extraction using Synaptic

gelinger777 commented 7 years ago

Hi there, I am very new to deep learning and AI, and I'd like to ask the following. Is it possible to train Synaptic so that, in the end, you feed it the text of a document and it finds and extracts specific expressions from that text? The expressions are law articles, together with the article and paragraph numbers. Can anybody tell me whether this is possible and in which direction to dig?

Jabher commented 7 years ago

Are you sure you need NNs for this? You probably want a regex for that.

gelinger777 commented 7 years ago

I am not quite sure. The problem is that there are very many ways in which law paragraphs are referenced in documents. The text also comes from an image -> OCR -> text process and can sometimes contain typos. So I thought a neural network would be better at detecting those than a regex. What's your opinion on it?

Jabher commented 7 years ago

First of all, are you sure you want to keep the typos in the output too?

If you actually want to use NNs, you should answer one question first: "what is the function I'm trying to predict?" And then try to simplify it: imagine you're the developer implementing it and you're trying to make this function as tiny as possible, as you're lazy, right? :) (Actually, it's more about keeping the training as small as possible.)

So, you probably want some function that accepts a string and returns whether that string contains the required value. You split the text into chunks of, IDK, 1 word (or however you split it), and for each chunk you return whether it contains the thing you wanted. So you return true or false (for an NN it will be a probability, from 0 to 1. Going forward, you will usually want binary cross-entropy as the loss for that, but RMSE is fine too).
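To make the chunk-and-score idea a bit more concrete, here is a minimal sketch in plain JavaScript. The whitespace split, the example sentence, its labels, and all function names are assumptions made up for illustration; nothing here is Synaptic's API.

```javascript
// Split a document into word-level chunks.
// Whitespace tokenization is an assumption; any other splitting rule works too.
function splitIntoChunks(text) {
  return text.split(/\s+/).filter((chunk) => chunk.length > 0);
}

// Binary cross-entropy for one predicted probability p against a label y (0 or 1).
// The clamp keeps Math.log away from exactly 0 or 1.
function binaryCrossEntropy(p, y) {
  const eps = 1e-12;
  const clamped = Math.min(Math.max(p, eps), 1 - eps);
  return -(y * Math.log(clamped) + (1 - y) * Math.log(1 - clamped));
}

// Squared error for the same pair, for comparison.
function squaredError(p, y) {
  return (p - y) ** 2;
}

// Hypothetical labelled example: 1 marks chunks that belong to a law reference.
const chunks = splitIntoChunks("see Art. 12 para. 3 of the Code");
const labels = [0, 1, 1, 1, 1, 0, 0, 0];
```

Each chunk gets one target label, and the network's output for that chunk is compared with the label through one of the two loss functions. Cross-entropy punishes confident wrong answers much harder than squared error does, which is one reason it is the usual pick for yes/no outputs.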

Then you need to think about how to feed the data into it. You need to pass an array of numbers, and using ASCII codes seems legit, but actually it's not (raw codes imply an ordering and a distance between characters that mean nothing). You need one-hot encoding, which means you take the alphabet you use (a-z, A-Z, 0-9, space, dot, etc.) and map it like {'a': [1, 0, 0...], 'b': [0, 1, 0...], ...}. I think you get the idea.
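A small sketch of that mapping in plain JavaScript. The exact alphabet and the choice to encode unknown characters as all zeros are assumptions; OCR output will likely need more symbols than this.

```javascript
// Assumed alphabet; extend it with whatever characters the OCR output can contain.
const ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .";

// Returns a vector of ALPHABET.length zeros with a single 1 at the character's index.
// Characters outside the alphabet become an all-zero vector (a design assumption).
function oneHot(char) {
  const vector = new Array(ALPHABET.length).fill(0);
  const index = ALPHABET.indexOf(char);
  if (index !== -1) vector[index] = 1;
  return vector;
}

// Encodes a string as one vector per character.
function encodeString(text) {
  return Array.from(text).map(oneHot);
}
```

With this alphabet every character becomes 64 numbers, so `encodeString("Art. 12")` yields 7 vectors of length 64.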

Then, since your input does not have a fixed length, you will probably need to either decide on a max length (and feed padded words) OR use a recurrent network that accepts one "char" at a time and can swallow as much as you need.
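The two options could look roughly like this. It is only a sketch: the lowercase-only character set, the max length of 8, and the function names are all illustrative assumptions.

```javascript
// Assumed character set and maximum chunk length; both are illustrative choices.
const CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789 .";
const MAX_LENGTH = 8;

// One-hot vector for a single character (lowercased); unknown characters give all zeros.
function charToVector(char) {
  const vector = new Array(CHARSET.length).fill(0);
  const index = CHARSET.indexOf(char.toLowerCase());
  if (index !== -1) vector[index] = 1;
  return vector;
}

// Option 1: fixed-size input. Truncate or pad with all-zero vectors, then flatten,
// so every chunk becomes MAX_LENGTH * CHARSET.length numbers.
function toFixedInput(chunk) {
  const vectors = Array.from(chunk).slice(0, MAX_LENGTH).map(charToVector);
  while (vectors.length < MAX_LENGTH) {
    vectors.push(new Array(CHARSET.length).fill(0));
  }
  return vectors.flat();
}

// Option 2: sequence input for a recurrent network. One vector per character,
// meant to be fed to the network one step at a time; no padding required.
function toSequenceInput(chunk) {
  return Array.from(chunk).map(charToVector);
}
```

The fixed-size variant suits a plain feed-forward network whose input layer has `MAX_LENGTH * CHARSET.length` neurons, at the cost of truncating longer chunks. The sequence variant keeps everything but needs a recurrent architecture with `CHARSET.length` inputs.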

After that you will get your probabilities for the answers, but! There is one more thing to do, now outside of Synaptic. You will need to pick the optimal threshold. Well, you can simply take 0.5, but since you will have an unbalanced dataset (1,000 "correct" entries and 100,000 "incorrect" ones), the best threshold may be shifted. You will need to build a ROC curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and pick the threshold from it, for example the point that maximizes the true positive rate minus the false positive rate. The AUC (area under that curve) tells you how well the model separates the two classes overall; it does not give you a threshold by itself.
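A rough sketch of the threshold step in plain JavaScript. The scores and labels are made-up example data, and maximizing TPR minus FPR (Youden's J) is just one common criterion, not the only way to choose.

```javascript
// Builds ROC points by sweeping the threshold over every distinct score.
// scores: predicted probabilities; labels: 1 for positive, 0 for negative.
function rocCurve(scores, labels) {
  const positives = labels.filter((y) => y === 1).length;
  const negatives = labels.length - positives;
  const thresholds = [...new Set(scores)].sort((a, b) => b - a);
  const points = [{ threshold: Infinity, tpr: 0, fpr: 0 }];
  for (const threshold of thresholds) {
    let tp = 0;
    let fp = 0;
    scores.forEach((score, i) => {
      if (score >= threshold) {
        if (labels[i] === 1) tp += 1;
        else fp += 1;
      }
    });
    points.push({ threshold, tpr: tp / positives, fpr: fp / negatives });
  }
  return points;
}

// Area under the ROC curve by the trapezoidal rule.
function auc(points) {
  let area = 0;
  for (let i = 1; i < points.length; i++) {
    const width = points[i].fpr - points[i - 1].fpr;
    area += (width * (points[i].tpr + points[i - 1].tpr)) / 2;
  }
  return area;
}

// Picks the threshold that maximizes Youden's J = TPR - FPR.
function bestThreshold(points) {
  return points.reduce((best, p) => (p.tpr - p.fpr > best.tpr - best.fpr ? p : best)).threshold;
}

// Made-up example: network outputs for six chunks and their true labels.
const exampleScores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1];
const exampleLabels = [1, 1, 1, 0, 1, 0];
const examplePoints = rocCurve(exampleScores, exampleLabels);
```

Run this on a held-out validation set rather than on the training data, otherwise the chosen threshold will look better than it really is.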

It looks scary in text, but in code it actually looks quite simple.

Jabher commented 7 years ago

Out of respect for the other participants of this community, please use English.

Feel free to write to me by email: vsevolod.rodionov@gmail.com (Russian is fine, but English is the main language for this repo).

gelinger777 commented 7 years ago

OK, I have removed my comment in Russian. I will write to you. Thank you very much.

cazala / synaptic

Text Part Extraction using Synaptic #212