bedapudi6788 / deepcorrect

Text and Punctuation correction with Deep Learning
GNU General Public License v3.0
129 stars 33 forks

Input and Output Length Doesn't Remain Same #11

Open nareshmungpara opened 4 years ago

nareshmungpara commented 4 years ago

[screenshot]

bedapudi6788 commented 4 years ago

@nareshmungpara what do you mean?

nareshmungpara commented 4 years ago

When we segment the text and then pass the segmented sentences to auto-punctuation, the length of the input text given to segmentation is not the same as the combined length of the autopunct output.

As the screenshot above shows, the last two lines are the character counts before and after.

bedapudi6788 commented 4 years ago

I still don't understand the issue. Can you give a simpler example or try to explain it in detail?

nareshmungpara commented 4 years ago

import os
from deepsegment import DeepSegment

# `corrector` and `chunkstring` are defined elsewhere and not shown here.

def correct_sentence__test():
    segmenter = DeepSegment('en')
    chunk_size = 200  # pass at most 200 characters to the corrector at a time
    for txt_file in os.listdir('input'):
        output_data = ''
        with open('input/{}'.format(txt_file), 'r') as f:
            data = f.read()

        seg_data_arr = segmenter.segment_long(data)
        for seg_data in seg_data_arr:
            if len(seg_data) > chunk_size:
                # segment is longer than chunk_size: correct it piece by piece
                for chunk_str in chunkstring(seg_data, chunk_size):
                    punt_data = corrector.correct(chunk_str)[0]['sequence']  # get corrected output
                    output_data += punt_data
            else:
                punt_data = corrector.correct(seg_data)[0]['sequence']  # get corrected output
                output_data += punt_data
        print("Input Data:- ", data)
        print("Output Data:- ", output_data)
        print("Length of Input Data:- ", len(data))
        print("Lenght of Output Data:- ", len(output_data))

Please have a look at the code above. The following is the output I am getting.

Input Data:-
hello everyone and welcome this is the management team review of the post cap editor we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again Dorsey another demonstration so let's begin we'll start with the overview first we're going to review the project and phase one objectives then we'll review the four major elements of phase one milestone one that's what we've been working on the past few weeks

Output Data:-
Hello, everyone and welcome.This is the management team review of the post cap editor.We decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anyted to hear it explained agai.N.Dorsey another demonstration.So let's begin.We'll start with the overview first.We're going to review the project and phase one objectives.Then we'll review the four major elements of phase one milestone one.That's what we've been working on the past few weeks.

Length of Input Data:- 547
Lenght of Output Data:- 520
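
The `agai.N.` fragment in the output above is consistent with a fixed-width split of the 201-character segment at `chunk_size=200`. `chunkstring` is not shown in this thread, so the version below is only an assumed implementation (a plain slice every `length` characters); `chunk_on_words` is a hypothetical alternative built on the standard-library `textwrap.wrap`. A minimal sketch:

```python
import textwrap


def chunkstring(text, length):
    """Assumed implementation: cut every `length` characters, ignoring word boundaries."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def chunk_on_words(text, length):
    """Hypothetical alternative: break only at whitespace so no word is cut in two."""
    return textwrap.wrap(text, width=length, break_long_words=False)


# A made-up 201-character segment ending in "again", one character over the limit.
segment = "hello " * 32 + "and again"

fixed = chunkstring(segment, 200)
print([len(c) for c in fixed], repr(fixed[-1]))

by_word = chunk_on_words(segment, 200)
print([len(c) for c in by_word], repr(by_word[-1]))
```

With the fixed-width split, the last chunk is the single letter `n`, which the corrector would then treat as its own sentence. The word-boundary version keeps `again` whole, but note that `textwrap.wrap` drops the space at each break, so character counts still will not match the input exactly.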

nareshmungpara commented 4 years ago

While debugging the code today, I found that segmentation is working fine, but the result is still not the same as what the demo shows.

Input Text:-

hello everyone and welcome this is the management team review of the post cap editor we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again Dorsey another demonstration so let's begin we'll start with the overview first we're going to review the project and phase one objectives then we'll review the four major elements of phase one milestone one that's what we've been working on the past few weeks

Output Text Of Demo:-

Hello, everyone and welcome. This is the management team review of the post cap editor. We decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth. If you wanted to hear, it explained again. Dorsey, another demonstration, so let's begin. We'll start with the overview first. We're going to review the project and phase one objectives, then we'll review the four major elements of phase one milestone one. That's what we've been working on the past few weeks.

Output Segmentation In My code:-

hello everyone and welcome.this is the management team review of the post cap editor.we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again.Dorsey another demonstration.so let's begin.we'll start with the overview first.we're going to review the project and phase one objectives.then we'll review the four major elements of phase one milestone one.that's what we've been working on the past few weeks

Output Segmentation and Punctuation In My code:-

Code Which I am using:-

import os
from deepsegment import DeepSegment

# `corrector` and `chunkstring` are defined elsewhere and not shown here.

def correct_sentence__test__():
    # segmenter = DeepSegment('en', tf_serving=True)
    segmenter = DeepSegment('en')

    chunk_size = 600  # pass at most 600 characters to the corrector at a time
    for txt_file in os.listdir('input'):
        output_data = ''
        with open('input/{}'.format(txt_file), 'r') as f:
            data = f.read()

        seg_data_arr = segmenter.segment_long(data, 10)
        # seg_data_arr = [segmenter.segment_long(ele, 1) for ele in seg_data_arr]
        for ele in seg_data_arr:
            print([ele])
        print("Input Data:-\n", [data])
        print("\n")
        print("Length of Input Data:- ", len(data))
        print("\n")
        print("Seg Data OP:-\n", ['.'.join(seg_data_arr)])
        print("\n")
        print("Seg Data OP Len:-", len('.'.join(seg_data_arr)))
        print("\n")
        len_count = 0  # running total of corrected-output lengths
        for seg_data in seg_data_arr:
            if len(seg_data) > chunk_size:
                for chunk_str in chunkstring(seg_data, chunk_size):
                    punt_data = corrector.correct(chunk_str, 98)[0]['sequence']  # get corrected output
                    print("Seg Data:-", [chunk_str], len(chunk_str))
                    print("Punct Data:-", [punt_data], len(punt_data))
                    len_count += len(punt_data)
                    output_data += punt_data
            else:
                punt_data = corrector.correct(seg_data, 98)[0]['sequence']  # get corrected output
                print("Seg Data:-", [seg_data], len(seg_data))
                print("Punct Data:-", [punt_data], len(punt_data))
                len_count += len(punt_data)
                output_data += punt_data
        print("Len Punt Count:-\n", len_count)
        print("Input Data:-\n", data)
        print("Output Data:-\n", output_data)
        print("Length of Input Data:-\n", len(data))
        print("Lenght of Output Data:-\n", len(output_data))

Here is the entire output that I am getting:-

WARNING:root:Consider using segment_long for longer sentences.
WARNING:root:Consider using segment_long for longer sentences.
['hello everyone and welcome']
['this is the management team review of the post cap editor']
['we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again']
['Dorsey another demonstration']
["so let's begin"]
["we'll start with the overview first"]
["we're going to review the project and phase one objectives"]
["then we'll review the four major elements of phase one milestone one"]
["that's what we've been working on the past few weeks"]

Input Data:-
["hello everyone and welcome this is the management team review of the post cap editor we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again Dorsey another demonstration so let's begin we'll start with the overview first we're going to review the project and phase one objectives then we'll review the four major elements of phase one milestone one that's what we've been working on the past few weeks"]

Length of Input Data:- 547

Seg Data OP:-
["hello everyone and welcome.this is the management team review of the post cap editor.we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again.Dorsey another demonstration.so let's begin.we'll start with the overview first.we're going to review the project and phase one objectives.then we'll review the four major elements of phase one milestone one.that's what we've been working on the past few weeks"]

Seg Data OP Len:- 547

Seg Data:- ['hello everyone and welcome'] 26
Punct Data:- ['Hello, everyone and welcome.'] 28
Seg Data:- ['this is the management team review of the post cap editor'] 57
Punct Data:- ['This is the management team review of the post cap editor.'] 58
Seg Data:- ['we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again'] 201
Punct Data:- ['We decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anyted to hear it explained again.'] 172
Seg Data:- ['Dorsey another demonstration'] 28
Punct Data:- ['Dorsey another demonstration.'] 29
Seg Data:- ["so let's begin"] 14
Punct Data:- ["So let's begin."] 15
Seg Data:- ["we'll start with the overview first"] 35
Punct Data:- ["We'll start with the overview first."] 36
Seg Data:- ["we're going to review the project and phase one objectives"] 58
Punct Data:- ["We're going to review the project and phase one objectives."] 59
Seg Data:- ["then we'll review the four major elements of phase one milestone one"] 68
Punct Data:- ["Then we'll review the four major elements of phase one milestone one."] 69
Seg Data:- ["that's what we've been working on the past few weeks"] 52
Punct Data:- ["That's what we've been working on the past few weeks."] 53

Len Punt Count:- 519

Input Data:-
hello everyone and welcome this is the management team review of the post cap editor we decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anything more in depth if you wanted to hear it explained again Dorsey another demonstration so let's begin we'll start with the overview first we're going to review the project and phase one objectives then we'll review the four major elements of phase one milestone one that's what we've been working on the past few weeks

Output Data:-
Hello, everyone and welcome.This is the management team review of the post cap editor.We decided that it would be best to make a video of this so that you could review at your leisure and also be able to go back and look at anyted to hear it explained again.Dorsey another demonstration.So let's begin.We'll start with the overview first.We're going to review the project and phase one objectives.Then we'll review the four major elements of phase one milestone one.That's what we've been working on the past few weeks.

Length of Input Data:- 547
Lenght of Output Data:- 519
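
To pinpoint which characters the corrector dropped, rather than only comparing total lengths, each segment can be diffed against its corrected output. The helper below, `dropped_spans`, is a hypothetical name and not part of deepcorrect; it uses the standard-library `difflib.SequenceMatcher` and compares case-insensitively because the corrector capitalizes sentence starts. The two strings are shortened from the 201-character segment in the log above. A sketch under those assumptions:

```python
import difflib


def dropped_spans(original, corrected):
    """Return substrings of `original` that have no counterpart in `corrected`."""
    matcher = difflib.SequenceMatcher(
        None, original.lower(), corrected.lower(), autojunk=False
    )
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'delete' and 'replace' mark text present in the input but not in the output;
        # 'insert' (added punctuation) is ignored.
        if tag in ("delete", "replace"):
            spans.append(original[i1:i2])
    return spans


seg = "look at anything more in depth if you wanted to hear it explained again"
out = "Look at anyted to hear it explained again."

for span in dropped_spans(seg, out):
    print(len(span), repr(span))
```

For this pair the diff reports a single dropped span of 30 characters, which accounts for most of the gap between the 201-character segment and its 172-character output (the added full stop makes up the rest).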