Sicos1977 / IFilterTextReader

A reader that gets text from different file formats through the IFilter interface

TextReader not recognizing line breaks in .docx File #30

Closed DHoeschele closed 6 years ago

DHoeschele commented 6 years ago

Hi, I'm not sure if this is a problem with IFilterTextReader or the Windows IFilter. I have a docx file with these lines:

FullText Search versus ElasticSearch Extracting words from MS files and PDFs Use IFilters to extract text for ElasticSearch This is the end

The docx file is attached. Test IFilter.docx

This is what FilterReader.ReadToEnd() returns:

"FullText" & vbLf & " Search versus ElasticSearchExtractin" & vbLf & "g words from MS files and PDFsUse IFilters to extract text for ElasticSearch This is the end" & vbLf

It seems the vbLf's are in the wrong place, and ElasticSearchExtracting should be split into two words.

I'm running Windows 10 and Visual Studio 2017.

Thanks for your help. Dave

Sicos1977 commented 6 years ago

iFilters are meant for indexing; there is no option to get exactly the same text as it appears in a Word document. A Word document works with formatting such as paragraphs, line breaks, etc., and an iFilter doesn't.

I checked the iFilter, and it gives me the chunks exactly as you describe them.

DHoeschele commented 6 years ago

Hi Kees, thank you for the reply. I understand your point. I don’t really care about line breaks, just the words. But I don’t think indexing is very useful when it returns

" Search versus ElasticSearchExtractin" & vbLf & "g words

when the input was

FullText Search versus ElasticSearch Extracting words

Thanks, Dave
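
To make the indexing concern concrete, here is a minimal sketch in Python (not part of IFilterTextReader) that tokenizes the text from the document and the string reported above and compares the resulting word sets. The `tokens` helper and its simple `\w+` rule are assumptions standing in for whatever tokenizer a real indexer would use; the two strings are copied from this thread, with `vbLf` written as `\n`.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens (hypothetical stand-in for an indexer)."""
    return re.findall(r"\w+", text.lower())


# Text as it appears in the .docx file, per the report above.
expected = (
    "FullText Search versus ElasticSearch "
    "Extracting words from MS files and PDFs "
    "Use IFilters to extract text for ElasticSearch "
    "This is the end"
)

# String reported as the ReadToEnd() result, with vbLf written as "\n".
returned = (
    "FullText\n Search versus ElasticSearchExtractin\n"
    "g words from MS files and PDFsUse IFilters to extract text "
    "for ElasticSearch This is the end\n"
)

expected_tokens = tokens(expected)
returned_tokens = tokens(returned)

# Words a search would no longer find, and junk terms that were indexed instead.
missing = sorted(set(expected_tokens) - set(returned_tokens))
spurious = sorted(set(returned_tokens) - set(expected_tokens))

print("missing: ", missing)   # ['extracting', 'pdfs', 'use']
print("spurious:", spurious)  # ['elasticsearchextractin', 'g', 'pdfsuse']
```

The misplaced line break after "FullText" is harmless to a word-based index. The fused and split words are what actually cost search hits.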


Sicos1977 commented 6 years ago

The iFilter itself returns it that way. IFilterTextReader just passes on what the Windows iFilter returns. I do some cleanup in the code, but nothing that would produce " Search versus ElasticSearchExtractin" & vbLf & "g words

DHoeschele commented 6 years ago

OK, thanks. I’ll ask Microsoft about it, though I’m not expecting much of an answer from them. Thanks for your time. You have developed a nice product. Dave
