Sicos1977 / IFilterTextReader

A reader that gets text from different file formats through the IFilter interface

Cannot read text from .xls file #19

Closed by andreas-eriksson 7 years ago

andreas-eriksson commented 7 years ago

Hi, I get the following error when I try to read text from an old Excel file (.xls).

at IFilterTextReader.NativeMethods.IPersistStream.Load(IStream pStm)
at IFilterTextReader.FilterLoader.LoadAndInitIFilter(Stream stream, String extension, Boolean disableEmbeddedContent, String fileName, Boolean readIntoMemory) in C:\Git\IFilterTextReader\IFilterTextReader\FilterLoader.cs:line 160
at IFilterTextReader.FilterReader..ctor(String fileName, String extension, Boolean disableEmbeddedContent, Boolean includeProperties, Boolean readIntoMemory, FilterReaderTimeout filterReaderTimeout, Int32 timeout) in C:\Git\IFilterTextReader\IFilterTextReader\FilterReader.cs:line 201
at IFilterTextViewer.MainForm.SelectButton_Click(Object sender, EventArgs e) in C:\Git\IFilterTextReader\IFilterTextViewer\MainForm.cs:line 139
Exception from HRESULT: 0x8004170C

Is there anything I can do to make it work?
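
The HRESULT in the trace above can be split into its standard bit fields to see where the error comes from. The following is a minimal sketch in Python, not part of IFilterTextReader; the function name `decode_hresult` is hypothetical, and the name table reflects the filter error codes as recalled from the Windows SDK header `filterr.h`, so verify them against the header before relying on them.

```python
# Assumption: names as recalled from the Windows SDK header filterr.h.
KNOWN_FILTER_ERRORS = {
    0x8004170B: "FILTER_E_PASSWORD",
    0x8004170C: "FILTER_E_UNKNOWNFORMAT",
}


def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into severity, facility and code."""
    # .NET reports HRESULTs as signed 32-bit integers, so normalise first.
    hresult &= 0xFFFFFFFF
    return {
        "failure": bool(hresult >> 31),        # bit 31: 1 means failure
        "facility": (hresult >> 16) & 0x7FF,   # bits 16-26: facility (4 = FACILITY_ITF)
        "code": hresult & 0xFFFF,              # bits 0-15: error code
        "name": KNOWN_FILTER_ERRORS.get(hresult, "unknown"),
    }


if __name__ == "__main__":
    print(decode_hresult(0x8004170C))
```

If the name table is right, the filter that was loaded for the .xls extension did not recognise the file's format, which points at the filter registered for the extension rather than at the calling code.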

Sicos1977 commented 7 years ago

I saw that you fixed it yourself 👍 ... can you tell me what the problem was?

Greetings, Kees van Spelde

Sicos1977 commented 7 years ago

I published a new version (1.5.3) to NuGet.

andreas-eriksson commented 7 years ago

I'm not sure why the error occurs. Some suggestions I found indicate that the installed filters could be corrupt, but it happens on my test machine as well.

I am hoping that the fix will make the code work for a few more legacy formats.

Thanks :)

Sicos1977 commented 7 years ago

Is it possible to send me the old xls file so that I can investigate it some more? If so, please send it to sicos2002@hotmail.com

Also, if you want to do really advanced things with extracting data from files, have a look at Tika (https://tika.apache.org/). There is also a .NET port that is generated with IKVM (https://github.com/KevM/tikaondotnet). It's not that IFilters aren't any good, but Tika supports a wider range of file formats. I have to do everything myself for the IFilters, while Tika has an Apache team with more developers behind it. It's just a time management problem :-)

andreas-eriksson commented 7 years ago

Mail sent.

Thanks for the info, I will definitely investigate Tika.

Sicos1977 commented 7 years ago

Also, just to satisfy my own curiosity... what are you using my library for?

andreas-eriksson commented 7 years ago

It's used to extract text from documents and then make them searchable with Lucene.

Sicos1977 commented 7 years ago

Another thing: you can also use the Java version of Tika. It has a web interface that can be called from .NET. It just depends on what you prefer; I myself prefer .NET over Java.
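
Because the web interface is plain HTTP, the call looks much the same from any language. The following is a minimal sketch in Python, assuming a Tika server is already running locally on its default port 9998 and exposes the `/tika` endpoint, which accepts a document via PUT and returns plain text when asked with `Accept: text/plain`. The helper names `build_tika_request` and `extract_text` are hypothetical.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("legacy.xls")` (a hypothetical file name) would return the text Tika extracted. A .NET client would send the same PUT request with its own HTTP client.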

andreas-eriksson commented 7 years ago

Me too.

Tika sure looks interesting, especially since it doesn't seem to have any other dependencies. Would be nice if users didn't have to install Office.

Sicos1977 commented 7 years ago

You don't have to install Office for my IFilter library either. There is an IFilter package for it. You can find it over here --> https://www.microsoft.com/en-us/download/details.aspx?id=17062

Sicos1977 commented 7 years ago

I also made an MSGReader library to extract information from MSG files. It has no IFilter support since that is kind of difficult to make in .NET, but with some coding you can probably make it work. You can find it over here --> https://github.com/Sicos1977/MSGReader. Other "extracting" libraries can be found over here --> https://github.com/Sicos1977/OfficeExtractor and https://github.com/Sicos1977/VCardReader.

OfficeExtractor extracts embedded OLE objects from Office files, such as an Excel attachment inside a Word document.