Sicos1977 / IFilterTextReader

A reader that gets text from different file formats through the IFilter interface

Document metadata properties #33

Closed mguinness closed 5 years ago

mguinness commented 5 years ago

When the includeProperties option is set to true, the metadata is included in the output. Would it be possible to also expose it as a dictionary through a new property on the FilterReader class? I can put a PR together if you have no objections.

Sicos1977 commented 5 years ago

Sure ... if you make it optional

Sicos1977 commented 5 years ago

I just released a new package https://www.nuget.org/packages/IFilterTextReader/1.6.1

mguinness commented 5 years ago

New package works great, thanks!

mantis commented 5 years ago

@Sicos1977 and @mguinness the only problem with this is that metadata properties can be duplicated, e.g.

Names: foo
Names: bar

In this scenario, adding the second entry to the dictionary throws a "key already exists" exception. I'll log a separate issue for this as well.

mguinness commented 5 years ago

Out of interest, what is the filtdump output for an example file? I imagine the tags are coming from different sections in the file. Changing the field type to List<KeyValuePair<string, object>> would work.
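To illustrate the suggestion above, here is a minimal sketch of why a dictionary fails on repeated names and how a List<KeyValuePair<string, object>> avoids it. The `MetadataSketch` class and its `Collect` helper are hypothetical names made up for this example; they are not part of IFilterTextReader, and the library's actual field may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; not IFilterTextReader's API.
public static class MetadataSketch
{
    // Collects (name, value) pairs in order, without requiring unique names.
    public static List<KeyValuePair<string, object>> Collect(
        IEnumerable<(string Name, object Value)> chunks)
    {
        var properties = new List<KeyValuePair<string, object>>();
        foreach (var (name, value) in chunks)
            properties.Add(new KeyValuePair<string, object>(name, value));
        return properties;
    }

    public static void Main()
    {
        // Two values reported under the same property name.
        var chunks = new (string Name, object Value)[]
        {
            ("Names", "Test A"),
            ("Names", "Test B")
        };

        // Dictionary<TKey, TValue>.Add throws ArgumentException on a repeated key.
        var dictionary = new Dictionary<string, object>();
        try
        {
            foreach (var (name, value) in chunks)
                dictionary.Add(name, value);
        }
        catch (ArgumentException e)
        {
            Console.WriteLine("Dictionary failed: " + e.Message);
        }

        // The list keeps both entries; ToLookup groups them by name on demand.
        var properties = Collect(chunks);
        var byName = properties.ToLookup(p => p.Key, p => p.Value);
        foreach (var value in byName["Names"])
            Console.WriteLine(value);
    }
}
```

The list preserves the order in which values were read and leaves it to the caller to decide whether repeated names should be merged, grouped, or treated as an array.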

mantis commented 5 years ago

@mguinness - sorry, I didn't rush back to this - in this case it's the same section, but the 'different sections' is also a problem

CHUNK: ---------------------------------------------------------------
    Attribute = {2C443B1E-F1E2-404F-974D-E21FEF8E70AA}\Names
    idChunk = 13
    BreakType = 2 (Sentence)
    Flags (chunkstate) = (Value)
    Locale = 2057 (0x809)
    IdChunkSource = 13
    cwcStartSource = 0
    cwcLenSource = 0

VALUE: ---------------------------------------------------------------
    Type = 31 (0x1f), VT_LPWSTR
    Value = "Test A"

CHUNK: ---------------------------------------------------------------
    Attribute = {2C443B1E-F1E2-404F-974D-E21FEF8E70AA}\Names
    idChunk = 14
    BreakType = 2 (Sentence)
    Flags (chunkstate) = (Value)
    Locale = 2057 (0x809)
    IdChunkSource = 14
    cwcStartSource = 0
    cwcLenSource = 0

VALUE: ---------------------------------------------------------------
    Type = 31 (0x1f), VT_LPWSTR
    Value = "Test B"


  <rdf:Description rdf:about=""
        xmlns:TestSchema="http://test">
     <TestSchema:Names>
        <rdf:Bag>
           <rdf:li>Test A</rdf:li>
           <rdf:li>Test B</rdf:li>
        </rdf:Bag>
     </TestSchema:Names>
  </rdf:Description>

Now, whilst we changed to <string, object> (and I'm going to look at this again soon), I seem to recall thinking that including the schema in the output would be useful. I'm pretty sure I found that the string key becomes just 'Names'. So if one purpose is to let an application filter on a specific property, the metadata property output doesn't let you tell the same name from different paths apart when there is a conflict. For example, I have:

[image]

Where we have System.Title, title and Title.

One of them is dc:title, the other is TestSchema:Title, and presumably System.Title is the default document title outside the metadata. I think this is the issue you were hinting at?
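One way to picture the "include the schema" idea is to qualify each name with the property set GUID that filtdump prints in its Attribute line. The sketch below assumes that approach; `QualifiedNameSketch` and `Qualify` are hypothetical names, the second GUID is a made-up placeholder standing in for a different schema, and none of this is confirmed to be how the library reports properties.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not IFilterTextReader's API.
public static class QualifiedNameSketch
{
    // Builds a key in the same "{GUID}\Name" shape that filtdump shows for Attribute.
    public static string Qualify(Guid propertySet, string name)
    {
        return propertySet.ToString("B").ToUpperInvariant() + "\\" + name;
    }

    public static void Main()
    {
        // GUID taken from the filtdump output in this thread.
        var testSchema = Guid.Parse("2C443B1E-F1E2-404F-974D-E21FEF8E70AA");
        // Placeholder GUID, assumed purely for illustration.
        var otherSchema = Guid.Parse("11111111-1111-1111-1111-111111111111");

        var properties = new List<KeyValuePair<string, object>>
        {
            new KeyValuePair<string, object>(Qualify(testSchema, "Title"), "Title from TestSchema"),
            new KeyValuePair<string, object>(Qualify(otherSchema, "Title"), "Title from another schema")
        };

        // The two "Title" entries now have distinct keys and can be filtered separately.
        foreach (var property in properties)
            Console.WriteLine(property.Key + " = " + property.Value);
    }
}
```

With qualified keys, an application that only cares about one schema's Title can match on the full key, while the bare name stays available as the part after the backslash.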

mguinness commented 5 years ago

Thanks for the reply. The example you cited seems more like an array of names. Can you upload a small example document?

mantis commented 5 years ago

@mguinness - it was indeed an array of names - sample image uploaded below (hopefully GitHub doesn't modify it):

pixel