daveneiman / fits

File Information Tool Set
http://fitstool.org

Structure of Aggregated XML Output #3

Open Asbjoedt opened 4 years ago

Asbjoedt commented 4 years ago

Hello Dave

First of all, thank you so much for this quick solution for an aggregated XML output file (fitscollection.xml). That's simply amazing.

We did some tests with the 1.5.1-SNAPSHOT today on our batch of files "Excel test korpus" which we have created to analyze and convert Excel files as part of an ongoing investigation on whether to accept spreadsheets in our collections in other formats than the current standard format TIFF.

The -a function does exactly what we need it to do. We now have a single XML output file to import into our analytical program (Excel), and in this file we compare the outputs of different identification, characterization and validation software, so far including FIDO, JHOVE, DROID, Siegfried and FITS.

However, each of these other programs produces a single output file (CSV or XML) in which each analyzed file corresponds to one row in our imported Excel sheet. This differs from FITS, which outputs 27 rows per analyzed file. This leads us to believe there is something in the XML schema or structure that is perhaps well suited to the analysis of a single file with a single XML output, but not very well suited to a single XML file containing multiple analyzed files.

What do you think of this? Are we simply misunderstanding something?

We have attached our XML output file and our imported Excel file based on the "Excel test korpus". Be aware that GitHub does not allow the attachment of XML files, so we changed the extension to .txt, but you can change it back. I presume that's a completely harmless conversion.

Looking forward to reading your response. Regards Asbjoern

FITS_testlog.txt _Resultater.xlsx

daveneiman commented 4 years ago

Hello Asbjoern,

I am not quite sure I understand your questions. However, here is what I was able to determine. I was able to change the .txt extension to .xml without a problem to more easily examine the FITS output. The tools contained within FITS can properly identify the format of .xls files in the <identification> section.

    <identification>
      <identity format="Microsoft Excel" mimetype="application/vnd.ms-excel" toolname="FITS" toolversion="1.5.1-SNAPSHOT">
        <tool toolname="Droid" toolversion="6.4" />
        <tool toolname="Exiftool" toolversion="11.54" />
        <tool toolname="Tika" toolversion="1.21" />
        <version toolname="Droid" toolversion="6.4" status="CONFLICT">8</version>
        <version toolname="Droid" toolversion="6.4" status="CONFLICT">8X</version>
        <externalIdentifier toolname="Droid" toolversion="6.4" type="puid">fmt/61</externalIdentifier>
        <externalIdentifier toolname="Droid" toolversion="6.4" type="puid">fmt/62</externalIdentifier>
      </identity>
    </identification>

However, the tools do not agree within this section for .xlsx files.

    <identification status="CONFLICT">
      <identity format="Office Open XML Document" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" toolname="FITS" toolversion="1.5.1-SNAPSHOT">
        <tool toolname="Droid" toolversion="6.4" />
        <version toolname="Droid" toolversion="6.4">2007 onwards</version>
        <externalIdentifier toolname="Droid" toolversion="6.4" type="puid">fmt/189</externalIdentifier>
      </identity>
      <identity format="XLSX" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" toolname="FITS" toolversion="1.5.1-SNAPSHOT">
        <tool toolname="Exiftool" toolversion="11.54" />
      </identity>
      <identity format="Office Open XML Workbook" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" toolname="FITS" toolversion="1.5.1-SNAPSHOT">
        <tool toolname="Tika" toolversion="1.21" />
      </identity>
    </identification>

This conflict can often be resolved by the "normalization" of tool output within the FITS code as has been done for other file formats. By doing this you would be able to see an <identification> section for .xlsx files similar to .xls files.
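To make the disagreement concrete, here is a minimal Python sketch that reads an <identification> section like the one above and lists which tool reported which format and MIME type. It is not part of FITS: the function name `summarize_identities` is hypothetical, the sample is a trimmed copy of the snippet above, and the code assumes the element and attribute names shown there (any XML namespace prefix is stripped).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the conflicting <identification> section shown above.
SAMPLE = """<identification status="CONFLICT">
  <identity format="Office Open XML Document" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document">
    <tool toolname="Droid" toolversion="6.4" />
  </identity>
  <identity format="XLSX" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
    <tool toolname="Exiftool" toolversion="11.54" />
  </identity>
  <identity format="Office Open XML Workbook" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
    <tool toolname="Tika" toolversion="1.21" />
  </identity>
</identification>"""


def local(tag):
    """Drop a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def summarize_identities(identification):
    """Return (is_conflict, [(tool, format, mimetype), ...]) for one section."""
    rows = []
    for identity in identification:
        if local(identity.tag) != "identity":
            continue
        for child in identity:
            if local(child.tag) == "tool":
                rows.append(
                    (child.get("toolname"), identity.get("format"), identity.get("mimetype"))
                )
    return identification.get("status") == "CONFLICT", rows


conflict, rows = summarize_identities(ET.fromstring(SAMPLE))
print("conflict:", conflict)
for tool, fmt, mime in rows:
    print(f"{tool}: {fmt} ({mime})")
```

Run against this sample, the listing shows that Exiftool and Tika report the same MIME type under different format names, while Droid reports a different MIME type. That is the kind of difference normalization would have to reconcile.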

However, what I see as a larger problem is that there is no metadata output for any files in this corpus as seen in the empty <metadata /> element within each file section of the FITS output file. The reason for this is that FITS does not handle metadata output for spreadsheet file types (as it does for images, video, word processing documents, etc.). I'm sorry to say that there are currently no plans within our technology group here at the Harvard Library to address spreadsheet files. The only new file formats we may be adding in the coming year are disk images and CAD documents. Adding metadata output for a new file type is a significant amount of work -- much, much more than adding the -a flag. It may be necessary to stick with the other tools you are using to process spreadsheet (Excel) files. I welcome your response.

Sincerely, David

Asbjoedt commented 4 years ago

Hello David

Happy New Year!!!

Thank you for explaining the normalisation possibility. We have been wondering why the different tools reported different MIME types and PUIDs for the .xlsx format.

Also the metadata section would be nice to have but we cannot dedicate the resources either. We had actually not noticed yet that it was missing. :-)

I also now understand your difficulties with our previous issue. Because: We provided you with a spreadsheet of output data that did not include the new -a FITS analysis! I have embedded the updated file. It opens the FITS sheet as default.

_Resultater.xlsx

In the sheet you can see that the analysis of the first file is spread over the first 28 rows, rather than all the information for that file appearing in the columns of a single row. Do you see? What do you think?

Regards Asbjørn

daveneiman commented 4 years ago

Hello Asbjørn,

If I understand correctly, your concern is about why the FITS output for a single file appears in multiple rows in the spreadsheet you supplied. If so, I’m afraid I have no explanation for this. Though I’m a software developer, I have no knowledge or experience of importing XML into an Excel spreadsheet. If your concern is more about the FITS output itself, then please explain further.

Thank you, David
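One possible explanation, offered as an assumption and not confirmed anywhere in this thread, is that a spreadsheet XML import emits a new row for each repeated element (each <tool>, <version>, <externalIdentifier> and so on), so a single file fans out over many rows. A workaround might be to flatten the aggregated output to one row per file before importing it. The Python sketch below illustrates the idea with the standard library only. The wrapper elements (<collection>, <fits>), the <fileinfo>/<filename> element and the file names `a.xls` and `b.xlsx` are hypothetical stand-ins and should be checked against the real fitscollection.xml. The function `flatten` is likewise hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the aggregated output. The wrapper elements,
# <fileinfo>/<filename> and the file names are assumptions for illustration.
SAMPLE = """<collection>
  <fits>
    <identification>
      <identity format="Microsoft Excel" mimetype="application/vnd.ms-excel">
        <tool toolname="Droid" toolversion="6.4" />
        <externalIdentifier toolname="Droid" type="puid">fmt/61</externalIdentifier>
        <externalIdentifier toolname="Droid" type="puid">fmt/62</externalIdentifier>
      </identity>
    </identification>
    <fileinfo><filename>a.xls</filename></fileinfo>
    <metadata />
  </fits>
  <fits>
    <identification status="CONFLICT">
      <identity format="XLSX" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
        <tool toolname="Exiftool" toolversion="11.54" />
      </identity>
      <identity format="Office Open XML Workbook" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
        <tool toolname="Tika" toolversion="1.21" />
      </identity>
    </identification>
    <fileinfo><filename>b.xlsx</filename></fileinfo>
    <metadata />
  </fits>
</collection>"""

FIELDS = ["filename", "status", "formats", "mimetypes", "puids"]


def local(tag):
    """Drop a '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def joined(values):
    """Collapse repeated values into one cell, keeping first-seen order."""
    return "; ".join(dict.fromkeys(v for v in values if v))


def flatten(root):
    """Build one dict per element that directly contains <identification>."""
    rows = []
    for section in root.iter():
        ident = next((c for c in section if local(c.tag) == "identification"), None)
        if ident is None:
            continue
        identities = [e for e in ident if local(e.tag) == "identity"]
        rows.append({
            "filename": joined(e.text for e in section.iter() if local(e.tag) == "filename"),
            "status": ident.get("status", ""),
            "formats": joined(i.get("format") for i in identities),
            "mimetypes": joined(i.get("mimetype") for i in identities),
            "puids": joined(
                e.text for e in ident.iter()
                if local(e.tag) == "externalIdentifier" and e.get("type") == "puid"
            ),
        })
    return rows


rows = flatten(ET.fromstring(SAMPLE))

# Write the flattened rows as CSV text, which Excel can open directly.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Repeated values are joined into one cell with "; " so that each analyzed file stays on a single row. Whether that trade-off suits the comparison with the other tools is a judgment call for whoever does the analysis.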