adiwg / mdTranslator

Metadata translation tool built using Ruby
https://www.adiwg.org/mdTranslator/
The Unlicense

New simplified HTML writer for cleaner import into MS Word #246

Closed: dwalt closed this issue 11 months ago

dwalt commented 1 year ago

Create a new writer that outputs metadata in a format that can be imported into Word. This is commonly done in USGS to create a metadata review document with commentary added.

hmaier-fws commented 1 year ago

@dwalt I'm wondering if we just need a simplified version of HTML?

It seems that the problem with the current HTML export is that Word doesn't like all of the nested expandable sections. Word seems to do fine with basic HTML (try downloading the https://json-schema.org/ or https://git-scm.com/docs/git-branch pages and importing them into Word).

This would be a simpler solution since most of the work has already been done. It should just require modifying the HTML tags used for output instead of implementing a completely new RTF standard.

dwalt commented 1 year ago

@hmaier-fws Agreed, a different stylesheet might be all we need. It would be a good place to start, using the friendly labels that the current HTML writer uses in expanded format, to see if this works for users.

jwaspin commented 1 year ago

I was able to import the HTML version into Word. Is the issue just with the formatting?

hmaier-fws commented 1 year ago

@jwaspin Yes. The HTML does import into Word, but it is virtually unusable because of the formatting. If we can export to a simplified HTML document, that should allow most of the basic formatting to be rendered by Word. The problem is all of the collapsible sections, the right navigation menu, and the associated JavaScript enhancements such as the geographic extent, which is embedded as a map object.

I'm not sure how Word imports HTML. Maybe it's enough to edit the CSS to hide some of the problematic sections?

We do not want to alter the existing HTML export; we want to add a new "simple HTML" export.
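One way to picture the "simple HTML" idea is a post-processing pass that drops the interactive pieces before the file reaches Word. The following is only a minimal sketch under stated assumptions: the thread does not say how the existing writer marks up its navigation menu or map, so the `script` and `nav` tag names and the `strip_interactive` helper are hypothetical, and a real implementation would more likely change the writer's templates or use an HTML parser rather than regular expressions.

```ruby
# Hypothetical list of elements to drop; the actual markup used by the
# existing HTML writer is not shown in this thread.
INTERACTIVE_BLOCKS = %w[script nav].freeze

# Remove each listed element together with its content.
# Assumes these elements are not nested inside themselves.
def strip_interactive(html)
  INTERACTIVE_BLOCKS.reduce(html) do |text, tag|
    # /m lets "." span line breaks; /i ignores tag-name case.
    text.gsub(%r{<#{tag}\b[^>]*>.*?</#{tag}\s*>}mi, '')
  end
end

page = '<body><nav><a href="#x">X</a></nav><p>Title</p>' \
       '<script>initMap();</script></body>'
puts strip_interactive(page)
```

This leaves the existing export untouched, since it operates on a copy of the generated markup.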

dwalt commented 1 year ago

@jwaspin I think we can start with a copy of the CSS and hack it to improve formatting (fonts, line spacing, indentation, and map graphic rendering) and see where that gets us.

I have enclosed an HTML writer import to Word, a CSDGM import to Word, and a CSDGM text import to Word. The latter two were produced using the Metadata Parser (MP), a CSDGM validation tool that also formats XML to HTML and TXT: https://www1.usgs.gov/mp/. The CSDGM HTML from MP has been used extensively in USGS to create review docs and as a presentation format for data releases. I think it would be a good model to work from regarding format. However, indentation is lost in the import to Word. Run MP, load a CSDGM record, and review the Outline format to see what I mean. The TXT format retains indentation but obviously has plain vanilla styling. I would point out the section links at the top of the CSDGM Word doc, which are very handy for navigation.

I downloaded Metadata Parser but did not find a CSS file in it.

Attachments: htmlCSDGMWordImport.docx, htmlWriterWordImport.docx

jwaspin commented 1 year ago

@dwalt I looked into this and the CSS does not appear to be the issue. There were some tags that are not standard HTML tags, and Word just doesn't know what to do with them. I changed them all to div tags, and that seems to resolve the issue. There's probably some cleanup that could make it look a little better in certain places, but I wanted to let you take a look before I spend more time tinkering with this one.
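The tag substitution described above could look roughly like the following. This is a sketch, not the change that was actually made on the branch: the thread does not name the non-standard tags, so the `WORD_SAFE_TAGS` allow-list, the `simplify_tags` helper, and the `md-section` / `md-field` tag names in the example are all hypothetical. It also uses a regular expression instead of an HTML parser, so it would mishandle attribute values that contain `>`.

```ruby
# Hypothetical allow-list of tags to keep as-is; anything else becomes a div.
WORD_SAFE_TAGS = %w[
  html head title meta style body div span p a br hr
  h1 h2 h3 h4 h5 h6 ul ol li table thead tbody tr th td
  b i em strong pre code img
].freeze

# Matches opening, closing, and self-closing tags. Comments and DOCTYPE
# start with "<!" and are therefore left untouched.
TAG_PATTERN = %r{<(/?)([A-Za-z][A-Za-z0-9-]*)([^<>]*)>}

# Rewrite every tag whose name is not on the allow-list as a div,
# keeping its attributes and its opening/closing form.
def simplify_tags(html)
  html.gsub(TAG_PATTERN) do |tag|
    closing = Regexp.last_match(1)
    name    = Regexp.last_match(2)
    rest    = Regexp.last_match(3)
    WORD_SAFE_TAGS.include?(name.downcase) ? tag : "<#{closing}div#{rest}>"
  end
end

# The md-* tag names are made up for illustration.
sample = '<md-section id="contact"><h2>Contact</h2>' \
         '<md-field>Name</md-field></md-section>'
puts simplify_tags(sample)
```

An allow-list is used here rather than a list of known bad tags so that any unexpected tag falls back to a div, which Word handles.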

hmaier-fws commented 1 year ago

@jwaspin Do you have an example of the updated HTML that we could try importing?

jwaspin commented 1 year ago

@hmaier-fws Yes, I just pushed the output file I produced here:

https://github.com/adiwg/mdTranslator/blob/feature/simple-html-for-word/simple.html

dwalt commented 1 year ago

@jwaspin This is definitely an improvement. We lost the section links and the map graphic, but this is a start.

dwalt commented 11 months ago

I'm not sure why this was closed. The writer is not available anywhere for anyone to test. It is in the drop-down list here: http://34.201.136.147:3002/ but it doesn't work: "Writer cannot be found".

dwalt commented 11 months ago

The writer works now on 3002. It will fail, however, if there are mdJSON errors. It seems an HTML writer should not care whether or not a record is valid, as the output can be useful for tracing content errors. I turned off validation and force valid output settings, but it still failed. I am closing this issue since this was intended as a prototype to spur refined requirements from users.