Distinguishing between XML files and HTML files

jggautier commented 4 years ago

Dataverse categorizes some uploaded XML files as HTML, such as the two XML files in this dataset: https://doi.org/10.7910/DVN/ERQWPH. And in these cases, a poor preview of the XML file is shown where the tags and structure are removed.

Other times it categorizes uploaded XML files as XML, like the XML files in this dataset: https://doi.org/10.7910/DVN/BF2VNK. In these cases, there is no option to preview the file, which I would expect since it isn't listed in this repo's readme as a filetype that can be displayed.

Is it right that the dataverse-previewers determine which files to preview and how to preview them based on the file type or mimetype that Dataverse assigns the file when the file is uploaded? If so, could you imagine any reasons why one set of XML files have been categorized as XML files and one set was categorized as HTML files? I looked at the content of XML files from both datasets but nothing seemed obvious. Thanks :)

qqmyers commented 4 years ago

Yep - exactly. The previewers are actually ~generic - they'll try to do their thing on any data you give them, which may or may not be useful given the data. However, the triggering is solely based on the manifests sent to dataverse to register the previewer tool. (And since each manifest can only specify one mimetype, tools like the image or audio previewers get registered multiple times for different mimetypes.)
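To illustrate the one-mimetype-per-manifest point, here is a minimal Python sketch that builds a separate registration manifest for each mimetype a single previewer should handle. The `build_manifests` helper, the field names (`displayName`, `toolUrl`, `contentType`), and the example URL are assumptions made for illustration; check the Dataverse external tools documentation for the actual manifest format.

```python
import json


def build_manifests(display_name, tool_url, mimetypes):
    """Return one manifest dict per mimetype, since each manifest names exactly one."""
    return [
        {
            "displayName": display_name,  # assumed field name
            "toolUrl": tool_url,          # assumed field name, hypothetical URL
            "contentType": mimetype,      # the single mimetype that triggers the tool
        }
        for mimetype in mimetypes
    ]


manifests = build_manifests(
    "View Image",
    "https://example.org/previewers/ImagePreview.html",
    ["image/png", "image/jpeg", "image/gif"],
)

# Each manifest would be registered with Dataverse separately.
for manifest in manifests:
    print(json.dumps(manifest))
```

The same previewer URL appears in every manifest; only the triggering mimetype differs.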

The challenge with mime-types in general is that the source of the determination can come from several places. The browser may send that info when uploading, Dataverse can check based on the file extension, and, for some types, it will even look inside the file to determine the type.
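As a rough sketch of how those three sources could be combined, the Python below prefers a look inside the file, then the extension, then whatever the browser sent. This is not Dataverse's actual detection logic: the function names, the precedence order, and the small extension table are all assumptions chosen for illustration.

```python
import os
from typing import Optional

# Hypothetical extension table; a real system would use a much larger one.
EXTENSION_TYPES = {
    ".xml": "application/xml",
    ".html": "text/html",
    ".csv": "text/csv",
}


def sniff_markup_type(head: bytes) -> Optional[str]:
    """Guess XML vs HTML from the first bytes of a file, or return None."""
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        head = head[3:]
    text = head.lstrip().lower()
    if text.startswith(b"<?xml"):
        return "application/xml"
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "text/html"
    return None


def resolve_mimetype(browser_type: Optional[str], filename: str, head: bytes) -> Optional[str]:
    """Assumed precedence: file content, then extension, then the browser's claim."""
    sniffed = sniff_markup_type(head)
    if sniffed is not None:
        return sniffed
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTENSION_TYPES:
        return EXTENSION_TYPES[extension]
    return browser_type


# A browser mislabels an XML file as HTML; the content check overrides it.
print(resolve_mimetype("text/html", "codebook.xml", b'<?xml version="1.0"?><codeBook/>'))
```

A different precedence (for example, trusting the browser first) would give different results for mislabeled files, which is the kind of inconsistency described in this issue.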

I've seen this happen with csv, where different browsers send either text/csv or text/comma-separated-values, so the tabular previewer needs to be registered for both.

I'm not sure what makes sense when the mime-type sent is wrong - whether Dataverse should just impose its own mimetype based on the extension (or override for the ones it 'knows') or if the UI should allow it to be changed.

A smarter HTML previewer might be able to show tags as an option (it's a security risk to just display them as is, but replacing the < and > with the &lt; and &gt; codes would show them without making them real HTML tags). That won't solve the problem of two mimetypes unless you just want to let the HTML previewer view XML as well (once someone has added an option like the one I mention).
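A minimal sketch of that escaping idea, written in Python only for illustration (a real previewer would do the equivalent in the browser). The `xml_as_visible_text` helper is hypothetical; `html.escape` is from the Python standard library.

```python
from html import escape


def xml_as_visible_text(xml_source: str) -> str:
    """Escape markup characters so tags are displayed as text, not interpreted."""
    # escape() replaces &, <, > (and quotes) with character references.
    return "<pre>" + escape(xml_source) + "</pre>"


print(xml_as_visible_text("<var>age</var>"))
```

Because every `<` and `>` in the input becomes a character reference, the browser renders the tags literally and cannot execute anything embedded in the file.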

jggautier commented 4 years ago

Thanks for the quick and helpful reply, as always!

GlobalDataverseCommunityConsortium / dataverse-previewers

Distinguishing between XML files and HTML files #37