IQSS / dataverse

Open source research data repository software
http://dataverse.org

Redetect File Type API with ingested files #9429

Closed stevenferey closed 1 year ago

stevenferey commented 1 year ago

Issue created by the "entrepot.recherche.data.gouv.fr" team

What steps does it take to reproduce the issue?

Call the Redetect File Type API on an ingested file (with dryRun set to either true or false).

Which page(s) does it occur on?

API resource

What happens?

In Dataverse, ingesting a tabular file produces a .tab file with the MIME type text/tab-separated-values.

Running the Redetect File Type API on that .tab file changes its MIME type to text/tsv, because that is the type declared for the tab extension in the mime.types file:

https://github.com/IQSS/dataverse/blob/1a797171cdb73741b5da4a683f38697558349b5c/src/main/java/META-INF/mime.types#L9-L10

API return: {"status":"OK","data":{"dryRun":true,"oldContentType":"text/tab-separated-values","newContentType":"text/tsv"}}

Is this the intended behavior?

To whom does it occur (all users, curators, superusers)?

all users

What did you expect to happen?

I think there are two possible solutions:


First:

In the mime.types file, replace the entry

# Common statistical data formats
text/tsv tab TAB tsv TSV

with

# Common statistical data formats
text/tsv tsv TSV
text/tab-separated-values tab TAB

Second:

In the mime.types file, replace the entry

# Common statistical data formats
text/tsv tab TAB tsv TSV

with

# Common statistical data formats
text/tsv tsv TSV

and add the MIME type to the MimeTypeDetectionByFileExtension.properties file:

tab=text/tab-separated-values


If you want to keep the text/tab-separated-values MIME type for ingested files, it would be better if it could not be overridden through MimeTypeDetectionByFileExtension.properties.
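
To illustrate the effect of the first proposal, here is a minimal sketch of an extension lookup built from mime.types-style lines. It assumes a simplified parser written only for this example (the class MimeTypesSketch and its parse method are hypothetical); it is not the detection code Dataverse actually uses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Dataverse code: parses mime.types-style lines
// ("type ext1 ext2 ...") into an extension -> MIME type map.
class MimeTypesSketch {

    static Map<String, String> parse(String... lines) {
        Map<String, String> byExtension = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = trimmed.split("\\s+");
            // parts[0] is the MIME type, the remaining tokens are extensions
            for (int i = 1; i < parts.length; i++) {
                byExtension.put(parts[i], parts[0]);
            }
        }
        return byExtension;
    }

    public static void main(String[] args) {
        Map<String, String> current = parse(
                "# Common statistical data formats",
                "text/tsv tab TAB tsv TSV");
        Map<String, String> proposed = parse(
                "# Common statistical data formats",
                "text/tsv tsv TSV",
                "text/tab-separated-values tab TAB");

        System.out.println(current.get("tab"));  // text/tsv
        System.out.println(proposed.get("tab")); // text/tab-separated-values
        System.out.println(proposed.get("tsv")); // text/tsv
    }
}
```

With the current entry, the tab extension resolves to text/tsv, which matches the API response above; with the proposed entries, it resolves to text/tab-separated-values while tsv keeps text/tsv.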

Which version of Dataverse are you using?

5.12.1

Any related open or closed issues to this bug report?

stevenferey commented 1 year ago

Hello,

I would like to contribute a PR. Would you be interested in one of the proposals mentioned in the description?

Thank you so much, Steven.

landreev commented 1 year ago

Yes, this is definitely a problem that needs to be fixed (in other words, no, the current behavior is not correct).

There was some rationale behind the original decision to use two different MIME types for ingested and uningested tab-delimited files. As a way to fix this, I would strongly prefer not to touch either of the two type definition files above, and instead simply make the redetect API skip ingested files. There shouldn't be any practical case where redetecting the type of an ingested file could be necessary or useful. Concretely, I would fix this by adding an if (!dataFileIn.isTabularData()) check to the redetectDatafile() method in Files.java.
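
A minimal sketch of that guard, for illustration only. The actual body of redetectDatafile() is not shown in this thread, so the DataFile stand-in, detectByExtension(), and redetect() below are hypothetical names that only mimic the idea: ingested (tabular) files keep their stored type, and everything else goes through detection as before.

```java
import java.util.Locale;

// Hypothetical stand-ins, not the real Dataverse classes.
class RedetectGuardSketch {

    // Simplified placeholder for a data file record.
    static final class DataFile {
        private final String fileName;
        private final String contentType;
        private final boolean tabularData;

        DataFile(String fileName, String contentType, boolean tabularData) {
            this.fileName = fileName;
            this.contentType = contentType;
            this.tabularData = tabularData;
        }

        String getFileName() { return fileName; }
        String getContentType() { return contentType; }
        boolean isTabularData() { return tabularData; }
    }

    // Assumed placeholder for extension-based detection.
    static String detectByExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".tab") || lower.endsWith(".tsv")) {
            return "text/tsv";
        }
        return "application/octet-stream";
    }

    // Returns the content type the file should have after a redetect call.
    static String redetect(DataFile dataFileIn) {
        if (!dataFileIn.isTabularData()) {
            return detectByExtension(dataFileIn.getFileName());
        }
        // Ingested files are skipped: the stored type is left untouched.
        return dataFileIn.getContentType();
    }

    public static void main(String[] args) {
        DataFile ingested = new DataFile("data.tab", "text/tab-separated-values", true);
        DataFile uploaded = new DataFile("notes.tsv", "application/octet-stream", false);

        System.out.println(redetect(ingested)); // text/tab-separated-values
        System.out.println(redetect(uploaded)); // text/tsv
    }
}
```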

Apologies for overlooking this issue when you opened it originally. Thank you for bringing this to our attention and for offering to make a PR.