NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

OpenRefine adding "\r" to cataloger last name upon export #519

Open RebekkaML opened 2 months ago

RebekkaML commented 2 months ago

During the process of uploading a file created by the DigiApp into Specify, this file is first checked for errors in LibreOffice (Excel would change some of the formatting and create problems) and then processed in OpenRefine.

When this processed file is exported from OpenRefine, a "\r" sometimes appears after the cataloger last name. It does not appear in OpenRefine itself, only in the exported .tsv file. This is not a huge issue, since we can remove it in the Specify workbench before importing, but it causes duplicate agents to be created if we overlook it.
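As a stopgap before the Specify import, the exported file could be checked and cleaned with a small script. The following is only a sketch, not part of the existing workflow. It assumes the "\r" is a real carriage-return character (U+000D) rather than the two literal characters backslash and r, that the .tsv contains no quoted fields with embedded tabs or newlines, and that rows end in "\n". The function name and the column names in the usage example are hypothetical.

```python
def find_and_strip_cr(tsv_text):
    """Return (cleaned_text, hits) for a tab-separated text.

    hits lists (row_number, column_name) for every cell that contained
    a carriage return. Row numbers are 1-based and row 1 is the header.
    Assumes rows are separated by "\n" and fields by tabs (no quoting).
    """
    lines = tsv_text.split("\n")
    header = [name.replace("\r", "") for name in lines[0].split("\t")]
    hits = []
    cleaned = ["\t".join(header)]
    for row_number, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        for index, cell in enumerate(cells):
            if "\r" in cell:
                # Fall back to a positional label if the row is longer than the header.
                column = header[index] if index < len(header) else f"column {index + 1}"
                hits.append((row_number, column))
        cleaned.append("\t".join(cell.replace("\r", "") for cell in cells))
    return "\n".join(cleaned), hits


# Hypothetical sample: the column names are made up for illustration.
sample = (
    "catalognumber\tcatalogerlastname\tnotes\n"
    "1\tSmith\r\tok\n"
    "2\tJones\tok\n"
)
cleaned_text, hits = find_and_strip_cr(sample)
```

When reading a real file, open it with `newline=""` so that Python does not translate line endings before the check. If the file uses "\r\n" line endings throughout, the last column of every row will be reported, which is expected and not the error described here.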

It seems to happen only in files that we edited and saved in LibreOffice (but not in every one of these files). If there is nothing wrong with the file and we close it without saving, it doesn't happen. We usually export them with windows-1252 encoding, but we also tried UTF-8 and it produced the same error.
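One way to narrow this down would be to compare the line endings of a file saved in LibreOffice with one that was not. The sketch below counts the three line-ending styles at byte level. Carriage return and line feed are the single bytes 0x0D and 0x0A in both windows-1252 and UTF-8, so the count does not depend on which of the two encodings was used. Whether mixed line endings are actually the cause here is an assumption, and the function name is hypothetical.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone LF, and lone CR sequences in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# Made-up bytes: one CRLF row, one LF row, and one stray CR inside a row.
example = b"name\tlastname\r\nAnna\tSmith\r\tx\nBo\tJones\n"
counts = count_line_endings(example)
```

For a real file, read it in binary mode, for example `count_line_endings(open(path, "rb").read())` with `path` pointing at the .csv or .tsv in question. A file with a nonzero count in more than one category has mixed line endings.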

An example OpenRefine Export file can be found here: "N:\SCI-SNM-DigitalCollections\DaSSCo\DigiApp\Data\6.Archive\NHMD_Herbarium\v1.2.0_20240227\NHMD_Herba_20240517_16_28_SS_processed.tsv"

The file that was edited in LibreOffice and then imported into OpenRefine is here: "N:\SCI-SNM-DigitalCollections\DaSSCo\DigiApp\Data\6.Archive\NHMD_Herbarium\v1.2.0_20240227\NHMD_Herba_20240517_16_28_SS_checked_corrected.csv"

FedorSteeman commented 2 months ago

@RebekkaML LibreOffice on MacOS or Windows?

I can't replicate on Windows, but in LibreOffice you could check how files are saved:

Image

RebekkaML commented 2 months ago

@FedorSteeman I'm using Windows 10.

I saved the file the way you suggested. Most settings were already the same; only the default character set was windows-1252, which I have now changed to UTF-8.

But this didn't solve the problem.

FedorSteeman commented 2 months ago

I can't replicate the issue, so you could try on another computer and maybe also update your LibreOffice.