AquaAuma / FishGlob_data

Database and methods related to the manuscript "An integrated database of fish biodiversity sampled with scientific bottom trawl surveys"
Creative Commons Attribution 4.0 International

FYI: some haul_id values exceed integer precision of read/write csv in R #49

Open afredston opened 2 months ago

afredston commented 2 months ago

Spent a day down this rabbit hole and wanted to share:

In the FISHGLOB dataset, at least some values in the haul_id column are very long strings of digits. These exceed the numeric precision that R's CSV functions can preserve, I think in both base R (read.csv/write.csv) and readr (read_csv/write_csv): a column that looks numeric is parsed as a number on read, and a double only holds about 15 to 17 significant digits. This means that _if you write out a CSV with a haul_id column and then read it back in, the values will be wrong, even if the column class was "character" when you wrote it out._ (My panicked SO question from when I figured this out has a reprex.)
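A minimal base R sketch of the round trip described above. The 20-digit ID is made up for illustration and is not a real FISHGLOB haul_id; the exact digits you get back may differ by platform, so the example only checks that they no longer match.

```r
# Made-up haul ID for illustration only: 20 digits, stored as character
ids <- data.frame(haul_id = "12345678901234567890", stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")
write.csv(ids, tmp, row.names = FALSE)

# The file on disk still contains the full digit string
readLines(tmp)

# Default read: the column type is guessed from its content and becomes numeric
back <- read.csv(tmp)
class(back$haul_id)

# Converting the double back to text does not recover the original ID
roundtrip <- sprintf("%.0f", back$haul_id)
identical(roundtrip, ids$haul_id)

# Doubles can represent every integer exactly only up to 2^53
2^53
```

So the write step is not where the damage happens; the digits are lost when the text is converted to a double on read.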

There is a simple solution: specify that the column should be read as character rather than numeric when the CSV is read, like so:

hauldat <- read_csv(here("data","haul_data.csv"), col_types = cols(haul_id = col_character())) 
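For anyone using base R instead of readr, the same idea can be sketched with the colClasses argument of read.csv. The ID and the depth column here are hypothetical, and the file is a temporary one created just for the example.

```r
# Made-up data for illustration only; "depth" is a hypothetical extra column
ids <- data.frame(haul_id = "12345678901234567890", depth = 42.5,
                  stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")
write.csv(ids, tmp, row.names = FALSE)

# Name the column in colClasses; columns not named are still guessed as usual
fixed <- read.csv(tmp, colClasses = c(haul_id = "character"))

class(fixed$haul_id)                   # the ID column stays character
identical(fixed$haul_id, ids$haul_id)  # compare with the value that was written
class(fixed$depth)                     # the other column is still parsed as a number
```

Naming only haul_id leaves the type guessing in place for every other column, so this is a small change to an existing read.csv call.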

This problem also does not occur if other file formats (e.g., .RData) are used. So unless we change the formatting of the haul IDs, which would cause other issues, we should encourage FISHGLOB users who code in R to save data in other formats.
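A small sketch of the .RData route, again with a made-up ID. Because save() and load() store the R object itself, including its column classes, nothing is re-parsed from text on the way back in.

```r
# Made-up haul ID for illustration only
ids <- data.frame(haul_id = "12345678901234567890", stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".RData")
save(ids, file = tmp)

# Load into a separate environment so the original object is not overwritten
env <- new.env()
load(tmp, envir = env)

class(env$ids$haul_id)
identical(env$ids$haul_id, ids$haul_id)
```

saveRDS() and readRDS() should behave the same way for a single object, if a one-object-per-file format is preferred.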