Closed: peterdesmet closed this issue 3 years ago
@adokter it looks like the `# comments` approach is the best one, as it does not rely on knowing the number of non-data lines. Most CSV readers have an option to ignore such lines.
@adokter and here is R code to read both the data and the metadata from such a file:
```r
# Load libraries
library(tidyverse)
library(yaml)

# Path of data
data_file <- "examples/vol2bird/behel_vpts_20191015.csv"

# Read data (default csv parser), ignoring comment lines
data <- read.csv(data_file, comment.char = "#")
head(data, 10)

# Read metadata (yaml in comments): collect all comment lines,
# then extract the block between the two "# ---" delimiters
comments <- grep("^\\s*#", readLines(data_file), value = TRUE)
yaml_start_end <- which(comments %in% "# ---")
yaml_start <- yaml_start_end[1] + 1
yaml_end <- yaml_start_end[2] - 1
raw_yaml <-
  comments[yaml_start:yaml_end] %>%
  str_remove("# ") %>%
  paste(collapse = "\n")
yaml.load(raw_yaml)
```
So that works at least.

Update: I have adapted the code so it can detect the YAML content (delimited by `---`) within a larger comment block. Feels safer.
Would the header always be the same number of lines? I'd prefer a solution where this can vary (more flexible, and not much more complex). In other words, I'm not sure the `skip = 9` (and related) suggestion in the table above is enough (but it wouldn't be rocket science to write a small wrapper on top of the CSV parsers to actually detect the number of lines to ignore).
Python solution (including use of iterators to deal with large files without loading everything in memory): https://stackoverflow.com/questions/14158868/python-skip-comment-lines-marked-with-in-csv-dictreader
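The approach from that thread can be sketched as a small generator that filters out comment lines before handing them to `csv.DictReader`, so large files stream line by line. The inline content below is a hypothetical stand-in for a vol2bird CSV file:

```python
import csv
import io

def decomment(lines):
    """Yield only non-comment lines; iterating lazily avoids
    loading the whole file into memory."""
    for line in lines:
        if not line.lstrip().startswith("#"):
            yield line

# Hypothetical inline content standing in for a vol2bird CSV file.
raw = io.StringIO(
    "# ---\n"
    "# radar: behel\n"
    "# ---\n"
    "datetime,height,dens\n"
    "2019-10-15T00:00:00,200,12.5\n"
)

rows = list(csv.DictReader(decomment(raw)))
print(rows)
```

In practice `io.StringIO` would be replaced by `open(path)`, and the generator keeps the memory footprint constant regardless of file size.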
Shouldn't we start a document somewhere where we document the whole file format (not only the header, and not only its content)?
Another important question for the file definition: do we guarantee that all commented lines contain YAML (if so, we remove the possibility to later add human-readable comments there), or not (in that case, we need a machine-friendly mechanism to detect where the YAML starts and ends)?
@niconoe

- https://github.com/enram/vpts/issues/7#issuecomment-828289649: No, I don't want to rely on the header always being the same number of lines, which is why I marked those solutions as manual. The comments approach does not have that issue, which is why I prefer it.
- https://github.com/enram/vpts/issues/7#issuecomment-828293284: Yes, that will be documented in the http://github.com/adokter/vol2bird repo.
- https://github.com/enram/vpts/issues/7#issuecomment-828297341: No, which is why I adapted my code. The YAML block should be delimited by `---` (the standard YAML approach). See https://github.com/enram/vpts/blob/main/examples/vol2bird/behel_vpts_20191015.csv
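For illustration, a file following that convention might look like this (the field names and values here are hypothetical, not taken from the linked example file):

```
# A human-readable comment (outside the YAML block)
# ---
# radar: behel
# source: vol2bird
# ---
datetime,height,dens
2019-10-15T00:00:00,200,12.5
```

A parser only has to find the two `# ---` delimiter lines among the comments; everything between them is the YAML metadata, and other comment lines remain free-form.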
@adokter on @niconoe's suggestion, I have started a separate repo to write and test parsing code for the new vol2bird csv files (R and Python). That way we can immediately test that what is being spit out by vol2bird is parsable. The R code is already written (longer version of code above): https://github.com/enram/vp-parser/blob/main/R/parser.R
Regarding D3/JS/CROW:
@peterdesmet: I remember this question being discussed in a slightly different context (about an "intermediate" format to use before agreeing on a proper standard). Now that we have changed our minds and want to make vpts a more official exchange format: are you still attached to the idea of embedding metadata as YAML inside CSV comments, or would you consider a more straightforward approach where the CSV contains only data, and all metadata is moved to datapackage.json?
The CSV should only contain the data (no comment lines or yaml).
Here is how different tools can handle comment vs frontmatter lines:

| Approach | Example files | `read_csv()` (pandas) | `read.csv` (base R) | `read_csv` (readr) |
| --- | --- | --- | --- | --- |
| comment lines | [file1.csv, file2.csv] | `comment='#'` | `comment.char = "#"` | `comment = "#"` |
| frontmatter | [file1.csv, file2.csv] | `skiprows=9` (manual setting of rows) | `skip = 9` (manual setting of rows) | `skip = 9` (manual setting of rows) |
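To make the contrast concrete, here is a small pandas sketch of the two approaches (the inline data is hypothetical, standing in for one of the example files):

```python
import io
import pandas as pd

# Hypothetical inline content standing in for an example file.
raw = (
    "# radar: behel\n"
    "datetime,height,dens\n"
    "2019-10-15T00:00:00,200,12.5\n"
)

# Comment-lines approach: the parser drops '#' lines wherever they
# occur, so the number of metadata lines need not be known in advance.
df_comment = pd.read_csv(io.StringIO(raw), comment="#")

# Frontmatter approach: a fixed number of leading lines is skipped
# manually (here 1), which breaks if the header length ever changes.
df_skip = pd.read_csv(io.StringIO(raw), skiprows=1)

print(list(df_comment.columns))
```

This illustrates why the comment-lines approach is more robust: `skiprows` must be updated whenever the metadata block grows or shrinks, while `comment="#"` adapts automatically.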