enram / vpts-csv

Data exchange format for biological signals detected by weather radars
https://aloftdata.eu/vpts-csv/
MIT License

CSV ignore lines #7

Closed · peterdesmet closed this issue 3 years ago

peterdesmet commented 3 years ago

Here is how different tools handle comment lines vs. frontmatter lines:

**Comment lines**

```
# id: behel_vp_20191015_1415_data.csv
# how:
#   clutterMap: "clutter.h5"
#   dealiased: 0
# where:
#   lon: 5.176
#   lat: 52.101
```
| tool | default | ignore solution |
| --- | --- | --- |
| frictionless | ignored, but fails on second file `[file1.csv, file2.csv]` | none, likely bug |
| jekyll | fail | maybe possible in template? |
| pandas `read_csv()` | fail | `comment='#'` |
| Google Spreadsheet | added | manual removal afterwards |
| Excel | added | manual: start import at row |
| R `read.csv` | added, but data parsed badly | `comment.char = "#"` |
| R `read_csv` | added, but data parsed badly | `comment = "#"` |

**Frontmatter**

```
---
id: behel_vp_20191015_1415_data.csv
how:
  clutterMap: "clutter.h5"
  dealiased: 0
where:
  lon: 5.176
  lat: 52.101
---
```
| tool | default | ignore solution |
| --- | --- | --- |
| frictionless | ignored, but fails on second file `[file1.csv, file2.csv]` | none, likely bug |
| jekyll | fail | maybe possible in template? |
| pandas `read_csv()` | fail | `skiprows=9` (manual setting of rows) |
| Google Spreadsheet | added | manual removal afterwards |
| Excel | added | manual: start import at row |
| R `read.csv` | added, but data parsed badly | `skip = 9` (manual setting of rows) |
| R `read_csv` | added, but data parsed badly | `skip = 9` (manual setting of rows) |
peterdesmet commented 3 years ago

@adokter it looks like the `#` comments approach is the best one, as it does not rely on knowing the number of non-data lines. Most CSV readers have an option to ignore such lines.

peterdesmet commented 3 years ago

@adokter and here is R code to both read the data and metadata from such a file:

```r
# Load libraries
library(tidyverse)
library(yaml)

# Path of data
data_file <- "examples/vol2bird/behel_vpts_20191015.csv"

# Read data (default csv parser), ignoring "#" comment lines
data <- read.csv(data_file, comment.char = "#")
head(data, 10)

# Read metadata (yaml in comments)
comments <- grep("^\\s*#", readLines(data_file), value = TRUE)
yaml_start_end <- which(comments == "# ---")
yaml_start <- yaml_start_end[1] + 1
yaml_end <- yaml_start_end[2] - 1

raw_yaml <-
  comments[yaml_start:yaml_end] %>%
  str_remove("^# ") %>%
  paste(collapse = "\n")
yaml.load(raw_yaml)
```

So that works at least.

Update: I have adapted the code so it can detect YAML content delimited by `---` within a larger comment block. Feels safer.
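The same extraction can be sketched in Python with the standard library alone: collect the comment lines, find the two `# ---` delimiters, and strip the comment marker while preserving YAML indentation. The file content and field names below are illustrative, not taken from an actual vol2bird file.

```python
import io

def extract_yaml_block(lines):
    """Return the raw YAML text found between '# ---' delimiters
    inside a larger block of '#' comment lines."""
    comments = [line.rstrip("\n") for line in lines if line.lstrip().startswith("#")]
    delims = [i for i, line in enumerate(comments) if line.strip() == "# ---"]
    if len(delims) < 2:
        return ""  # no delimited YAML block found
    block = comments[delims[0] + 1 : delims[1]]
    # Remove the leading "# " marker but keep YAML indentation intact
    return "\n".join(line[2:] if line.startswith("# ") else line.lstrip("#")
                     for line in block)

# Hypothetical file content (field names are illustrative)
sample = io.StringIO(
    "# A human-readable comment, outside the YAML block\n"
    "# ---\n"
    "# id: behel_vp_20191015_1415_data.csv\n"
    "# where:\n"
    "#   lon: 5.176\n"
    "# ---\n"
    "radar,datetime\n"
    "behel,2019-10-15T14:15\n"
)

print(extract_yaml_block(sample))
```

The raw string can then be handed to any YAML parser (e.g. PyYAML's `yaml.safe_load`), mirroring the `yaml.load()` call in the R version.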

niconoe commented 3 years ago

Would the header always be the same number of lines? I'd prefer a solution where this can vary (more flexible, and not much more complex). In other words, I'm not sure the `skip = 9` (and related) suggestions in the table above are enough (but it wouldn't be rocket science to write a small wrapper on top of the CSV parsers that detects the number of lines to ignore).

niconoe commented 3 years ago

Python solution (including use of iterators to deal with large files without loading everything in memory): https://stackoverflow.com/questions/14158868/python-skip-comment-lines-marked-with-in-csv-dictreader
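The linked approach can be sketched with the standard library alone: a generator filters out comment lines lazily, so the reader never holds the whole file in memory. The file content and column names below are illustrative, not the actual vol2bird output.

```python
import csv
import io

def decomment(lines):
    """Lazily yield only non-comment lines, so large files
    never need to be fully loaded into memory."""
    for line in lines:
        if not line.lstrip().startswith("#"):
            yield line

# Hypothetical file content (column names are illustrative)
sample = io.StringIO(
    "# id: behel_vp_20191015_1415_data.csv\n"
    "# where:\n"
    "#   lon: 5.176\n"
    "radar,datetime,height\n"
    "behel,2019-10-15T14:15,200\n"
)

for row in csv.DictReader(decomment(sample)):
    print(row["radar"], row["height"])
```

Because `DictReader` consumes the filtered iterator directly, the first non-comment line is used as the header row, regardless of how many comment lines precede it.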

niconoe commented 3 years ago

Shouldn't we start a document somewhere where we document the whole file format (not only the header, and not only its content)?

niconoe commented 3 years ago

Another important question for the file definition: do we guarantee that all commented lines contain YAML (if so, we lose the possibility to add human-readable comments there later), or not (in that case, we need a machine-friendly mechanism to detect where the YAML starts and ends)?

peterdesmet commented 3 years ago

@niconoe

https://github.com/enram/vpts/issues/7#issuecomment-828289649: No, I don't want to rely on the header always being the same number of lines, which is why I indicated those solutions with manual. The comments approach does not have that issue, which is why I prefer it.

https://github.com/enram/vpts/issues/7#issuecomment-828293284: Yes, that will be documented in the http://github.com/adokter/vol2bird repo

https://github.com/enram/vpts/issues/7#issuecomment-828297341: No, which is why I adapted my code. The yaml block should be delimited by --- (standard yaml approach). See https://github.com/enram/vpts/blob/main/examples/vol2bird/behel_vpts_20191015.csv

peterdesmet commented 3 years ago

@adokter on @niconoe's suggestion, I have started a separate repo to write and test parsing code for the new vol2bird csv files (R and Python). That way we can immediately test that what is being spit out by vol2bird is parsable. The R code is already written (longer version of code above): https://github.com/enram/vp-parser/blob/main/R/parser.R

niconoe commented 3 years ago

Over D3/JS/CROW:

niconoe commented 3 years ago

@peterdesmet : I remember this question being discussed in a slightly different context (about an "intermediate" format to use before agreeing on a proper standard).

Now that we have changed our minds and want to make vpts a more official exchange format: are you still attached to the idea of embedding metadata as YAML within CSV comments, or would you consider a more straightforward approach where the CSV contains only data and all metadata is moved to datapackage.json?

peterdesmet commented 3 years ago

The CSV should only contain the data (no comment lines or yaml).