enram / vpts-csv

Data exchange format for biological signals detected by weather radars
https://aloftdata.eu/vpts-csv/
MIT License

CSV ignore lines #7

Closed · peterdesmet closed this issue 3 years ago

peterdesmet commented 3 years ago

Here is how different tools handle comment lines vs. frontmatter lines:

**Comment lines**

```
# id: behel_vp_20191015_1415_data.csv
# how:
#   clutterMap: "clutter.h5"
#   dealiased: 0
# where:
#   lon: 5.176
#   lat: 52.101
```
| tool | default | ignore solution |
| --- | --- | --- |
| frictionless | ignored, but fails on second file `[file1.csv, file2.csv]` | none, likely bug |
| jekyll | fail | maybe possible in template? |
| pandas `read_csv()` | fail | `comment='#'` |
| Google Spreadsheet | added | manual removal afterwards |
| Excel | added | manual: start import at row |
| R `read.csv` | added, but data parsed badly | `comment.char = "#"` |
| R `read_csv` | added, but data parsed badly | `comment = "#"` |

**Frontmatter**

```
---
id: behel_vp_20191015_1415_data.csv
how:
  clutterMap: "clutter.h5"
  dealiased: 0
where:
  lon: 5.176
  lat: 52.101
---
```
| tool | default | ignore solution |
| --- | --- | --- |
| frictionless | ignored, but fails on second file `[file1.csv, file2.csv]` | none, likely bug |
| jekyll | fail | maybe possible in template? |
| pandas `read_csv()` | fail | `skiprows=9` (manual setting of rows) |
| Google Spreadsheet | added | manual removal afterwards |
| Excel | added | manual: start import at row |
| R `read.csv` | added, but data parsed badly | `skip = 9` (manual setting of rows) |
| R `read_csv` | added, but data parsed badly | `skip = 9` (manual setting of rows) |
peterdesmet commented 3 years ago

@adokter it looks like the `#` comments approach is the best one, as it does not rely on knowing the number of non-data lines. Most CSV readers have an option to ignore such lines.

peterdesmet commented 3 years ago

@adokter and here is R code to both read the data and metadata from such a file:

```r
# Load libraries
library(tidyverse)
library(yaml)

# Path of data
data_file <- "examples/vol2bird/behel_vpts_20191015.csv"

# Read data (default csv parser), ignoring "#" comment lines
data <- read.csv(data_file, comment.char = "#")
head(data, 10)

# Read metadata (yaml in comments)
comments <- grep("^\\s*#", readLines(data_file), value = TRUE)
yaml_start_end <- which(comments == "# ---")
yaml_start <- yaml_start_end[1] + 1
yaml_end <- yaml_start_end[2] - 1

raw_yaml <-
  comments[yaml_start:yaml_end] %>%
  str_remove("^# ") %>%
  paste(collapse = "\n")
yaml.load(raw_yaml)
```

So that works at least.

Update: I have adapted the code so it can detect YAML content delimited by `---` within a larger comment block. Feels safer.
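The same extraction can be sketched in Python with the standard library alone: collect the comment lines, find the two `# ---` delimiters, and strip the comment marker while preserving YAML indentation. The file content and field names below are illustrative, not taken from an actual vol2bird file.

```python
import io

def extract_yaml_block(lines):
    """Return the raw YAML text found between '# ---' delimiters
    inside a larger block of '#' comment lines."""
    comments = [line.rstrip("\n") for line in lines if line.lstrip().startswith("#")]
    delims = [i for i, line in enumerate(comments) if line.strip() == "# ---"]
    if len(delims) < 2:
        return ""  # no delimited YAML block found
    block = comments[delims[0] + 1 : delims[1]]
    # Remove the leading "# " marker but keep YAML indentation intact
    return "\n".join(line[2:] if line.startswith("# ") else line.lstrip("#")
                     for line in block)

# Hypothetical file content (field names are illustrative)
sample = io.StringIO(
    "# A human-readable comment, outside the YAML block\n"
    "# ---\n"
    "# id: behel_vp_20191015_1415_data.csv\n"
    "# where:\n"
    "#   lon: 5.176\n"
    "# ---\n"
    "radar,datetime\n"
    "behel,2019-10-15T14:15\n"
)

print(extract_yaml_block(sample))
```

The raw string can then be handed to any YAML parser (e.g. PyYAML's `yaml.safe_load`), mirroring the `yaml.load()` call in the R version.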

niconoe commented 3 years ago

Would the header always be the same number of lines? I'd prefer a solution where this can vary (more flexible, and not much more complex). In other words, I'm not sure the `skip = 9` (and related) suggestions in the table above are enough (but it wouldn't be rocket science to write a small wrapper on top of the CSV parsers that detects the number of lines to ignore).

niconoe commented 3 years ago

Python solution (including use of iterators to deal with large files without loading everything in memory): https://stackoverflow.com/questions/14158868/python-skip-comment-lines-marked-with-in-csv-dictreader
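The linked approach can be sketched with the standard library alone: a generator filters out comment lines lazily, so the reader never holds the whole file in memory. The file content and column names below are illustrative, not the actual vol2bird output.

```python
import csv
import io

def decomment(lines):
    """Lazily yield only non-comment lines, so large files
    never need to be fully loaded into memory."""
    for line in lines:
        if not line.lstrip().startswith("#"):
            yield line

# Hypothetical file content (column names are illustrative)
sample = io.StringIO(
    "# id: behel_vp_20191015_1415_data.csv\n"
    "# where:\n"
    "#   lon: 5.176\n"
    "radar,datetime,height\n"
    "behel,2019-10-15T14:15,200\n"
)

for row in csv.DictReader(decomment(sample)):
    print(row["radar"], row["height"])
```

Because `DictReader` consumes the filtered iterator directly, the first non-comment line is used as the header row, regardless of how many comment lines precede it.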

niconoe commented 3 years ago

Shouldn't we start a document somewhere where we document the whole file format (not only the header, and not only its content)?

niconoe commented 3 years ago

Another important question for the file definition: do we guarantee that all commented lines contain YAML (if so, we lose the possibility to add human-readable comments there later), or not (in that case, we need a machine-friendly mechanism to detect where the YAML starts and ends)?

peterdesmet commented 3 years ago

@niconoe

https://github.com/enram/vpts/issues/7#issuecomment-828289649: No, I don't want to rely on the header always being the same number of lines, which is why I indicated those solutions with manual. The comments approach does not have that issue, which is why I prefer it.

https://github.com/enram/vpts/issues/7#issuecomment-828293284: Yes, that will be documented in the http://github.com/adokter/vol2bird repo

https://github.com/enram/vpts/issues/7#issuecomment-828297341: No, which is why I adapted my code. The yaml block should be delimited by --- (standard yaml approach). See https://github.com/enram/vpts/blob/main/examples/vol2bird/behel_vpts_20191015.csv

peterdesmet commented 3 years ago

@adokter on @niconoe's suggestion, I have started a separate repo to write and test parsing code for the new vol2bird csv files (R and Python). That way we can immediately test that what is being spit out by vol2bird is parsable. The R code is already written (longer version of code above): https://github.com/enram/vp-parser/blob/main/R/parser.R

niconoe commented 3 years ago

Over D3/JS/CROW:

niconoe commented 3 years ago

@peterdesmet : I remember this question being discussed in a slightly different context (about an "intermediate" format to use before agreeing on a proper standard).

Now that we have changed our minds and want to make vpts a more official exchange format: are you still attached to the idea of embedding metadata as YAML within CSV comments, or would you consider a more straightforward approach where the CSV contains only data and all metadata is moved to datapackage.json?

peterdesmet commented 3 years ago

The CSV should only contain the data (no comment lines or yaml).