aodn / aodn-portal

AODN Open Geospatial Portal
https://portal.aodn.org.au/
GNU General Public License v3.0
21 stars 13 forks source link

CSV format needs a CommentStyle character for header #1275

Closed ggalibert closed 10 years ago

ggalibert commented 10 years ago

Now that the CSV file has a metadata header, it would be of great help to distinguish the header from the data by adding a % character for example at the beginning of every header line in order to distinguish them from the data.

What we get data_column_name,cf_standard_name,imos_vocabulary_name,unit_name,unit_short_name,unit_info CPHL,mass_concentration_of_chlorophyll_in_sea_water,Concentration of chlorophyll per unit volume of the water body,Milligrams per cubic metre,mg m-3, DEPTH,depth,Depth below surface of the water body,Metres,m, DOX1_1,concentration_of_oxygen_in_sea_water,Concentration of oxygen {O2} per unit volume of the water body,Micromoles per litre,umol l-1, DOX1_2,concentration_of_oxygen_in_sea_water,Concentration of oxygen {O2} per unit volume of the water body,Micromoles per litre,umol l-1, DOX1_3,concentration_of_oxygen_in_sea_water,Concentration of oxygen {O2} per unit volume of the water body,Micromoles per litre,umol l-1, DOX2,moles_of_oxygen_per_unit_mass_in_sea_water,Concentration of oxygen {O2} per unit mass of the water body,Micromoles per kilogram,umol kg-1, LATITUDE,latitude,Latitude north,Degrees,deg,Degrees North LONGITUDE,longitude,Longitude east,Degrees,deg,Degrees East PRES_REL,sea_water_pressure_due_to_seawater,Pressure (measured variable) in the water body exerted by overlying sea water only,Decibars,dBar, PSAL,sea_water_salinity,Practical salinity of the water body,Practical Salinity Unit,PSU, TEMP,sea_water_temperature,Temperature of the water body,Degrees Celsius,degC, TURB,,Turbidity of the water body,Nephelometric Turbidity Units,NTU,

qc_scheme_short_name,flag_value,flag_meaning,flag_description IMOS IODE,0,No QC performed,The level at which all data enter the working archive. They have not yet been quality controlled IMOS IODE,1,Good data,Top quality data in which no malfunctions have been identified and all real features have been verified during the quality control process IMOS IODE,2,Probably good data,Good data in which some features (probably real) are present but these are unconfirmed. Code 2 data are also data in which minor malfunctions may be present but these errors are small and/or can be successfully corrected without seriously affecting the overall quality of the data. IMOS IODE,3,Bad data that are potentially correctable,Suspect data in which unusual,, and probably erroneous features are observed IMOS IODE,4,Bad data,Obviously erroneous values are observed IMOS IODE,5,Value changed,Altered by a QC Centre,, with original values (before the change) preserved in the history record of the profile. eMII discourage the use of this flag. Where data values must be changed (e.g. smoothing of data sets) we strongly prefer that the original data be retained and an additional variable be added to accommodate the interpolated/corrected data values. IMOS IODE,6,Not used,Flag 6 is reserved for future use IMOS IODE,7,Not used,Flag 7 is reserved for future use IMOS IODE,8,Interpolated value,Indicates that data values are interpolated IMOS IODE,9,Missing value,Indicates that the element is missing

FID,timeseries_id,index,site_code,platform_code,deployment_code,instrument_nominal_depth,TIME,LATITUDE,LONGITUDE,DEPTH,DEPTH_burst_sd,DEPTH_burst_min,DEPTH_burst_max,DEPTH_num_obs,TEMP,TEMP_burst_sd,TEMP_burst_min,TEMP_burst_max,TEMP_num_obs,PRES_REL,PRES_REL_burst_sd,PRES_REL_burst_min,PRES_REL_burst_max,PRES_REL_num_obs,PSAL,PSAL_burst_sd,PSAL_burst_min,PSAL_burst_max,PSAL_num_obs,DOX1_1,DOX1_1_burst_sd,DOX1_1_burst_min,DOX1_1_burst_max,DOX1_1_num_obs,DOX1_2,DOX1_2_burst_sd,DOX1_2_burst_min,DOX1_2_burst_max,DOX1_2_num_obs,DOX1_3,DOX1_3_burst_sd,DOX1_3_burst_min,DOX1_3_burst_max,DOX1_3_num_obs,DOX2,DOX2_burst_sd,DOX2_burst_min,DOX2_burst_max,DOX2_num_obs,TURB,TURB_burst_sd,TURB_burst_min,TURB_burst_max,TURB_num_obs,CPHL,CPHL_burst_sd,CPHL_burst_min,CPHL_burst_max,CPHL_num_obs,CHLU,CHLU_burst_sd,CHLU_burst_min,CHLU_burst_max,CHLU_num_obs,CHLF,CHLF_burst_sd,CHLF_burst_min,CHLF_burst_max,CHLF_num_obs,geom,depth_b,sea_water_temperature_b,sea_water_pressure_due_to_sea_water_b,sea_water_salinity_b,mole_concentration_of_dissolved_molecular_oxygen_in_sea_water_b,moles_of_oxygen_per_unit_mass_in_sea_water_b,turbidity_b,mass_concentration_of_chlorophyll_in_sea_water_b anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46cb,1426,0,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:14:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46ca,1426,1,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:29:14,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c9,1426,2,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:44:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c8,1426,3,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:59:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c7,1426,4,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:14:14,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c6,1426,5,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:29:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c5,1426,6,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:44:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c4,1426,7,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:59:14,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true

What we would like % data_column_name,cf_standard_name,imos_vocabulary_name,unit_name,unit_short_name,unit_info % CPHL,mass_concentration_of_chlorophyll_in_sea_water,Concentration of chlorophyll per unit volume of the water body,Milligrams per cubic metre,mg m-3, % DEPTH,depth,Depth below surface of the water body,Metres,m, % DOX1_1,concentration_of_oxygen_in_sea_water,Concentration of oxygen {O2} per unit volume of the water body,Micromoles per litre,umol l-1, % DOX1_2,concentration_of_oxygen_in_sea_water,Concentration of oxygen {O2} per unit volume of the water body,Micromoles per litre,umol l-1, % DOX1_3,concentration_of_oxygen_in_sea_water,Concentration of oxygen {O2} per unit volume of the water body,Micromoles per litre,umol l-1, % DOX2,moles_of_oxygen_per_unit_mass_in_sea_water,Concentration of oxygen {O2} per unit mass of the water body,Micromoles per kilogram,umol kg-1, % LATITUDE,latitude,Latitude north,Degrees,deg,Degrees North % LONGITUDE,longitude,Longitude east,Degrees,deg,Degrees East % PRES_REL,sea_water_pressure_due_to_seawater,Pressure (measured variable) in the water body exerted by overlying sea water only,Decibars,dBar, % PSAL,sea_water_salinity,Practical salinity of the water body,Practical Salinity Unit,PSU, % TEMP,sea_water_temperature,Temperature of the water body,Degrees Celsius,degC, % TURB,,Turbidity of the water body,Nephelometric Turbidity Units,NTU, % % qc_scheme_short_name,flag_value,flag_meaning,flag_description % IMOS IODE,0,No QC performed,The level at which all data enter the working archive. They have not yet been quality controlled % IMOS IODE,1,Good data,Top quality data in which no malfunctions have been identified and all real features have been verified during the quality control process % IMOS IODE,2,Probably good data,Good data in which some features (probably real) are present but these are unconfirmed. Code 2 data are also data in which minor malfunctions may be present but these errors are small and/or can be successfully corrected without seriously affecting the overall quality of the data. % IMOS IODE,3,Bad data that are potentially correctable,Suspect data in which unusual,, and probably erroneous features are observed % IMOS IODE,4,Bad data,Obviously erroneous values are observed % IMOS IODE,5,Value changed,Altered by a QC Centre,, with original values (before the change) preserved in the history record of the profile. eMII discourage the use of this flag. Where data values must be changed (e.g. smoothing of data sets) we strongly prefer that the original data be retained and an additional variable be added to accommodate the interpolated/corrected data values. % IMOS IODE,6,Not used,Flag 6 is reserved for future use % IMOS IODE,7,Not used,Flag 7 is reserved for future use % IMOS IODE,8,Interpolated value,Indicates that data values are interpolated % IMOS IODE,9,Missing value,Indicates that the element is missing % FID,timeseries_id,index,site_code,platform_code,deployment_code,instrument_nominal_depth,TIME,LATITUDE,LONGITUDE,DEPTH,DEPTH_burst_sd,DEPTH_burst_min,DEPTH_burst_max,DEPTH_num_obs,TEMP,TEMP_burst_sd,TEMP_burst_min,TEMP_burst_max,TEMP_num_obs,PRES_REL,PRES_REL_burst_sd,PRES_REL_burst_min,PRES_REL_burst_max,PRES_REL_num_obs,PSAL,PSAL_burst_sd,PSAL_burst_min,PSAL_burst_max,PSAL_num_obs,DOX1_1,DOX1_1_burst_sd,DOX1_1_burst_min,DOX1_1_burst_max,DOX1_1_num_obs,DOX1_2,DOX1_2_burst_sd,DOX1_2_burst_min,DOX1_2_burst_max,DOX1_2_num_obs,DOX1_3,DOX1_3_burst_sd,DOX1_3_burst_min,DOX1_3_burst_max,DOX1_3_num_obs,DOX2,DOX2_burst_sd,DOX2_burst_min,DOX2_burst_max,DOX2_num_obs,TURB,TURB_burst_sd,TURB_burst_min,TURB_burst_max,TURB_num_obs,CPHL,CPHL_burst_sd,CPHL_burst_min,CPHL_burst_max,CPHL_num_obs,CHLU,CHLU_burst_sd,CHLU_burst_min,CHLU_burst_max,CHLU_num_obs,CHLF,CHLF_burst_sd,CHLF_burst_min,CHLF_burst_max,CHLF_num_obs,geom,depth_b,sea_water_temperature_b,sea_water_pressure_due_to_sea_water_b,sea_water_salinity_b,mole_concentration_of_dissolved_molecular_oxygen_in_sea_water_b,moles_of_oxygen_per_unit_mass_in_sea_water_b,turbidity_b,mass_concentration_of_chlorophyll_in_sea_water_b anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46cb,1426,0,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:14:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46ca,1426,1,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:29:14,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c9,1426,2,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:44:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c8,1426,3,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T00:59:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c7,1426,4,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:14:14,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c6,1426,5,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:29:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c5,1426,6,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:44:13.999,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true anmn_burst_avg_timeseries_data.fid--7f92ca0c1473df2a94c-46c4,1426,7,NRSMAI,NRSMAI-SubSurface,NRSMAI-SubSurface-121018,20,2012-10-18T01:59:14,-42.5967,148.2337,,,,,0,,,,,0,,,,,0,,,,,0,,,,,,,,,,0,,,,,,,,,,0,,,,,0,,,,,0,,,,,,,,,,,POINT (148.23371887207 -42.5966987609863),true,true,true,true,true,true,true,true

ggalibert commented 10 years ago

We could consider the following comment character '#' since it appears to be the only one automatically recognised by R (maybe @xhoenner can confirm). Matlab or Python can accommodate any comment character.

xhoenner commented 10 years ago

@ggalibert yes, # is the default character for commenting in R. As Fay suggested in her email, I suspect it'll be a lot easier for most people out there if the metadata header was in a separate csv file, and available for download as an option.

ggalibert commented 10 years ago

I'm OK to leave the metadata in a separate file but I'm a bit a dictator on this topic and wouldn't like the user to have an option to not download this metadata file. Both files could be zipped before downloading them which could actually improve the CSV download a great deal, saving some download bandwidth!

ggalibert commented 10 years ago

By the way, just for the record, R can handle any kind of comment style in an ASCII file and more specific features using read.csv(). I'm pretty much still in favour of everything in one file if possible (probably less technically challenging than 2 files being zipped...).

Ex. : data <- read.csv("test.csv", header = TRUE, sep = ",", comment.char = "%")

ggalibert commented 10 years ago

I'm changing the label of this issue to "bug". The problem here is a blocking one for everyone downloading CSV files since there is no easy way to figure out when does the metadata finishes and when the data starts in the file.

xhoenner commented 10 years ago

@ggalibert Yes R can open up your 'test.csv' file using the read.csv function with comment.char however users would still have to delete manually the metadata header (which might be an issue for large csv files) to be able to analyse the data since the data frame 'data' outputted by the read.csv function only has a single column. Adding the 'skip' argument to that function to disregard all the rows from the metadata header doesn't seem to make much of a difference, 'data' still remains a data frame with a single column.

I'm not saying there's no way around it, but the read.csv function certainly isn't such an easy solution and at this stage I remain much more inclined to have the metadata header in a separate file.

xhoenner commented 10 years ago

My bad, this works with the csv that @ggalibert sent me, R skips all the rows starting with a '%' and then reads the remaining rows as a data frame, all good.

I nonetheless still thinks it would be more straightforward for users to have the data and metadata header in separate files. Since we deliver a product, it is our task to make the life of customers as easy as possible to facilitate the usage of IMOS data and attract more people. I also like the fact that when I buy a tin of soup, the information regarding ingredients used for the soup is not hidden somewhere in the soup ^_^.

ggalibert commented 10 years ago

Haha :-)

Well, it implies that you're boiling only the soup, not the tin+soup! Some are apparently trying to digest both... ;-)