EDIorg / ecocomDP

A dataset design pattern and R package for ecological community data.
https://ediorg.github.io/ecocomDP/

Develop a naming scheme, so that not all tables of a type have the same name. #9

Closed mobb closed 5 years ago

mobb commented 7 years ago

Corinna's comment:

When I use the ecocom_dp for any new incoming community datasets, I can’t give all the files the same name. I.e., I can’t just use ‘observation’, ‘event’, etc. over and over again. I am already stumbling on this one dataset because they gave me raw observations (several per lake) and then summaries for each lake. I do want to archive both, as people probably want both. So, what I have done for the file names now is prefix them with the study and then postfix them with raw and summary, i.e., NTL_RS_Macrophytes_observation_raw.csv and NTL_RS_Macrophytes_observation_summary.csv. Of course, they could go into one file, but I am sure that would make it very difficult to use.

clnsmth commented 7 years ago

The naming convention proposed by @cgries makes sense to me (i.e. studyName_ecocomDPTableName). To accommodate versioning I propose we use _vNUMBER (e.g. NTL_RS_observation_summary_v15.csv for version fifteen). What do you think?
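A minimal sketch of how a name following this proposal could be assembled and its version read back, in Python. The helper names `versioned_name` and `parse_version` are hypothetical and not part of ecocomDP; the `_vNUMBER` suffix is assumed to sit directly before the file extension.

```python
import re

def versioned_name(study, table, version, ext="csv"):
    """Build studyName_tableName_vNUMBER.ext (hypothetical helper)."""
    return f"{study}_{table}_v{version}.{ext}"

def parse_version(file_name):
    """Return the integer after a trailing _v, or None if there is none."""
    match = re.search(r"_v(\d+)\.[^.]+$", file_name)
    return int(match.group(1)) if match else None

name = versioned_name("NTL_RS", "observation_summary", 15)
print(name, parse_version(name))
```

Anchoring the pattern at the end of the name keeps a study abbreviation that happens to contain `_v` from being mistaken for a version.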

mobb commented 7 years ago

Another option is to use the source packageId, as in edi_5_2_summary.csv.

clnsmth commented 7 years ago

This is a good option as well. It seems the naming convention can be flexible but must include the ecocomDP table names. The L1 aggregator function should have little trouble identifying which of the 7 ecocomDP tables it is working with.
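To illustrate why a flexible convention is workable, here is a rough Python sketch of how a table name could be picked out of an arbitrary file name. It is not the aggregator's actual logic. `detect_table` is a hypothetical helper, and `TABLE_NAMES` is only an illustrative subset, since the full list of table names is not spelled out in this thread.

```python
import os
import re

# Illustrative subset only (an assumption, not the full ecocomDP table list).
TABLE_NAMES = ["observation", "location", "taxon"]

def detect_table(file_name, table_names=TABLE_NAMES):
    """Return the table name found as an underscore-delimited part of the
    file name, or None. Longer names are tried first so a longer table
    name is not shadowed by a shorter one it contains."""
    stem = os.path.splitext(os.path.basename(file_name))[0].lower()
    for name in sorted(table_names, key=len, reverse=True):
        if re.search(rf"(^|_){re.escape(name)}(_|$)", stem):
            return name
    return None

for f in ["NTL_RS_Macrophytes_observation_raw.csv", "edi_5_2_taxon.csv", "notes.csv"]:
    print(f, "->", detect_table(f))
```

A study abbreviation that itself contains a table name (for example a project called "location") would defeat this kind of matching, which is one argument for fixing the position of the table name within the file name.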

mobb commented 6 years ago

I reread Corinna's original comment. Her problem is different: the dataset has two tables that could each be considered primary observations, and she is suggesting they could both remain as such. We need the dataset id (@cgries, please comment).

cgries commented 6 years ago

It's an early comment and I probably need to overhaul this whole dataset now: https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-ntl.338.1

clnsmth commented 6 years ago

Hi @mobb and @cgries. Any progress on this front? I've created /documentation/practices/naming_tables.md to convey the recommendation once it's formulated.

cgries commented 6 years ago

There are a number of recommendations out there already that we could just adopt, e.g. https://daac.ornl.gov/datamanagement/#descriptive_filenames: File names should reflect the contents of the file and uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.

File names should be constructed to contain only lower-case letters, numbers, and underscores – no spaces or special characters – for easy management by various data systems and to decrease software and platform dependency. Similar logic is useful when designing directory structures and names.
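As a sketch of what applying this recommendation could look like, the following Python function normalizes an arbitrary file name to lower-case letters, numbers, and underscores. `sanitize_file_name` is a hypothetical helper. It assumes the text after the last dot is the file extension, which is kept, since the quoted recommendation does not address extensions.

```python
import re

def sanitize_file_name(name):
    """Lower-case the name and collapse every run of characters other than
    a-z and 0-9 into a single underscore; keep the extension."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    stem = re.sub(r"[^a-z0-9]+", "_", stem.lower()).strip("_")
    return f"{stem}.{ext.lower()}" if ext else stem

print(sanitize_file_name("NTL RS Macrophytes-Observation (raw).CSV"))
```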

clnsmth commented 5 years ago

The implemented practice is a file name:

  1. Composed of lowercase letters and underscores.
  2. Beginning with an abbreviated dataset and/or project name (e.g. "w_fish_size").
  3. Concluding with the relevant ecocomDP table name (e.g. "observation").
  4. Joined with an underscore (e.g. "w_fish_size_observation.csv").
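The four rules above can be sketched as a small builder that also checks its result, shown here in Python. `table_file_name` and `PRACTICE_PATTERN` are hypothetical names, not part of the ecocomDP package, and the `.csv` default is taken from the example in rule 4.

```python
import re

# Rule 1: lowercase letters and underscores only (checked on the stem).
PRACTICE_PATTERN = re.compile(r"^[a-z]+(_[a-z]+)*$")

def table_file_name(dataset_abbrev, table, ext="csv"):
    """Rules 2-4: dataset abbreviation first, table name last, joined by '_'."""
    stem = f"{dataset_abbrev}_{table}"
    if not PRACTICE_PATTERN.match(stem):
        raise ValueError(f"{stem!r} is not lowercase letters joined by underscores")
    return f"{stem}.{ext}"

print(table_file_name("w_fish_size", "observation"))
```

Under this check, an upper-case prefix such as "NTL_RS" from earlier in the thread raises a `ValueError` and would need to be lowercased first.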

Example data package in the EDI Data Repository.

The above practice doesn't guarantee globally unique names, but globally unique names are not needed until the reuse/aggregation step, which is taken care of by the aggregate_ecocomDP() function.