Unique name for datasets

AlexisRenchon commented 2 years ago

Hi Ben et al.,

I like to have a unique, short name / identifier for each dataset, which I retrieve from the filenames. COSORE filenames look like, e.g., data_d20190626_VARGAS.csv

from which I retrieve "VARGAS" as an ID. However, some authors have multiple datasets, e.g., data_d20200305_VARGAS.csv

I can work around this with a script that assigns "VARGAS_2" as the ID for the second dataset, but maybe it could be done directly in COSORE.

For example, this is already done for data_d20200212_KAYE_LNE.csv and data_d20200212_KAYE_LNW.csv

or data_d20190610_SIHI_H1.csv and data_d20190610_SIHI_H2.csv
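A minimal sketch of the "VARGAS_2" workaround described above, assuming the short names have already been extracted from the filenames. The function name `unique_ids` is hypothetical and not part of COSORE. It also assumes that no dataset is already literally named, e.g., "VARGAS_2", which would collide with a generated ID.

```julia
# Hypothetical helper (not part of COSORE): make short dataset names unique by
# appending "_2", "_3", ... to repeated names, in order of appearance.
function unique_ids(names::AbstractVector{<:AbstractString})
    counts = Dict{String,Int}()   # how many times each name has been seen so far
    ids = String[]
    for name in names
        n = get(counts, name, 0) + 1
        counts[name] = n
        # The first occurrence keeps its name; later ones get a numeric suffix
        push!(ids, n == 1 ? String(name) : string(name, "_", n))
    end
    return ids
end

unique_ids(["VARGAS", "KAYE_LNE", "VARGAS"])
```

Because `readdir` returns filenames sorted, the dataset with the earlier date in its filename would keep the plain name and the later one would get the suffix.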

This is not an essential change, but it could make things slightly easier for users.

Best, Alexis

NOTE: here's what my script currently looks like:

julia> # get the path of all COSORE input files
       inputs = readdir(joinpath("Input", "COSORE", "datasets"), join = true);

julia> # example of a path name
       inputs[1]
"Input/COSORE/datasets/data_d20190409_ANJILELI.csv"

julia> # Retrieve a short name for each dataset
       Names = []
Any[]

julia> # in the loop below,
       # 38 is the index of the first character after the 37-character prefix, e.g., "Input/COSORE/datasets/data_d20190409_"
       # 4 is the number of characters in ".csv"
       [push!(Names, inputs[i][38:end-4]) for i = 1:length(inputs)];

julia> Names
82-element Vector{Any}:
 "ANJILELI"
 "ZOU"
 "VARNER"
 "ZHANG_maple"
 "ZHANG_oak"

julia> # Create a Dictionary with name => dataframe
       # e.g., Data["ZOU"] is ZOU site dataframe
       Data = Dict(Names .=> [[] for i in 1:length(Names)]);

julia> [push!(Data[Names[i]], DataFrame(CSV.File(inputs[i]))) for i = 1:length(Names)];

julia> # Example
       Data["ZOU"][1]
82314×10 DataFrame
   Row │ CSR_PORT  CSR_TIMESTAMP_BEGIN  CSR_TIMESTAMP_END    CSR_FLUX_CO2  CSR_FLUX_CH4  CSR_ ⋯
       │ Int64     String               String               Float64       String        Stri ⋯
───────┼───────────────────────────────────────────────────────────────────────────────────────
     1 │        1  2013-12-01 00:15:58  2013-12-01 00:17:58          1.52  NA            Exp  ⋯
     2 │        2  2013-12-01 00:19:42  2013-12-01 00:21:42          1.55  NA            Exp
     3 │        3  2013-12-01 00:23:26  2013-12-01 00:25:26          0.99  NA            Lin
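The hard-coded offset 38 in the script above only works while the files live under a directory prefix of exactly that length. A small sketch of an alternative that parses the filename itself, assuming every file follows the data_d<8 digits>_<NAME>.csv pattern seen in the examples. The function name `dataset_name` is hypothetical.

```julia
# Hypothetical helper: pull the short name out of a COSORE-style path such as
# "Input/COSORE/datasets/data_d20190409_ANJILELI.csv".
# Assumes the pattern data_d<8 digits>_<NAME>.csv; anything else raises an error.
function dataset_name(path::AbstractString)
    m = match(r"^data_d\d{8}_(.+)\.csv$", basename(path))
    m === nothing && error("unexpected filename: $path")
    return String(m[1])
end

dataset_name("Input/COSORE/datasets/data_d20190409_ANJILELI.csv")  # "ANJILELI"
dataset_name("data_d20200212_KAYE_LNE.csv")                         # "KAYE_LNE"
```

With this, `Names = dataset_name.(inputs)` would give a `Vector{String}` regardless of where the datasets folder is located.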

bpbond commented 2 years ago

Hi @AlexisRenchon -

I'm not sure I understand. The existing COSORE dataset name doesn't work for you because you might get two "VARGAS" names if you just strip out the initial date part? I.e., the names aren't guaranteed to be unique after stripping the date? Just want to make sure I understand the need/use case here. Thanks!

AlexisRenchon commented 2 years ago

Hi @bpbond -

Yes, you understood correctly! I would like to have unique names after stripping out the initial date part, just like you said. It is not a big deal, but it could be a small convenience improvement for some users in future versions.

As I explained, I load the COSORE database into a "matrix of matrices" (a Dict of DataFrames) called Data, and then I access each dataset by a short unique name, e.g., Data["VARGAS"]. I could use the full name with the date, but that would make it long to type, etc. Am I the only person doing this?

If this is just me, I could create my own Array with short names mapped to each dataset (instead of stripping out dates from filenames).

I am closing the issue. Feel free to change the filenames or not; these were just some thoughts =)

AlexisRenchon commented 2 years ago

Hi @bpbond, I know I closed this issue, but I am coming back to it briefly. I am doing some work with FLUXNET, and this reminded me of their standardized site IDs, e.g., AU-Ade, US-Me1, US-Me2, ... which are pretty neat: two capital letters for the country, a dash, then three characters (three letters if the site is unique, two letters and one number if there are multiple sites). COSORE could do something similar, as it is also a global database. Even better, using the same convention as FLUXNET could help quickly identify sites that have both a flux tower and auto-resp, e.g., AU-Cum.
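As a rough illustration of the ID convention described above, here is a sketch that checks whether a string has that shape. The pattern is an assumption based only on the description in this comment (two capital letters, a dash, three letters or digits), not an official FLUXNET specification, and `is_site_id` is a hypothetical name.

```julia
# Assumed pattern, based only on the description above: two capital letters,
# a dash, then three letters or digits (e.g., "AU-Ade", "US-Me1").
is_site_id(id::AbstractString) = occursin(r"^[A-Z]{2}-[A-Za-z0-9]{3}$", id)

is_site_id("AU-Cum")   # true
is_site_id("US-Me2")   # true
is_site_id("VARGAS")   # false
```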

bpbond / cosore

Unique name for datasets #242