SoilBGC-Datashare / sidb

Soil Incubation Database sidb
https://soilbgc-datashare.github.io/sidb/
MIT License

Inconsistencies in yaml.load_file #3

Closed (jb388 closed this issue 5 years ago)

jb388 commented 5 years ago

@crlsierra Some of the .yaml files load with a different structure despite looking visually identical.

For example, templates Rey2008, Haddix2011, Doetterl2015, Craine2010NatGeo, and Bradford2010 all load the fields "MAP" and/or "MAT" within siteInfo as lists, whereas all other templates load these fields as one-dimensional vectors.

A similar issue happens when reading the variables table. I wrote some code to fix this, but 11 out of the 31 templates read every field within each Vn list as a list, whereas in the remaining 20 templates every field is either a factor or an integer vector.

Any insight into this? I'm thinking maybe it could be a character encoding thing, i.e. something invisible to the eye, but picked up by the parser when loading the files. Perhaps this is linked to the computer on which the template originated, which unfortunately is nearly impossible to trace.

crlsierra commented 5 years ago

I think you're right. This may be a parsing problem related to the character encoding of the computer where the entry was created. I will look at this in more detail.

crlsierra commented 5 years ago

The problem is due to different types of elements in the arrays. For example, if the yaml file looks like this:

MAT:
 - 100
 - 95.9

the R function yaml.load() will interpret the first element as an integer and the second element as numeric (float). Because the two elements have different types, the function returns them as a list and not as a vector. The best solution is to write arrays using the same type, e.g.

MAT: 
 - 100.0
 - 95.9
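The difference can be reproduced in an R session. This is a minimal sketch, assuming the yaml package is installed; the inline YAML strings simply mirror the two snippets above, and the variable names are only for illustration.

```r
library(yaml)

# Mixed types: 100 parses as an integer, 95.9 as a float
mixed <- yaml.load("MAT:\n - 100\n - 95.9\n")

# Uniform types: both elements parse as floats
uniform <- yaml.load("MAT:\n - 100.0\n - 95.9\n")

# Per the behaviour described above, the mixed sequence should come back
# as a list and the uniform one as a numeric vector
is.list(mixed$MAT)
is.numeric(uniform$MAT)
```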

I fixed this problem in the entries you mentioned. Although the problem is fixed, we may think about implementing a test for this case.
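One possible shape for such a test, sketched in base R. The function name find_list_fields and the example entry are hypothetical and not part of sidb; the idea is only to walk a loaded entry and report every field that came back as an unnamed list of scalars where a vector was expected.

```r
# Hypothetical helper: return the paths of all fields that loaded as an
# unnamed list of scalars (or NULLs) instead of an atomic vector.
find_list_fields <- function(x, path = "entry") {
  if (!is.list(x)) return(character(0))
  is_scalar <- function(e) is.null(e) || (is.atomic(e) && length(e) == 1)
  if (is.null(names(x)) && length(x) > 0 &&
      all(vapply(x, is_scalar, logical(1)))) {
    return(path)
  }
  # Otherwise descend into each child, extending the path
  keys <- if (is.null(names(x))) seq_along(x) else names(x)
  hits <- character(0)
  for (i in seq_along(x)) {
    hits <- c(hits, find_list_fields(x[[i]], paste0(path, "$", keys[i])))
  }
  hits
}

# Hand-built stand-in for a loaded entry (not real sidb data)
entry <- list(
  siteInfo = list(MAT = list(100L, 95.9), MAP = c(300, 450)),
  variables = list(V1 = list(name = "time"))
)
find_list_fields(entry)
```

On this stand-in entry only the MAT field is reported, since MAP is already a numeric vector. A check like this could be run over every template after loading, with a non-empty result treated as a failure.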

jb388 commented 5 years ago

Good find, Carlos. Unfortunately, NULL values also cause this problem, cf. Haddix2011 and Bradford2010 (a NULL value in the MAP field in the former, and NULL values in the landCover and vegNote fields in the latter).

I was able to work around this by adding a "handlers" argument to the readEntries fx:

handlers=list("float#fix"=function(x) as.character(x))

This obviously converts all float values to character, but it is pretty easy to fix using the type.convert fx after the yaml file has been loaded in R. This fix works for the NULL value in the MAP array of Haddix2011. Unfortunately I haven't been able to figure out how to write a handler function to convert integers into characters, so in order for the above fix to work the MAP values have to be entered with a floating point (e.g. "300.0", not "300").
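The post-load step can be sketched with base R's type.convert(). The character vector below is a hypothetical stand-in for what a MAP array might look like once floats are read as character with a missing value written as "NA"; it is not output from an actual sidb entry, and convert_entry is an illustrative helper, not part of readEntries.

```r
# Stand-in for a MAP array loaded as character (assumed values)
map_chr <- c("300.0", "NA", "1250.5")

# type.convert() picks the narrowest suitable type; "NA" becomes a missing value
map_num <- type.convert(map_chr, as.is = TRUE)

# Hypothetical helper: apply the same conversion to every character
# vector in a nested entry, leaving other fields untouched
convert_entry <- function(x) {
  if (is.list(x)) {
    lapply(x, convert_entry)
  } else if (is.character(x)) {
    type.convert(x, as.is = TRUE)
  } else {
    x
  }
}

entry_chr <- list(siteInfo = list(MAP = map_chr, landCover = c("forest", "NA")))
entry_fixed <- convert_entry(entry_chr)
str(entry_fixed)
```

Note that this converts any character field whose values all look numeric, so a field that is meant to stay as text (for example an identifier made only of digits) would need to be excluded.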

The NULL values in the string fields (landCover, vegNote) can be replaced with "NA".

crlsierra commented 5 years ago

We can require users to enter integer values with a decimal point. I don't know of another way to get around it.

jb388 commented 5 years ago

Agreed. I would guess it will be uncommon to have missing data in one of those arrays, e.g. data for some sites and not others.