Closed: cboettig closed this issue 6 years ago
@cboettig
Greetings! I have just pushed the file to the repo. I modified the example you provided to extract the names of the creators. Please take a look at it at your convenience and see if those are correct.
I am also a bit confused about the id and type. In earthcube.json the id is a URL and the type is something like "Person" or "PropertyValue", but here they are mapped to the file name and null, respectively. There is also nothing in the EML file that seems to correspond to the id and type.
p.s. For some reason, the first time I committed, all the other files were deleted, so I added them back.
Thank you!
@AlexLi0104
Looks like a good start. Good question about `@id` and `@type`. I have some concrete advice below, but it will definitely help to check out https://www.youtube.com/watch?v=vioCbTo3C-4 to get a better sense of how we use these two special elements. (You can also check out the official spec: https://json-ld.org/spec/latest/json-ld/)
You must have created a merge conflict somehow and then done a force push that erased the files; let's try to avoid that in future.
A few changes we want to make to this:
[ ] Rather than creating new files, try modifying https://github.com/boettiger-lab/eml2schema/blob/master/Notebooks/jq/eml_to_schema.jq directly to include the creator as well as the bits that are already there which get the "temporalCoverage" and "spatialCoverage".
[ ] A Creator can be either a "Person" or an "Organization", as it says under the 'creator' option for Dataset: http://schema.org/Dataset . So for `@type`, you want this to be `Person` whenever you see `individualName`, and to be `Organization` if the EML creator has an `organizationName` instead of an `individualName`. Does that make sense?
[ ] Using `name` is okay (I see that's what they did in EarthCube), but it's not the best choice. Instead, look up the relevant type on Schema.org; in this case, http://schema.org/Person tells us the fields available for a Person. Note that among the options are `givenName` and `familyName`, which are more precise. So you can edit your example to use these.
[ ] In this example, we don't have an `@id` given for the creators. Your code is getting the `@id` that belongs to the whole Dataset. So instead, your code should look for an `@id` that is part of the creator element, which might look like this:

    "creator": {
      "@id": "some_id",
      "individualName": {
        "givenName": "Aaron",
        "surName": "Ellison"
      }
    }
Your query will just create a `null` value for `@id` if none is found.
Make sense? Have a stab at this and ping me again.
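
For readers following along, the checklist above can be sketched outside of jq. The snippet below is only an illustration in Python, not the `eml_to_schema.jq` code from the repository: the function name `map_creator` and the organization name "Example Org" are hypothetical, and it assumes the creator has already been converted to a JSON-like dict in the shape shown above.

```python
def map_creator(eml_creator):
    """Map one EML creator (a dict) to a schema.org-style dict.

    Hypothetical helper for illustration; not the repository's jq map.
    """
    # Use the @id found on the creator itself, not the Dataset's @id.
    # dict.get returns None when the key is absent, which mirrors the
    # null the query would produce when no @id is found.
    out = {"@id": eml_creator.get("@id")}

    if "individualName" in eml_creator:
        name = eml_creator["individualName"]
        out["@type"] = "Person"
        out["givenName"] = name.get("givenName")
        out["familyName"] = name.get("surName")  # EML surName -> schema.org familyName
    elif "organizationName" in eml_creator:
        out["@type"] = "Organization"
        out["name"] = eml_creator["organizationName"]
    return out


person = map_creator(
    {"@id": "some_id", "individualName": {"givenName": "Aaron", "surName": "Ellison"}}
)
organization = map_creator({"organizationName": "Example Org"})  # made-up name
print(person)
print(organization)
```

The `@type` is decided purely from which name element is present, which is the rule described in the second checklist item.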
@cboettig
Thank you for the links that you provided! I have made several changes:
I added the code for the creator element directly into the eml_to_schema.jq file.
I changed the keys of the creator to `givenName` and `familyName`.
I also tried to make changes to `type` based on whether `individualName` or `organizationName` is present, but I couldn't get the if statement to work. I followed the syntax from the jq manual but it still shows a syntax error. I also tried the R syntax and it also didn't work. Would you please take a look and see what's wrong?
I will also take a look at task 2 tonight. Thank you!
@AlexLi0104 Good work. I've just pushed a few edits to solve this. You'll note the solution is actually simpler than you were thinking, because all properties of an object have to be at the same level. That is, anything inside a pair of `{ }` in JSON-LD is a specific object, and every object can have its own type and its own id. In EML, we don't have explicit types for them, so this is a bit confusing. Usually the type is obvious, but we include it anyway. My earlier example had left `type` off on the `Place` object and `geo` object, so I've added those back in as well.
That is, we are documenting a Dataset that has some creators, then both the Dataset and the creator have their own types:

    {
      "type": "Dataset",
      "creator": {
        "type": "Person",
        ...
      },
      "spatialCoverage": {
        "type": "Place",
        ...
      }
    }

You were putting the `creator` type outside of the `creator` object, which is actually where the `type` for Dataset belongs. Does this make sense? Things will seem easier once you get this basic object model down.
Minor side note: you've probably noticed sometimes we include the `@` on `@type` and `@id` and sometimes omit it: the `@` is there to indicate that this is a special JSON-LD term. Schema.org defines these as the same thing, literally, `"id": "@id"` and `"type": "@type"`, so when we're using Schema.org we have the option of omitting them. (EML doesn't have many `@type` or `@id` declarations, but we should use the `@` there to be explicit).
Also, note that I've removed some of the additional files we don't need, and I've added the output file. It should be sufficient to continue editing the `eml_to_schema.jq` and `schema_to_eml.jq` files and rerunning the `jq_maps.Rmd` in knitr to see how the output is improved by the additions you keep making to the `jq` maps.
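
To make the side note about `type` versus `@type` concrete, here is a small Python sketch. It is an illustration only and assumes just the two aliases mentioned above (`"id": "@id"` and `"type": "@type"`). It is not a JSON-LD processor, and the names `ALIASES` and `expand_keywords` are made up for this example. It also shows the object model from the comment: each nested object carries its own type.

```python
# The two aliases quoted above; nothing else is rewritten.
ALIASES = {"id": "@id", "type": "@type"}


def expand_keywords(node):
    """Recursively rename aliased keys to their explicit JSON-LD keywords."""
    if isinstance(node, dict):
        return {ALIASES.get(key, key): expand_keywords(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_keywords(item) for item in node]
    return node  # strings, numbers, None are returned unchanged


dataset = {
    "type": "Dataset",
    "creator": {"type": "Person", "givenName": "Aaron", "familyName": "Ellison"},
    "spatialCoverage": {"type": "Place"},
}

expanded = expand_keywords(dataset)
print(expanded)
```

After the rename, the Dataset, the creator, and the spatial coverage each have their own `@type` key at their own level.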
@cboettig
Oh I see! Thank you for pointing that out! Just a small question: in the code you posted the types are added manually, so if the creator is changed to an organization for some reason, the type would also need to be changed manually. Is there a way to use `if` and `elif` statements to cover all possible types of creators (and other objects), or is that actually more troublesome?
p.s. I also just submitted the URAP application on the website.
Thank you!
@AlexLi0104 Good question. Well, since we're passing `individualName` to that chunk, it will not create a name for an `Organization`, though it will still create an empty `Person` object, so that's not ideal. Right, we can do this with `if` and `elif`; I've just pushed an example of that.
Still, my example is probably not perfect; we'll probably need to add additional logic to deal with certain cases. I've also just modified the `hf205` example to include an `Organization` with an `address` in the creator list. Have at it and see what you come up with; it might need a good deal of experimentation with various `jq` functions, or perhaps using multiple `jq` calls, etc.
@cboettig
I just pushed the jq function with `organization` and `address` in it. Since `individualName` is under `creator` and `organizationName` is under `contact`, I used several `if` statements and the `unique` and `sort` functions to put them all under the creator list. The code does seem a bit cumbersome. I also changed `null` to `empty`, so that the "null" doesn't show up in the output.
Please take a look and let me know what can be improved.
Thank you very much!
@AlexLi0104 nice work on `empty`, that's much cleaner. You're making great progress on mastering the `jq` syntax here, nice use of `unique` and `if` statements.
Unfortunately, you cannot just assume that all contacts are Organizations and that only Organizations have addresses, etc. The key concept here is to get used to thinking in terms of object classes. This applies equally to Schema.org and EML. We don't want to code something that works just for this one single example file and nothing else.
In general (i.e. in both schema.org and EML), both a `creator` and a `contact` take objects that are either `Person` or `Organization`. Organizations and People can both have names, addresses, etc. In schema.org you can tell something is an Organization and not a Person because of the `@type` (though you'll also see it has lots of properties that are not shared with Person; compare http://schema.org/Organization and http://schema.org/Person). In EML, we don't have the convenience of `@type`, so you have to decide based on the presence of `organizationName` or `individualName` alone. We could make our example richer by adding addresses to some of the creators, but obviously we can't illustrate all possible configurations in an example, so we really have to consult the original specifications directly for that.
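
One way to picture the "object class" idea is to write the Person/Organization decision once and reuse it for every field that can hold such an object. The Python sketch below is a rough illustration under several assumptions, not the jq map in the repository: the helper names are hypothetical, the output is keyed by the EML field name rather than by a chosen schema.org property, the address is copied through unmapped, and "Example Org" is a made-up name. Dropping `None` values plays the role that `empty` plays in jq.

```python
def as_list(value):
    """EML JSON may hold one object or a list of them; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def map_party(party):
    """Decide Person vs Organization from the name element alone."""
    if "individualName" in party:
        name = party["individualName"]
        out = {
            "@type": "Person",
            "givenName": name.get("givenName"),
            "familyName": name.get("surName"),
        }
    elif "organizationName" in party:
        out = {"@type": "Organization", "name": party["organizationName"]}
    else:
        return None  # e.g. a party with only positionName; needs more logic
    # Either class may carry an address; copied as-is in this sketch.
    out["address"] = party.get("address")
    # Drop missing values instead of emitting nulls.
    return {key: value for key, value in out.items() if value is not None}


def collect_parties(eml_dataset, fields=("creator", "contact")):
    """Apply map_party to every listed field, keyed by the EML field name."""
    result = {}
    for field in fields:
        mapped = [map_party(party) for party in as_list(eml_dataset.get(field))]
        mapped = [item for item in mapped if item is not None]
        if mapped:
            result[field] = mapped
    return result


example = {
    "creator": {"individualName": {"givenName": "Aaron", "surName": "Ellison"}},
    "contact": [
        {"organizationName": "Example Org",
         "address": {"deliveryPoint": "324 North Main Street"}},
        {"positionName": "Information Manager"},
    ],
}
parties = collect_parties(example)
print(parties)
```

Because the field list is a parameter, other EML fields that take the same kind of object can be added without touching `map_party`.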
@cboettig
I just pushed the modified version that adds organization and address. This code basically screens for `creator` and `contact` (I assume these are the only two that take `Person` or `Organization` objects), and then it checks whether `individualName` or `organizationName` is present. If one is, it further checks whether `address` is present. I think this should apply to different documents (though I am not sure whether this is what you expected). There are several parts that I don't quite understand:
I have not yet figured out how to look directly for `individualName` and `organizationName` without going through `creator` and `contact` first. But I will keep trying during the weekend.
I don't understand why there's a `University of California` in the output, since it doesn't exist in the `hf205` document.
The address `324 North Main Street` is associated with both a person and an organization under `contact`, and I am not sure whether it should be displayed twice (as I did in the code).
Thank you very much, and have a nice weekend!
Nice, I'll take a look. Remember to just re-knit the `jq_maps.Rmd` and then commit the updated `jq_maps.md`; that's an easy way for me to see what the current output looks like on our examples.
Unfortunately, there are indeed other fields that can also take Person/Organization objects, though `creator` and `contact` are certainly the most common. At this stage it would be good just to spend a bit more time getting familiar with the EML schema as described here: https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/./eml.html
EML is technically an XML-based representation, though it's usually pretty easy to see how it should look in JSON. For example,

    <individualName>
      <givenName>Bob</givenName>
      <surName>Smith</surName>
    </individualName>

becomes

    "individualName": {
      "givenName": "Bob",
      "surName": "Smith"
    }

You can use the https://github.com/cboettig/emld R package to turn the XML versions of EML into JSON. You might start by translating this list of test files: https://github.com/cboettig/emld/tree/master/inst/tests from XML into JSON. Eventually, if we can map all of those into schema.org we will have at least covered the vast majority of typical EML files.
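
For intuition only, the XML-to-JSON correspondence in the example above can be reproduced with a few lines of Python's standard library. This is a toy sketch and not how the `emld` package works: it ignores XML attributes and namespaces, and the function name `element_to_json` is made up for this example.

```python
import json
import xml.etree.ElementTree as ET


def element_to_json(elem):
    """Convert an element to a string (leaf) or a dict of its children."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    out = {}
    for child in children:
        value = element_to_json(child)
        if child.tag in out:
            # A repeated child tag becomes a JSON list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out


xml_text = (
    "<individualName>"
    "<givenName>Bob</givenName>"
    "<surName>Smith</surName>"
    "</individualName>"
)
root = ET.fromstring(xml_text)
converted = {root.tag: element_to_json(root)}
print(json.dumps(converted, indent=2))
```

The printed structure matches the `"individualName"` object shown in the comment above; real EML files need the fuller handling that `emld` provides.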
@cboettig
Greetings! I ran into a problem when I was trying to convert XML to JSON. I was running
as_emld('inst/tests/citation-sbclter-bibliography.50.xml')
and the error says
Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
(right here) ------^
Called from: parse_con(txt, bigint_as_char)
and the code that has the problem is the following:

    function (con, bigint_as_char)
    {
        stopifnot(inherits(con, "connection"))
        if (!isOpen(con)) {
            on.exit(close(con))
            open(con, "rb")
        }
        .Call(R_parse_connection, con, bigint_as_char)
    }

What I did was try using as_emld to convert XML to emld, and then using as_json on the resulting object to get JSON. I can't figure out what went wrong above (since I really don't know what the code means). Would you please take a look at it and let me know what I did wrong? In the meantime I will look at translating from Schema.org to EML.
Thank you very much!
@AlexLi0104 thanks for the update. I can't reproduce that error unfortunately. Can you try:
emld::as_emld(system.file('tests/citation-sbclter-bibliography.50.xml', package='emld'))
Also, can you let me know the output of `sessionInfo()` after you run that? (You might try updating your packages.) Does that error happen on all the files or just that one?
@cboettig
The error happens for all files. When I ran the line the error says:
Error in loadNamespace(name) : there is no package called ‘emld’
And when I tried to install the emld package it says that it is not available for R 3.3.3. I wonder if there's a requirement on the R version and that maybe I should install a later version of R (I will try that and let you know whether it works)? Also, `sessionInfo()` gives:
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] jqr_1.0.0 yaml_2.1.14 jsonlite_1.5 jsonld_1.2 xml2_1.1.1
loaded via a namespace (and not attached):
[1] readr_1.1.1 magrittr_1.5 lazyeval_0.2.0 R6_2.2.2 hms_0.3 tools_3.3.3
[7] tibble_1.3.4 curl_3.0 V8_1.5 Rcpp_0.12.12 knitr_1.17 rlang_0.1.6
Please let me know what I should do next. Thank you very much!
@AlexLi0104 Ah, that would explain it. You will have to install the `emld` package first; it is not on the CRAN repository yet so `install.packages()` fails, as you saw. However, you can install the package directly from the GitHub repo by using a function from the `devtools` package (which is on CRAN), as described in the README: https://github.com/cboettig/emld#emld
@cboettig
Thanks for the tip, the problem is solved! I just pushed all the translated JSON files (in the `output` folder) to `eml2schema`. I wasn't able to push them to `emld` since there's an error that says permission denied.
Please take a look at your convenience and let me know if those files are translated correctly.
Thank you very much!
@AlexLi0104
Nice work, looks good! It might make more sense to put these in a folder called `eml` inside the current `examples` folder (since they are examples of EML JSON markup). Other than that these are great, and should be helpful for your further development and testing of the `eml2schema.jq` mapping.
Feel free to work more on that or on the `schema2eml.jq` (should be a bit easier, though we have only one input document, and you'll have to spend some time getting familiar with what the resulting EML JSON files should look like -- the examples you just made should be helpful for that too).
Forge ahead and ping me with any questions! Also if there's anything that would be easier to go through in person, let me know and we can set up a meeting time.
@cboettig
Greetings! Sorry that I was very busy last week and wasn't able to do much work. I am freer this week and will definitely work more on the project. I uploaded the modified `schema2eml.jq` file last week (with the added `creator` element). Would you please take a look at it at your convenience and let me know whether that is the correct way to do it (I still feel a bit confused and unsure about what the code should look like)?
Other than that, should I also be including more elements in the `eml2schema.jq`, since there are many files that I transformed into EML JSON earlier last week?
Thank you very much!
Yup, forge ahead on the `eml2schema.jq` as well. We don't have lots of examples that use Schema.org, but you'll find the Schema.org documents are pretty easy to read. For instance, just go to http://schema.org/Person to see all the possible fields for a `Person`, or http://schema.org/Dataset to see the possible fields for a `Dataset` in schema.org. (Usually if you scroll to the bottom you can also see examples in JSON-LD for these.)
Since we're past 'getting started' I'll close this issue out, and I'll close #2 and open a new generic issue for any questions on EML to Schema. (we already have #3 for Schema to EML)
@AlexLi0104
Clone this repository and you should be able to edit the `jq_maps.Rmd` file directly in RStudio. Mostly you'll be developing the associated `.jq` script to define the query. To get started, I'd recommend trying to add a map for the `creator` element in going from EML to Schema.org.