Starting task #1 (boettiger-lab/eml2schema)

cboettig commented 6 years ago

@AlexLi0104

Clone this repository and you should be able to edit the jq_maps.Rmd file directly in RStudio. Mostly you'll be developing the associated .jq script to define the query.

To get started, I'd recommend trying to add a map for the creator element in going from EML to Schema.org.

AlexLi0104 commented 6 years ago

@cboettig

Greetings! I have just pushed the file to the repo. I modified the example you provided to extract the names of the creators. Please take a look at it at your convenience and see if those are correct.

I am also a bit confused about the id and type. In earthcube.json the id is a URL and the type is something like "Person" or "PropertyValue", but here they are mapped to the file name and null, respectively. There is also nothing in the EML file that seems to correspond to the id and type.

P.S. For some reason, the first time I committed, all the other files were deleted, so I added them back.

Thank you!

cboettig commented 6 years ago

@AlexLi0104

Looks like a good start. Good question about @id and @type. I have some concrete advice below, but it will definitely help to check out https://www.youtube.com/watch?v=vioCbTo3C-4 to get a better sense of how we use these two special elements. (You can also check out the official spec: https://json-ld.org/spec/latest/json-ld/)

You must have created a merge conflict somehow and then done a force push that erased the files. Let's try to avoid that in future.

A few changes we want to make to this:

[ ] Rather than creating new files, try modifying https://github.com/boettiger-lab/eml2schema/blob/master/Notebooks/jq/eml_to_schema.jq directly to include the creator as well as the bits that are already there which get the "temporalCoverage" and "spatialCoverage".
[ ] A Creator can be either a "Person" or an "Organization", as it says under the 'creator' option for Dataset: http://schema.org/Dataset . So for @type, you want this to be Person whenever you see individualName, and to be Organization if the EML creator has an organizationName instead of an individualName. Does that make sense?
[ ] Using name is okay (I see that's what they did in EarthCube), but it's not the best choice. Instead, look up the relevant type on Schema.org, in this case, http://schema.org/Person tells us the fields available for a Person. Note that among the options are givenName and familyName, which are more precise. So you can edit your example to use these.
[ ] In this example, we don't have an @id for the creators given. Your code is getting the @id that belongs to the whole Dataset. So instead, your code should look for an @id that is part of the creator element, which might look like this:

"creator": 
      { "@id": "some_id",
        "individualName": {
          "givenName": "Aaron",
          "surName": "Ellison"
        }
      }

Your query will just create a null value for @id if none is found.
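To make the checklist above concrete, here is a minimal sketch of the same logic in Python. It is only an illustration, not the jq that lives in eml_to_schema.jq: the function name `map_creator` is hypothetical, and it assumes the EML JSON shape shown in the example above, with `organizationName` given as a plain string.

```python
def map_creator(creator):
    """Map one EML creator (a dict) to a schema.org-style dict.

    Assumption: `creator` looks like the EML JSON fragment above.
    """
    # Take @id from the creator itself, not from the Dataset.
    # dict.get returns None (JSON null) when no @id is present.
    out = {"@id": creator.get("@id")}
    if "individualName" in creator:
        name = creator["individualName"]
        out["@type"] = "Person"
        out["givenName"] = name.get("givenName")
        out["familyName"] = name.get("surName")  # EML surName -> schema.org familyName
    elif "organizationName" in creator:
        out["@type"] = "Organization"
        out["name"] = creator["organizationName"]
    return out


example = {
    "@id": "some_id",
    "individualName": {"givenName": "Aaron", "surName": "Ellison"},
}
print(map_creator(example))
```

The same branching carries over to jq with if / elif / else / end, keyed on whether individualName or organizationName is present.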

Make sense? Have a stab at this and ping me again.

AlexLi0104 commented 6 years ago

@cboettig

Thank you for the links that you provided! I have made several changes:

I added the code for the creator element directly into the eml_to_schema.jq file.
I changed the keys of the creator to givenName and familyName.
I also tried to make changes to type based on whether individualName or organizationName is present, but I couldn't get the if statement to work. I followed the syntax from the jq manual but it still shows a syntax error. I also tried the R syntax and that didn't work either. Would you please take a look and see what's wrong?

I will also take a look at task 2 tonight. Thank you!

cboettig commented 6 years ago

@AlexLi0104 Good work. I've just pushed a few edits to solve this. You'll note the solution is actually simpler than you were thinking, because all properties of an object have to be at the same level. That is, anything inside a pair of { } in JSON-LD is a specific object, and every object can have its own type and its own id. In EML, we don't have types for them explicitly, so this is a bit confusing. Usually the type is obvious, but we include it anyway. My earlier example had left type off on the Place object and geo object, so I've added those back in as well.

That is, we are documenting a Dataset that has some creators, then both the Dataset and the creator have their own types:

{
  "type": "Dataset",
  "creator": {
    "type": "Person",
    ...
  },
  "spatialCoverage": {
    "type": "Place",
    ...
  }
}

You were putting the creator type outside of the creator object, which is actually where the type for Dataset belongs. Does this make sense? Things will seem easier once you get this basic object model down.

Minor side note: you've probably noticed sometimes we include the @ on @type and @id and sometimes omit it: the @ is there to indicate that this is a special JSON-LD term. Schema.org defines these as the same thing, literally, "id": "@id" and "type": "@type", so when we're using Schema.org we have the option of omitting them. (EML doesn't have many @type or @id declarations, but we should use the @ there to be explicit).
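As a rough illustration of that aliasing, the Python sketch below rewrites the short keys to their @-prefixed forms throughout a nested document. It is not a JSON-LD processor; the `ALIASES` table and the `expand_keywords` helper are hypothetical names and only cover the two aliases mentioned above.

```python
# The two aliases mentioned above: "id" -> "@id", "type" -> "@type".
ALIASES = {"id": "@id", "type": "@type"}


def expand_keywords(node):
    """Recursively rename aliased keys to their @-prefixed JSON-LD keywords."""
    if isinstance(node, dict):
        return {ALIASES.get(key, key): expand_keywords(value)
                for key, value in node.items()}
    if isinstance(node, list):
        return [expand_keywords(item) for item in node]
    return node  # strings, numbers, None are left untouched


doc = {"type": "Dataset", "creator": {"type": "Person", "id": "some_id"}}
print(expand_keywords(doc))
```

Keys that already carry the @ are left alone, so running this on a document that mixes both spellings gives a consistent result.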

Also, note that I've removed some of the additional files we don't need, and I've added the output file. It should be sufficient to continue editing the eml_to_schema.jq and schema_to_eml.jq files and rerunning jq_maps.Rmd in knitr to see how the output is improved by the additions you keep making to the jq maps.

AlexLi0104 commented 6 years ago

@cboettig

Oh I see! Thank you for pointing that out! Just a small question: in the code you posted the types are added manually, so if the creator is changed to an organization for some reason, then the type would also need to be changed manually. Is there a way to use if and elif statements to cover all possible types of creators (and other objects), or is that actually more troublesome?

p.s. I also just submitted the URAP application on the website.

Thank you!

cboettig commented 6 years ago

@AlexLi0104 Good question. Well, since we're passing individualName to that chunk, it will not create a name for an Organization, though it will still create an empty Person object, so that's not ideal. Right, we can do this with if and elif. I've just pushed an example of that.

Still, my example is probably not perfect; we'll probably need to add additional logic to deal with certain cases. I've also just modified the hf205 example to include an Organization with an address in the creator list. Have at it and see what you come up with. It might need a good deal of experimentation with various jq functions, or perhaps multiple jq calls, etc.

AlexLi0104 commented 6 years ago

@cboettig

I just pushed the jq function with organization and address in it. Since individualName is under creator and organizationName is under contact, I used several if statements and the unique and sort functions to put them all under the creator list. The code does seem a bit cumbersome. I also changed null to empty, so that the "null" doesn't show up in the output.

Please take a look and let me know what can be improved.

Thank you very much!

cboettig commented 6 years ago

@AlexLi0104 nice work on empty, that's much cleaner. You're making great progress on mastering the jq syntax here, nice use of unique and if statements.

Unfortunately, you cannot just assume that all contacts are Organizations and that only Organizations have addresses, etc. The key concept here is to get used to thinking in terms of object classes. This applies equally to Schema.org and EML. We don't want to code something that works just for this one single example file and nothing else.

In general (i.e. in both schema.org and EML), both creator and contact take objects that are either Person or Organization. Organizations and People can both have names, addresses, etc. In schema.org you can tell something is an Organization and not a Person because of the @type (though you'll also see it has lots of properties that are not shared with Person; compare http://schema.org/Organization and http://schema.org/Person). In EML, we don't have the convenience of @type, so you have to decide based on the presence of organizationName or individualName alone. We could make our example richer by adding addresses to some of the creators, but obviously we can't illustrate all possible configurations in an example, so we really have to consult the original specifications directly for that.
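One way to picture this class-based approach is the Python sketch below: a single mapping function used for any field that holds a Person or Organization, with the type decided only by individualName versus organizationName. This is an illustration and not the repo's jq. All function names and the sample `dataset` are hypothetical, and it assumes a field may hold either one object or a list, that an entry carrying both names is treated as a Person, and that only the EML address sub-fields listed are mapped.

```python
def as_list(value):
    """Treat a missing value as [], a single object as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def drop_none(d):
    """Omit keys with no value (like jq's `empty`) instead of emitting null."""
    return {k: v for k, v in d.items() if v is not None}


def map_address(addr):
    if addr is None:
        return None
    return drop_none({
        "@type": "PostalAddress",
        "streetAddress": addr.get("deliveryPoint"),
        "addressLocality": addr.get("city"),
        "addressRegion": addr.get("administrativeArea"),
        "postalCode": addr.get("postalCode"),
        "addressCountry": addr.get("country"),
    })


def map_party(party):
    """Map one EML party to a Person or Organization, whichever field it came from."""
    if "individualName" in party:
        name = party["individualName"]
        out = {"@type": "Person",
               "givenName": name.get("givenName"),
               "familyName": name.get("surName")}
    elif "organizationName" in party:
        out = {"@type": "Organization", "name": party["organizationName"]}
    else:
        return None  # neither name present: nothing to classify
    out["@id"] = party.get("@id")
    out["address"] = map_address(party.get("address"))  # either type may have one
    return drop_none(out)


def map_parties(dataset, field):
    mapped = (map_party(p) for p in as_list(dataset.get(field)))
    return [m for m in mapped if m is not None]


dataset = {  # hypothetical sample input
    "creator": {"individualName": {"givenName": "Aaron", "surName": "Ellison"}},
    "contact": [
        {"organizationName": "Example Organization",
         "address": {"deliveryPoint": "324 North Main Street"}},
        {"positionName": "Data Manager"},
    ],
}
print(map_parties(dataset, "creator"))
print(map_parties(dataset, "contact"))
```

Because `map_party` never looks at the field name, the same function can be reused for any other EML field that takes these objects.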

AlexLi0104 commented 6 years ago

@cboettig

I just pushed the modified version that adds organization and address. This code basically screens for creator and contact (I assume these are the only two that take Person or Organization objects), and then it screens for whether individualName or organizationName is present. If one is, it further screens for whether an address is present. I think this should apply to different documents (though I am not sure whether this is what you expected). There are several parts that I don't quite understand:

I have not yet figured out how to directly look for individualName and organizationName, without going through creator and contact first. But I will keep trying during the weekend.
I don't understand why there's a University of California in the output, since it doesn't exist in the hf205 document.
The address 324 North Main Street is associated with both a person and an organization under contact, and I am not sure whether it should be displayed twice (like what I did in the code).

Thank you very much, and have a nice weekend!

cboettig commented 6 years ago

Nice, I'll take a look. Remember to just re-knit the jq_maps.Rmd and then commit the updated jq_maps.md, that's an easy way for me to see what the current output looks like on our examples.

Unfortunately, there are indeed other fields that can also take Person/Organization objects, though creator and contact are certainly the most common. At this stage it would be good just to spend a bit more time getting familiar with the EML schema as described here: https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/./eml.html

EML is technically an XML-based representation, though it's usually pretty easy to see how it should look in JSON. For example,

<individualName>
  <givenName>Bob</givenName>
  <surName>Smith</surName>
</individualName>

becomes

"individualName": {
  "givenName": "Bob",
  "surName": "Smith"
}
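The XML-to-JSON correspondence above can be sketched in a few lines of Python using only the standard library. This is a simplified illustration, not what emld does internally: the helper `element_to_dict` is a hypothetical name, and it ignores XML attributes and namespaces.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element's children to nested dicts; leaf elements become strings."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    out = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in out:
            # A repeated tag (e.g. several creators) becomes a JSON list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out


xml_text = """<individualName>
  <givenName>Bob</givenName>
  <surName>Smith</surName>
</individualName>"""

root = ET.fromstring(xml_text)
result = {root.tag: element_to_dict(root)}
print(json.dumps(result, indent=2))
```

For real EML files the emld package remains the right tool, since it also handles attributes, namespaces, and the JSON-LD context.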

You can use the https://github.com/cboettig/emld R package to turn the XML versions of EML into JSON. You might start by translating this list of test files: https://github.com/cboettig/emld/tree/master/inst/tests from XML into JSON. Eventually, if we can map all of those into schema.org we will have at least covered the vast majority of typical EML files.

AlexLi0104 commented 6 years ago

@cboettig

Greetings! I ran into a problem when I was trying to convert XML to JSON. I was running

as_emld('inst/tests/citation-sbclter-bibliography.50.xml')

and the error says

Error in parse_con(txt, bigint_as_char) : parse error: premature EOF

                     (right here) ------^
Called from: parse_con(txt, bigint_as_char)

and the code that has the problem is the following:

function (con, bigint_as_char)
{
  stopifnot(inherits(con, "connection"))
  if (!isOpen(con)) {
    on.exit(close(con))
    open(con, "rb")
  }
  .Call(R_parse_connection, con, bigint_as_char)
}

What I did was try using as_emld to convert the XML to emld, and then using as_json on the resulting object to get JSON. I can't figure out what went wrong above (since I really don't know what the code means). Would you please take a look at it and let me know what I did wrong? In the meantime I will look at translating from Schema.org to EML.

Thank you very much!

cboettig commented 6 years ago

@AlexLi0104 thanks for the update. I can't reproduce that error unfortunately. Can you try:

emld::as_emld(system.file('tests/citation-sbclter-bibliography.50.xml', package='emld'))

Also let me know the output of sessionInfo() after you run that? (you might try updating your packages?) Does that error happen on all the files or just that one?

AlexLi0104 commented 6 years ago

@cboettig

The error happens for all files. When I ran the line the error says:

Error in loadNamespace(name) : there is no package called ‘emld’

And when I tried to install the emld package, it says that it is not available for R 3.3.3. I wonder if there's a requirement on the R version and whether I should install a later version of R (I will try that and let you know whether it works). Also, sessionInfo() gives:

Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] jqr_1.0.0    yaml_2.1.14  jsonlite_1.5 jsonld_1.2   xml2_1.1.1  

loaded via a namespace (and not attached):
 [1] readr_1.1.1    magrittr_1.5   lazyeval_0.2.0 R6_2.2.2       hms_0.3        tools_3.3.3   
 [7] tibble_1.3.4   curl_3.0       V8_1.5         Rcpp_0.12.12   knitr_1.17     rlang_0.1.6

Please let me know what I should do next. Thank you very much!

cboettig commented 6 years ago

@AlexLi0104 Ah, that would explain it. You will have to install the emld package first; it is not on the CRAN repository yet, so install.packages() fails, as you saw. However, you can install the package directly from the GitHub repo by using a function from the devtools package (which is on CRAN), as described in the README: https://github.com/cboettig/emld#emld

AlexLi0104 commented 6 years ago

@cboettig

Thanks for the tip and the problem is solved! I just pushed all the translated json files (in the output folder) to eml2schema. I wasn't able to push it to emld since there's an error that says permission denied.

Please take a look at your convenience and let me know if those files are translated correctly.

Thank you very much!

cboettig commented 6 years ago

@AlexLi0104

Nice work, looks good! It might make more sense to put these in a folder called eml inside the current examples folder (since they are examples of EML JSON markup). Other than that these are great, and should be helpful for your further development and testing of the eml2schema.jq mapping.

Feel free to work more on that or on the schema2eml.jq (should be a bit easier, though we have only one input document, and you'll have to spend some time getting familiar with what the resulting EML json files should look like -- the examples you just made should be helpful for that too).

Forge ahead and ping me with any questions! Also if there's anything that would be easier to go through in person, let me know and we can set up a meeting time.

AlexLi0104 commented 6 years ago

@cboettig

Greetings! Sorry that I was very busy last week, and wasn't able to do much work. I am more free this week and will definitely work more on the project. I have uploaded the modified schema2eml.jq file last week (with the added creator element). Would you please take a look at it at your convenience and let me know whether that is the correct way to do it (I still feel a bit confused and unsure about what the code should look like)?

Other than that, should I also be including more elements in the eml2schema.jq, since there are many files that I transformed into eml-JSON earlier last week?

Thank you very much!

cboettig commented 6 years ago

Yup, forge ahead on the eml2schema.jq as well. We don't have lots of examples that use Schema.org, but you'll find the Schema.org documents are pretty easy to read. For instance, just go to http://schema.org/Person to see all the possible fields for a Person, or http://schema.org/Dataset to see the possible fields for a Dataset in schema.org. (Usually, if you scroll to the bottom, you can also see examples in JSON-LD for these.)

Since we're past 'getting started' I'll close this issue out, and I'll close #2 and open a new generic issue for any questions on EML to Schema. (we already have #3 for Schema to EML)
