AtesComp / rdf-transform

RDF Transform is an extension for OpenRefine to transform data into RDF formats.

Run transformation in batch mode #14

Closed tjroamer closed 2 years ago

tjroamer commented 2 years ago

I found this extension very useful. We have a very large set of XML files that share the same structure, so one mapping file can apply to them all. However, it would be tedious to load them into OpenRefine manually one by one. Is there a batch mode that allows us to run the transformation from the command line? Thanks.

AtesComp commented 2 years ago

Indeed, this would be very useful. My current response is to direct you to the OpenRefine API, which allows for just such a thing. In the code, you can look at the server-side commands to see the GET and POST request and response elements. I haven't specifically documented them for API use, but it should be doable.

Let me know if you have any success or trouble with that and I'll look into documenting a specific use. After getting the data loaded, it should resolve to using a SaveRDFTransformCommand and an OpenRefine export command using one of the registered RDF Transform exports. Iterate on each data file / project.
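
The workflow described above (load the data, apply the saved transform, export, then repeat per file) can be sketched as a small driver loop. This is only an outline under stated assumptions: `create_project`, `apply_transform`, and `export_rdf` are hypothetical callables supplied by the caller, since the thread does not document the exact RDF Transform endpoints, and the `.ttl` output suffix is just an example.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    data_files: Iterable[Path],
    create_project: Callable[[Path], str],    # hypothetical: returns a project id
    apply_transform: Callable[[str], None],   # hypothetical: saves the shared mapping
    export_rdf: Callable[[str], bytes],       # hypothetical: returns exported RDF
    out_dir: Path,
) -> list[Path]:
    """Load, transform, and export every data file; return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for data_file in sorted(data_files):
        project_id = create_project(data_file)   # one project per data file
        apply_transform(project_id)              # attach the common mapping
        target = out_dir / (data_file.stem + ".ttl")
        target.write_bytes(export_rdf(project_id))
        written.append(target)
    return written
```

Passing the three steps in as functions keeps the loop independent of whichever client or HTTP calls end up performing them.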

thadguidry commented 2 years ago

@tjroamer You might also ask @felixlohmeier if it's already possible and what his thoughts are with his OpenRefine client

tjroamer commented 2 years ago

> @tjroamer You might also ask @felixlohmeier if it's already possible and what his thoughts are with his OpenRefine client

Thanks for pointing out this tool. I tried it and found that it works for projects that use "RDF Extension". I got the same RDF files as those exported from the browser app. Unfortunately, it does not support projects with "RDF Transform".

felixlohmeier commented 2 years ago

If RDF Transform requires specific API calls, then it would be great to document them. I probably won't get around to implementing this until next year. Maybe a maintainer of another client library will be faster.

tjroamer commented 2 years ago

> Indeed, this would be very useful. My current response is to direct you to the OpenRefine API, which allows for just such a thing. In the code, you can look at the server-side commands to see the GET and POST request and response elements. I haven't specifically documented them for API use, but it should be doable.
>
> Let me know if you have any success or trouble with that and I'll look into documenting a specific use. After getting the data loaded, it should resolve to using a SaveRDFTransformCommand and an OpenRefine export command using one of the registered RDF Transform exports. Iterate on each data file / project.

Thanks. I tried to export an existing RDF-Transform project to Turtle. When I exported Turtle in the browser app using Export->RDF Transform->Pretty Exports->RDF as Turtle, the server console window showed [refine] POST /command/core/export-rows/myproject.ttl (9098ms), so I assume that this is the POST command I can use to run the Turtle export in batch mode. I ran the following command in a Postman window (2620164079995 is the project id):

http://localhost:3333/command/core/export-rows/myproject.ttl?project=2620164079995&

but I got the following error:

[screenshot of the error; image not included]

However, the following command works well, and I got the expected models

http://localhost:3333/command/core/get-models?project=2620164079995&

I assume that I was not using the correct API to export rows. You might be able to point out the errors.

Appreciate your help.

AtesComp commented 2 years ago

You're on the right path. I think the export command is either not the full URL or is missing components. It's a POST command, so it also needs to specify the export engine form parameters. See the OpenRefine API Export documentation. I'll take a closer look at it as well.

felixlohmeier commented 2 years ago

Here is an example with cURL that might help: https://gist.github.com/felixlohmeier/d76bd27fbc4b8ab6d683822cdf61f81d#file-templates-sh-L347

tjroamer commented 2 years ago

Thanks. I managed to get it to work. I had forgotten the format parameter in the previous session.

@AtesComp I used the following configuration for the export-rows:

project = 2620164079995
format = turtle

But the result I got is the same as the one exported via the browser app's Export->RDF as Turtle, which is not what I expected. This might not be a surprise, though, because I did not tell OpenRefine anything specific to RDF-Transform. I assume RDF-Transform uses its own engine for the export. It would be good to know which settings are needed to get the same result as Export->RDF Transform->Pretty Exports->RDF as Turtle.

Thanks.

AtesComp commented 2 years ago

The client-side code uses an extra type designator for the format. So the format values look like this (note the space before the parenthesis):

format = "RDF/XML (Pretty)"
format = "Turtle (Pretty)"
format = "Turtle* (Pretty)"
format = "N3 (Pretty)"
...etc.

Types are:

" (Pretty)"
" (Blocks)"
" (Flat)"
" (Binary)"

The type is added only to the export formats it applies to. I took these types from the Jena documentation on stream vs. pretty formats. See RDFFormat. Then there is:

"RDFNull (Test)"

Try that and let me know. For more, the client-side code is at this location. See the constructExportRDF() function for strType and the #exportRDF function.
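
As a small illustration of the naming scheme just described, the format string is a base format name followed by a space and a parenthesized type. The helper below is a hypothetical sketch, not code from the extension; it only checks the type against the four suffixes listed above and does not know which base formats actually accept which type.

```python
# Type designators quoted in the comment above (without the leading space).
EXPORT_TYPES = ("Pretty", "Blocks", "Flat", "Binary")


def export_format(base: str, export_type: str) -> str:
    """Return a format string such as 'Turtle (Pretty)'."""
    if export_type not in EXPORT_TYPES:
        raise ValueError(f"unknown export type: {export_type!r}")
    # The single space before the parenthesis is part of the name.
    return f"{base} ({export_type})"


print(export_format("Turtle", "Pretty"))   # Turtle (Pretty)
print(export_format("RDF/XML", "Pretty"))  # RDF/XML (Pretty)
```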

AtesComp commented 2 years ago

I should probably standardize these export format names on the actual Jena RDFFormat names. What do you think?

tjroamer commented 2 years ago

Great, it worked. I used the following configuration for the POST call:

# POST call
http://localhost:3333/command/core/export-rows
# parameters
project = 2620164079995
format = Turtle (Pretty)
# body, x-www-form-urlencoded
engine = {"facets":[],"mode":"record-based"}

Yes, it makes sense to take the documented Jena RDFFormat names. Thanks.
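
For scripting, the Postman configuration above could be reproduced roughly as follows with Python's standard library. This is a sketch under the same assumptions as the post: a server at `http://localhost:3333`, `project` and `format` sent as URL parameters, and `engine` sent in an x-www-form-urlencoded body. The function names are illustrative, and the format string applies to the release discussed here.

```python
import json
import urllib.parse
import urllib.request


def build_export_request(base_url: str, project_id: str, rdf_format: str) -> urllib.request.Request:
    """Build the export-rows POST with the parameters quoted above."""
    # URL parameters: project id and the RDF Transform format name.
    query = urllib.parse.urlencode({"project": project_id, "format": rdf_format})
    # Form body: the engine configuration as a JSON string.
    body = urllib.parse.urlencode(
        {"engine": json.dumps({"facets": [], "mode": "record-based"})}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/command/core/export-rows?{query}",
        data=body,          # urllib sends this as application/x-www-form-urlencoded
        method="POST",
    )


def export_rdf(base_url: str, project_id: str, rdf_format: str) -> bytes:
    """Send the request and return the exported RDF as bytes."""
    request = build_export_request(base_url, project_id, rdf_format)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With a running server, `export_rdf("http://localhost:3333", "2620164079995", "Turtle (Pretty)")` would return the Turtle output, which can then be written to a file.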

tjroamer commented 2 years ago

One step further towards the batch mode: I am now trying to create a project from an XML file with the OpenRefine API.

I have the following XML sample data:

<design>
  <name>mydesign</name>
  <port>
    <var>
      <name>var1</name>
    </var>
    <var>
      <name>var2</name>
    </var>
  </port>
  <port>
    <var>
      <name>var3</name>
    </var>
  </port>
</design>

The API call I used is:

# POST call
http://localhost:3333/command/core/create-project-from-upload
# body, form-data
project-name: mydesign
project-file: <I PASTED THE XML FILE CONTENT HERE>
format: text/xml
options: {"recordPath": ["design"], "trimStrings": true, "storeEmptyStrings": false}

The call executed successfully, but the created project does not contain any data. I suppose that I did not set the form-data correctly. Could you please take a look? Thanks.
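
One possible cause, offered as a guess rather than a confirmed diagnosis, is that `project-file` was sent as a plain text field instead of a file part. The sketch below builds a multipart/form-data request with a real file part using only the standard library. It assumes a server at `http://localhost:3333` and a CSRF token obtained from `/command/core/get-csrf-token` and passed as `csrf_token`, which recent OpenRefine versions expect. Whether `options` is honoured as a form field may depend on the OpenRefine version. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields: dict, file_field: str, filename: str, content: bytes):
    """Encode text fields plus one file part; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    # The file part carries a filename, which marks it as an upload.
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(header.encode("utf-8") + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def project_id_from_url(url: str) -> str:
    """Extract the id from a URL of the form .../project?project=<id>."""
    return urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)["project"][0]


def create_project(base_url: str, xml_path: Path, options: dict) -> str:
    """Upload one XML file as a new project and return its project id."""
    with urllib.request.urlopen(f"{base_url}/command/core/get-csrf-token") as resp:
        token = json.load(resp)["token"]
    fields = {
        "project-name": xml_path.stem,
        "format": "text/xml",
        "options": json.dumps(options),
    }
    body, content_type = build_multipart(
        fields, "project-file", xml_path.name, xml_path.read_bytes()
    )
    query = urllib.parse.urlencode({"csrf_token": token})
    request = urllib.request.Request(
        f"{base_url}/command/core/create-project-from-upload?{query}",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Assumes the server redirects to the new project page; urllib follows it.
    with urllib.request.urlopen(request) as resp:
        return project_id_from_url(resp.geturl())
```

A call such as `create_project("http://localhost:3333", Path("mydesign.xml"), {"recordPath": ["design"], "trimStrings": True, "storeEmptyStrings": False})` mirrors the form-data shown above, with `mydesign.xml` as a placeholder file name.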

tjroamer commented 2 years ago

An update: I have been able to successfully create my XML project with this OpenRefine Client. Thanks, guys, for your help!

AtesComp commented 2 years ago

That's fantastic.

I'll change the formats for the next release, so that WILL affect any command/core/export-rows calls. I'll add a wiki page for command line / batch processing.

Also, the next release requires OpenRefine 3.6 or later, which supports the updated Jena library. It is not backward compatible: the RDFProto format was not properly supported in the earlier Jena library, and I need to register all export types before RDF Transform is successfully loaded. I wish there were a way to detect which formats are available before registering them, but I currently don't have one.

AtesComp commented 2 years ago

I've updated the code with commit 63ed0bb to finalize the changes to use the RDFFormat strings from Jena to identify the export formats (plus a few other tweaks). See the Batch wiki page for command-line batch processing.