@georgeslabreche can you take a look at this as this is a blocker for integrations.
The action item for this issue is to replace the org.json library in the datapackage-java code, as it has a non-OSI-compliant license. We should use a library with an open license such as MIT, Apache 2.0, or others listed on the OSI website.
@lwinfree 👏 👏 it would be great to get this sorted as obviously usage in OpenRefine is 👍 💯
This should be pretty fast - it is just finding an OSI-compliant Java JSON lib and then refactoring to use that. If @roll can't sort this maybe someone at @datopian can help.
Based on usage stats, it's a choice between either GSON or Jackson, the others are either too old or small projects with relatively little use. For those two projects, long-term survival is not an issue.
I would question whether exposing methods tying the external API to one specific JSON provider is a good design choice, though. I'd think it should use one lib internally and only communicate via JSON strings, not JSON objects. This way, a client is free in choosing their JSON provider.
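For illustration, a minimal sketch of what such a string-based boundary could look like (the interface and method names here are hypothetical, not part of the actual datapackage-java API):

```java
// Hypothetical illustration of a JSON-provider-agnostic API boundary:
// callers pass and receive plain JSON strings, and the library is free
// to use Jackson, GSON, or anything else internally.
public interface DescriptorSource {

    /** Returns the descriptor as a raw JSON string. */
    String getDescriptorJson();

    /** Accepts a descriptor as a raw JSON string; parsing happens internally. */
    void setDescriptorJson(String json);
}
```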
Thanks @iSnow! Do you think you would have time to work on this? Or do you think you could help me estimate how much work this would be? I don't know any Java, so it is hard for me to estimate how long it would take to fix this.
Sure, @lwinfree.
I am currently in discussion with @roll about an issue I created that deals with streamlining the datapackage API. Part of that is getting the dependency on org.json classes out of the interface so people could use datapackage in their projects and not be tied to one JSON provider.
If that is accepted and the interface changes are landed, I would work on switching over to another JSON provider internally as well - it's not a super small task as the org.json classes are used for a lot of things, among them conversion to CSV. But it's totally doable.
Working on both in parallel is a bit too much for me, even if we split off into a feature branch - since we also have to redo JSON handling in tableschema-java, there would be too many moving parts.
@iSnow awesome. It is really great to have your input and help!
@iSnow any update here? It would be so great to get the licensing issue resolved here. This would unblock getting this back in OpenRefine.
I checked what it would take to migrate, but it is not looking good - my guesstimate would be a solid 3-4 weeks for both Java libs together, as the json.org code is woven into the whole logic. Since the organization I work for can live with the strange json.org license, other issues like more rigid parsing, security, and decent test coverage have a higher priority. And, unfortunately, working on this in parallel isn't really feasible - too many moving parts.
Thanks for the update @iSnow! That is really helpful information, and I understand that you have higher priorities. To try and get some additional help on the issue, we are going to post this on bountysource (if you end up having the time to work on it, please accept the bounty yourself :-) ).
Here's the bounty: https://www.bountysource.com/issues/62062088-migrate-out-of-org-json-library
It's always hard to put a price tag on an issue, but in this case the bounty seems pretty small compared to the task. (just to be clear: I am not trying to bargain, I have no intention to work on this myself)
> I checked what it would take to migrate, but it is not looking good - my guesstimate would be a solid 3-4 weeks for both Java libs together as the json.org code is woven into the whole logic.
I agree - for instance the licenses in a Data Package are stored as JSON objects directly, using the underlying library, as the example in the README shows. It would probably be healthier to change this (by introducing an appropriate License class to store this data, with appropriate serialization and deserialization).
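A rough sketch of what such a class could look like if bound with Jackson (the class and annotations are hypothetical; the property names follow the Data Package `licenses` spec):

```java
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonProperty;

// Hypothetical value class for one entry of a Data Package "licenses" array,
// so callers deal with a typed object instead of a raw JSON object.
@JsonInclude(JsonInclude.Include.NON_NULL)
public class License {
    @JsonProperty("name")
    private String name;   // e.g. "ODC-PDDL-1.0"

    @JsonProperty("path")
    private String path;   // URL pointing to the license text

    @JsonProperty("title")
    private String title;  // human-readable title

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getPath() { return path; }
    public void setPath(String path) { this.path = path; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
}
```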
Given that migrating out of org.json will probably break a lot of interfaces, it might be the appropriate time to also tackle more profound architectural changes (for instance those suggested in #29). But of course that means more work - it could well mean rewriting most of the code!
So in terms of effort estimate, I would rate this on the same sort of order of magnitude as what it took to write this library in the first place, IMHO.
Thanks for the write-up, @wetneb - that's exactly it.
I am at liberty to spend some time on improving the code on company costs, but not on topics that don't matter to the org (like this one) and not the amount of hours that would be needed for refactoring both libraries to a different JSON provider. I also like working on it in my free time, but fixing licensing issues is not that interesting, tbh.
What I did with the tableschema-java repo is that I removed all API exposure of json.org classes (replaced with Java types where easily possible and String where not), which is the first step towards moving to Jackson (preferably) or GSON. I'll most likely do this with the datapackage-java code as well, as I wrote in my comment above, but that's where my commitment ends.
Consequently, I removed myself from this issue - if someone drops by who wants to work on this, please go ahead, but check back here first so I can give some guidance (e.g. start with tableschema-java, which is more mature and has better tests right now).
That sounds good @iSnow, and thanks for the feedback @wetneb. We're going to look at our budget and increase the bounty to better reflect the amount of work.
Hi @iSnow & others - we've updated the bounty amount based on your feedback! Again, thanks for the comments :-)
Hello @lwinfree and others. I would like to work on this issue, if you guys say it's ok. I have free time and can dedicate some hours per day to this project (and need the extra money ;) ).
The points raised by @iSnow and @wetneb could be a start. What do you guys say?
Sounds great, @devmindyg.
Please use Jackson as the replacement JSON library, as I am already relying on it in a small way. Please start with the tableschema-java project (https://github.com/frictionlessdata/tableschema-java) as this is the more mature project and the datapackage-java project depends on it. Please check back in with me before tackling the datapackage-java project, as it needs more work and we need to plan and align our work packages.
To clarify: this is not about just dropping the json.org library and replacing it with Jackson code, but about improving the library to some extent. The projects were built around taking a JSON object, storing it, reading properties from it, and writing it back out. This leads to a lot of ugly code and switch/case statements, as it disregards Java type safety. I changed a lot to make the code more idiomatic Java, meaning object inheritance and generic types. Please continue in this spirit.
You might want to start with moving the tests to Jackson so you know you have working tests, and then switch the library code to Jackson. Jackson features the ObjectMapper, which lets you throw in a JSON string and get back a Java object, not just a Map<String, Object>.
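A minimal, self-contained illustration of that ObjectMapper round-trip (the JSON snippet is made up; in the real tests you would bind to the library's own classes):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonRoundTripExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Parse a JSON string into a tree for ad-hoc inspection...
        JsonNode node = mapper.readTree("{\"name\":\"id\",\"type\":\"integer\"}");
        System.out.println(node.get("type").asText()); // -> integer

        // ...or bind it straight to a typed object instead of a Map:
        // FieldDescriptor fd = mapper.readValue(json, FieldDescriptor.class);
        // (FieldDescriptor is a placeholder POJO, not an actual library class.)

        // And serialize back out to a JSON string.
        String json = mapper.writeValueAsString(node);
        System.out.println(json);
    }
}
```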
It probably would be best if you (and I) worked on feature branches and committed to trunk once a work package is done.
Jackson would indeed be a sensible choice to migrate to. It has some support for JSON schema, so that could potentially replace the current "Everit" implementation: https://github.com/FasterXML/jackson-module-jsonSchema
To add to the comment by @wetneb the Everit JSON Schema parser library is specific to json.org JSON objects, it cannot be used with Jackson. So the schema validation logic needs to be refactored to use a different JSON Schema validator. The Jackson module mentioned is mostly for generating a JSON Schema from a POJO (which is something tableschema-java supports), and there's also https://github.com/networknt/json-schema-validator for JSON Schema validation.
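For reference, a rough sketch of how validation with the networknt library typically looks (API details may differ between versions; the schema and instance here are made up):

```java
import java.util.Set;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.networknt.schema.JsonSchema;
import com.networknt.schema.JsonSchemaFactory;
import com.networknt.schema.SpecVersion;
import com.networknt.schema.ValidationMessage;

public class SchemaValidationSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Both the schema and the instance live in the Jackson tree model.
        JsonNode schemaNode = mapper.readTree("{\"type\":\"object\",\"required\":[\"name\"]}");
        JsonNode instance = mapper.readTree("{\"title\":\"no name here\"}");

        // Build a validator for a given JSON Schema draft and run it.
        JsonSchemaFactory factory = JsonSchemaFactory.getInstance(SpecVersion.VersionFlag.V7);
        JsonSchema schema = factory.getSchema(schemaNode);
        Set<ValidationMessage> errors = schema.validate(instance);

        errors.forEach(e -> System.out.println(e.getMessage()));
    }
}
```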
@iSnow @wetneb points taken! @iSnow, do you want to work on feature branches directly here on this repo, or can we move on with fork/PR for this refactoring? Today I'm working on tests with Jackson on my fork https://github.com/devmindyg/datapackage-java
Forking/PR is fine, @devmindyg, but please keep in mind you absolutely should start with tableschema-java, as it is the lower-level library, has decent test coverage, and is in a relatively stable state. datapackage-java (the one you cloned) must come later; it is currently in no state for changing the JSON lib.
The bounty isn't really clear about this, but if you only switch the JSON library for datapackage-java, it will still have a transitive dependency on json.org code via tableschema-java, and OpenRefine will still not be able to use datapackage-java.
@lwinfree take note: the scope of the bounty hasn't been clearly defined; if @devmindyg claims the bounty for just refactoring this here library, you will not get what you are looking for.
Thanks for the clarification @iSnow, I'll start with tableschema-java.
@iSnow I changed the tests to Jackson on tableschema-java. Would you like me to submit a PR to a feature branch on tableschema-java, so you can see the progress?
Hey @devmindyg thanks for working on the library. I absolutely like what I see, very good work, and please go ahead and submit a PR.
A couple of small things:
The tests are a mixture of JUnit 4.x and JUnit 5.x (Jupiter). All files with tests I created are 5.x, and the ones I inherited are 4.x, even the individual tests I added there. I know this is not optimal, but I focused on getting the functionality into shape over migrating all tests to 5.x. I see that in DocumentationCases.java you imported org.junit.Assert.assertEquals, which is 4.x. Could you kindly switch it to org.junit.jupiter.api.Assertions.assertEquals? And watch out which files are 4.x and which are 5.x - everything with a @DisplayName annotation is 5.x. If you feel like it, you are free to migrate tests to 5.x, of course; that would be highly appreciated.
Please do not forget to pull in the changes before submitting a PR. There shouldn't be conflicts, but I am not 100% sure, as I pushed a couple of changes since you cloned the repo. I'll hold off pushing further commits until you have submitted the PR. I pulled all changes from master to the integration branch, so you can submit against that branch. If you'd rather I create a feature branch, tell me what you'd like to have it named.
While I applaud your efforts to make the code more readable and conforming to Java standards (spaces before curly braces, tabs vs. spaces and so on), please don't go overboard, as I have to mentally ignore all those lines. I'd be happier if you separated PRs with code changes from PRs with formatting changes - it drastically cuts down on the lines I have to go through, as AFAIK the GitHub diff viewer doesn't allow me to ignore whitespace changes (or does it?). If you feel strongly about code formatting, on the other hand, please go ahead and ignore this comment.
So again thanks and you have green light to submit a PR :)
Hi @iSnow, sorry for the formatting changes, my editor was set to auto-format code :( I'll be more careful regarding formatting. I changed the newly added assert statements to use the Jupiter API and submitted a PR to the integration branch.
Hello again, @iSnow. I started working on the Jackson refactoring in tableschema-java. There are some explicit type checks for org.json.JSONArray instances; I replaced those with checks for Jackson's ArrayNode, but as you stated here https://github.com/frictionlessdata/datapackage-java/issues/26#issuecomment-547765201, it may be a good idea to check for a generic ArrayList (or maybe another collection?) instead. We can still rely on Jackson's ArrayNode or ObjectNode to do the marshaling we need and easily convert to the desired ArrayList. What do you think? Also, is it ok to keep the communication centralized here?
Yes, let's keep this the main communication channel.
I'd rather we completely parse the JSON to Java collections/objects and then use type-safe methods to access their properties. Currently, classes like Schema have fromJson() methods like initFromSchemaJson(String json) that do a lot of heavy lifting. Ideally, this could be replaced by ObjectMapper methods like Schema schema = mapper.readValue(jsonString, Schema.class); conversely, the Schema serialization via getJson() should be replaced by String jsonString = mapper.writeValueAsString(schema);
I am not 100% sure how to replace the fromJson() method in Schema, though, because the matching subclass for each field is decided by the forType(String type, String name) method in Field.java, which does some reflection magic. Maybe it would be possible to serialize the field array via Jackson MixIns. If that fails, using a StdDeserializer should work.
What I'd like to ask you is to either research this and come up with a better idea, or to create a branch and prototypically implement the Schema de-serialization with the help of MixIns or custom Deserializers, so that we can see whether that's a route we should take considering effort and code size. The foreign keys aren't the important part; what is essential is that the resulting List<Field> fields ends up the same as with the current de-serialization.
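One possible shape for the custom-deserializer route, assuming Field exposes a static forType(String, String) factory as described above (the names here come from this thread and are not verified against the code):

```java
import java.io.IOException;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.deser.std.StdDeserializer;

// Keeps the existing forType() factory in charge of choosing the Field
// subclass, so its reflection logic does not need to move into annotations.
public class FieldDeserializer extends StdDeserializer<Field> {

    public FieldDeserializer() {
        super(Field.class);
    }

    @Override
    public Field deserialize(JsonParser p, DeserializationContext ctxt) throws IOException {
        JsonNode node = p.getCodec().readTree(p);
        String type = node.get("type").asText();
        String name = node.get("name").asText();
        // Delegate subclass selection to the existing factory; the concrete
        // Field could then read any remaining properties from the node.
        return Field.forType(type, name);
    }
}
```

It would then be registered on the ObjectMapper via a SimpleModule (addDeserializer).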
Oh, and please pull in the changes I made in the last two days.
I haven't had a close look, but the forType method sounds like something that could be replaced by a Jackson TypeIdResolver. Its use is explained here:
https://www.baeldung.com/jackson-advanced-annotations#jsontypeidresolver
Or if the list of subclasses is fixed (and not meant to be extensible by users), then annotations could do it too: https://www.baeldung.com/jackson-inheritance#2-per-class-annotations
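To make the annotation route concrete, a sketch of the per-class-annotations pattern keyed off the "type" property (the subclass names are placeholders, not the actual tableschema-java classes):

```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// The "type" property in the JSON decides which subclass Jackson instantiates.
@JsonTypeInfo(
        use = JsonTypeInfo.Id.NAME,
        include = JsonTypeInfo.As.EXISTING_PROPERTY,
        property = "type",
        visible = true)
@JsonSubTypes({
        @JsonSubTypes.Type(value = StringField.class, name = "string"),
        @JsonSubTypes.Type(value = IntegerField.class, name = "integer"),
        @JsonSubTypes.Type(value = DateField.class, name = "date")
})
public abstract class Field {
    private String name;
    private String type;
    // getters/setters omitted for brevity
}

// Placeholder subclasses for illustration only.
class StringField extends Field { }
class IntegerField extends Field { }
class DateField extends Field { }
```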
Thanks, @wetneb, very helpful. Yes, the list of subclasses is fixed and is not user-extensible as the list of subclasses is tied to the schema definition by frictionlessdata.
@wetneb @iSnow @lwinfree any update on progress here? What's needed to get this finished? It would be great to get this in ... 😄
@rufuspollock due to COVID-19, I am currently busy
@iSnow, @rufuspollock do you want me to start on tableschema-java?
It might be worth checking with @devmindyg first, to see if they got anywhere near completing this.
@Amraneze You'll have to wait for @devmindyg to declare whether they are still working on it.
@devmindyg could you let us know in the next few days if you are still working on this? If not we are planning to see if someone else could take it on.
@lwinfree could you state in the description of the issue the details of the bounty as the bountysource link is giving an error atm.
Sounds like an interesting task. I recently lost some work due to covid, so looking to pick up some side jobs. I'd like to take a shot at it - if @devmindyg is not done.
@jdbranham please go ahead - we have not heard from @devmindyg for a while. /cc @lwinfree @lauragift21
Thanks for your interest @jdbranham! @devmindyg thanks for your work on this, but we will need to move ahead here.
There were no existing conflicts, so I merged master -> integration and am working on the changes in my fork of the integration branch.
Thanks team!
@iSnow @wetneb - I've got a working prototype using Jackson's @JsonTypeInfo that should simplify the deserialization process in general.
I'll check in my progress later this evening.
@iSnow @rufuspollock @lwinfree I've got [de]serialization working, but have a question about schema validation.
According to the everit library:
> if you use Jackson to handle JSON in Java code, then java-json-tools/json-schema-validator is obviously a better choice, since it uses Jackson; if you want to use the org.json API then this library is the better choice
Should I implement schema validation with this library? https://github.com/java-json-tools/json-schema-validator
Is the license (ASL 2.0 + LGPL) acceptable?
Actually, this one is active, Apache 2.0-licensed, and lightweight - https://github.com/networknt/json-schema-validator
I'll go ahead with this, but I've abstracted the schema validation, so it will be easier to switch if necessary.
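For illustration, such an abstraction could be as small as a single interface that hides the concrete validator (the names here are hypothetical, not taken from the PR):

```java
import java.util.List;

// Hypothetical abstraction over schema validation so the concrete library
// (networknt, everit, ...) stays an implementation detail behind it.
public interface DescriptorValidator {

    /** Validates a JSON descriptor string and returns human-readable error messages. */
    List<String> validate(String descriptorJson);

    /** Convenience check: valid when no error messages were produced. */
    default boolean isValid(String descriptorJson) {
        return validate(descriptorJson).isEmpty();
    }
}
```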
Sounds great, @jdbranham. It's clear that we have to move away from the everit library, as this is too tightly bound to the org.json API. I don't have a strong opinion on which JSON-schema lib to use, the one you found seems like a very good choice.
From what I read, you are on a good path using the Jackson features to read/write JSON, so please carry on :)
Thanks @iSnow - I ran into a snag with the CSV read/write. The Jackson CSV library (jackson-dataformat-csv) doesn't support complex/nested objects =\
Workaround for the deserialization process - when the table row is parsed, we can replace the 'speech' quotes the previous library used with regular double quotes, so it should be backwards compatible.
For serialization, it was a little trickier. Basically, we change the Json ArrayNode into a list of JsonNode objects, and for the ObjectNode sub-class, we just serialize the object as a string. All the tests pass, so I guess it's good =)
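A rough sketch of that flattening idea, using only core Jackson tree-model calls (this illustrates the approach, not the actual PR code; the sample JSON is made up):

```java
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.fasterxml.jackson.databind.node.TextNode;

public class CsvFlattenSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        ArrayNode rows = (ArrayNode) mapper.readTree(
                "[{\"id\":1,\"geo\":{\"lat\":52.5,\"lon\":13.4}}]");

        // Turn the ArrayNode into a list of rows, replacing nested objects
        // with their JSON string form so the CSV writer only sees scalars.
        List<JsonNode> flatRows = new ArrayList<>();
        for (JsonNode row : rows) {
            ObjectNode copy = row.deepCopy();
            List<String> names = new ArrayList<>();
            copy.fieldNames().forEachRemaining(names::add);
            for (String name : names) {
                JsonNode value = copy.get(name);
                if (value.isObject()) {
                    copy.set(name, TextNode.valueOf(mapper.writeValueAsString(value)));
                }
            }
            flatRows.add(copy);
        }
        System.out.println(flatRows); // nested "geo" object is now a JSON string
    }
}
```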
References:
https://github.com/FasterXML/jackson-dataformat-csv/issues/9
https://github.com/FasterXML/jackson-dataformats-text/pull/97
Question about one test - when we infer the schema, one of the tests expects the field title to be present, but I don't see it set anywhere. I could make the field title match the name during deserialization, but I wanted to make sure this doesn't cause any unwanted side-effects.
Let me know what you guys think about this - if it's cool, then I'll fix it and start on the datapackage-java updates.
https://github.com/frictionlessdata/tableschema-java/compare/master...savantly-net:integration
Actually, the title isn't considered in equals - it was the constraint initialization.
Problem fixed.
https://travis-ci.org/github/frictionlessdata/tableschema-java/builds/679147569
PR opened for datapackage-java. I set the dependency on tableschema-java to work with local development, so we'll need to update the artifact coordinates to point to the correct version [once the tableschema-java PR is merged].
datapackage-java PR - https://github.com/frictionlessdata/datapackage-java/pull/35
tableschema-java PR - https://github.com/frictionlessdata/tableschema-java/pull/55
Hi @iSnow + @jdbranham! Thanks for all the work here! Has https://github.com/frictionlessdata/tableschema-java/pull/55 closed this issue? Or are there other parts remaining?
Hi @lwinfree - not fully; I need to review the PR for datapackage-java, which I will do in the coming days, and then it should be good to close the issue and pay the bounty.
Do I need to do anything else? =)
Using org.json as a JSON library is problematic, as the licence of this library includes an additional clause "The Software shall be used for Good, not Evil."
This is regarded as a non-free licence (it is not OSI-compliant).
That makes it impossible for projects relying on this library (such as OpenRefine) to be OSI-compliant in turn.