frictionlessdata / datapackage-java

A Java library for working with Frictionless Data Data Packages.
MIT License
20 stars 7 forks source link

Migrate out of org.json library #26

Closed wetneb closed 4 years ago

wetneb commented 6 years ago

Using org.json as a JSON library is problematic, as the licence of this library includes an additional clause "The Software shall be used for Good, not Evil."

This is regarded as a non-free licence (it is non OSI-compliant).

That makes it impossible for projects relying on this library (such as OpenRefine) to be OSI-compliant in turn.

rufuspollock commented 5 years ago

@georgeslabreche can you take a look at this as this is a blocker for integrations.

lwinfree commented 5 years ago

The action item for this issue is to replace the org.json library in the datapackage-java code as it has a non OSI-compliant license. We should use a library with an open license such as MIT, Apache 2.0, or others listed on the OSI website.

rufuspollock commented 5 years ago

@lwinfree 👏 👏 it would be great to get this sorted as obviously usage in openrefine is 👍 💯

This should be pretty fast - it is just finding an OSI compliant java json lib and then refactoring to use that. If @roll can't sort this maybe someone at @datopian can help.

iSnow commented 5 years ago

Based on usage stats, it's a choice between either GSON or Jackson, the others are either too old or small projects with relatively little use. For those two projects, long-term survival is not an issue.

I would question whether exposing methods tying the external API to one specific JSON provider is a good design choice, though. I'd think it should use one lib internally and only communicate via JSON strings, not JSON objects. This way, a client is free in choosing their JSON provider.

lwinfree commented 5 years ago

Thanks @iSnow! Do you think you would have time to work on this? Or do you think you could help me estimate how much work this would be? I don't know any Java, so it is hard for me to estimate how long it would take to fix this.

iSnow commented 5 years ago

Sure, @lwinfree.

I am currently in discussion with @roll about an issue I created that deals with streamlining the datapackage API. Part of that is getting the dependency on org.json classes out of the interface so people could use datapackage in their projects and not be tied to one JSON provider.

If that is accepted and the interface changes are landed, I would work on switching over to another JSON provider internally as well - it's not a super small task as the org.json classes are used for a lot of things, among them conversion to CSV. But it's totally doable.

Working on both in parallel is a bit too much for me, even if we split off into a feature branch - since we also have to redo JSON handling in tableschema-java, there would be too many moving parts.

lwinfree commented 5 years ago

@iSnow awesome. It is really great to have your input and help!

rufuspollock commented 4 years ago

@iSnow any update here? It would be so great to get the licensing issue resolved here. This would unblock getting this back in OpenRefine.

iSnow commented 4 years ago

I checked what it would take to migrate, but it is not looking good - my guesstimate would be a solid 3-4 weeks for both Java libs together as the json.org code is woven into the whole logic. Since the organization I work for can live with the strange json.org license, other issues like more rigid parsing, security and decent test coverage have a higher prio. And, unfortunately, working on this in parallel isn't really feasible, too many moving parts.

lwinfree commented 4 years ago

Thanks for the update @iSnow! That is really helpful information, and I understand that you have higher priorities. To try and get some additional help on the issue, we are going to post this on bountysource (if you end up having the time to work on it, please accept the bounty yourself :-) ).

lwinfree commented 4 years ago

Here's the bounty: https://www.bountysource.com/issues/62062088-migrate-out-of-org-json-library

wetneb commented 4 years ago

It's always hard to put a price tag on an issue, but in this case the bounty seems pretty small compared to the task. (just to be clear: I am not trying to bargain, I have no intention to work on this myself)

I checked what it would take to migrate, but it is not looking good - my guesstimate would be a solid 3-4 weeks for both Java libs together as the json.org code is woven into the whole logic.

I agree - for instance the licenses in a Data Package are stored as JSON objects directly, using the underlying library, as the example in the README shows. It would probably be healthier to change this (by introducing the appropriate License class to store this data, with appropriate serialization and deserialization).

Given that migrating out of org.json will probably break a lot of interfaces, it might be the appropriate time to also tackle more profound architectural changes (for instance those suggested in #29). But of course that means more work - it could well mean rewriting most of the code!

So in terms of effort estimate, I would rate this on the same sort of order of magnitude as what it took to write this library in the first place, IMHO.

iSnow commented 4 years ago

Thanks for the write-up, @wetneb - that's exactly it.

I am at liberty to spend some time on improving the code on company costs, but not on topics that don't matter to the org (like this one) and not the amount of hours that would be needed for refactoring both libraries to a different JSON provider. I also like working on it in my free time, but fixing licensing issues is not that interesting, tbh.

What I did with the tableschema-java repo is that I removed all API exposure of json.org classes (replaced with Java types where easily possible and String where not), which is the first step towards moving to Jackson (preferably) or GSON. I'll most likely do this with the datapackages-java code as well, as I wrote in my comment above but that's where my commitment ends.

Consequently I removed myself from this issue - if maybe someone drops by who wants to work on this, please go ahead, but check back here first, so I can give some guidance (eg. start with tableschema-java which is more mature and has better tests right now)

lwinfree commented 4 years ago

That sounds good @iSnow, and thanks for the feedback @wetneb. We're going to look at our budget and increase the bounty to better reflect the amount of work.

lwinfree commented 4 years ago

Hi @iSnow & others - we've updated the bounty amount based on your feedback! Again, thanks for the comments :-)

devmindyg commented 4 years ago

hello @lwinfree and others. I would like to work on this issue, if you guys say it's ok. I have a free time and can dedicate some hours per day for this project (and need extra money ;) ).

The points raised by @iSnow and @wetneb could be a start What do you guys say?

iSnow commented 4 years ago

Sounds great, @devmindyg.

Please use Jackson as a replacement JSON library as I am already relying on it in a small way. Please start with the Tableschema-java project (https://github.com/frictionlessdata/tableschema-java) as this is the more mature project and the Datapackage-java project depends on it. Please check back in with me before tackling the Datapackage-java project, as this needs more work and we need to plan and align our work packages.

To clarify: this is not about just dropping out the json.org library and replacing it with Jackson code, but about improving the library to some extend. The projects were built around taking a JSON object, storing it, taking properties from it, and writing it back out. This leads to a lot of ugly code and switch/case statements as it disregards Java type safety. I changed a lot to make it more idiomatic Java, meaning object inheritance and generic types. Please continue in this spirit.

You might want to start with moving the tests to Jackson so you know you have working tests and then switch the library code to Jackson. Jackson features the ObjectMapper that allows you to throw in a JSON string and get back a Java object, not just a Map<String, Object>.

It probably would be best if you (and I) worked on feature branches and committed to trunk once a work package is done.

wetneb commented 4 years ago

Jackson would indeed be a sensible choice to migrate to. It has some support for JSON schema, so that could potentially replace the current "Everit" implementation: https://github.com/FasterXML/jackson-module-jsonSchema

iSnow commented 4 years ago

To add to the comment by @wetneb the Everit JSON Schema parser library is specific to json.org JSON objects, it cannot be used with Jackson. So the schema validation logic needs to be refactored to use a different JSON Schema validator. The Jackson module mentioned is mostly for generating a JSON Schema from a POJO (which is something tableschema-java supports), and there's also https://github.com/networknt/json-schema-validator for JSON Schema validation.

devmindyg commented 4 years ago

@iSnow @wetneb points taken! @iSnow do you want to work on feature branches directly here on this repo? or can we move on with fork/PR for this refactoring? Today i'm working on tests with jackson on my fork https://github.com/devmindyg/datapackage-java

iSnow commented 4 years ago

Forking/PR is fine, @devmindyg, but please keep in mind you absolutely should start with tableschema-java as it is the lower-level library, it has decent test coverage, and is in a relatively stable state. datapackage-java (the one you cloned) must come later, it is currently in no state for changing the JSON lib.

The bounty isn't really clear about this, but if you only switch the JSON library for datapackage-java, it will still have a transitive dependency on json.org code via tableschema-java, and OpenRefine will still not be able to use datapackage-java.

@lwinfree take note, the scope of the bounty hasn't been clearly defined, if @devmindyg claims the bounty for just refactoring this here library, you do not get what you are looking for.

devmindyg commented 4 years ago

Thanks for the clarification @iSnow i'll start with tableschema-java

devmindyg commented 4 years ago

@iSnow i changed the tests to jackson on tableschema-java. Would you like me to submit a PR to a feature branch on tableschema-java, so you can see the progress?

iSnow commented 4 years ago

Hey @devmindyg thanks for working on the library. I absolutely like what I see, very good work, and please go ahead and submit a PR.

A couple of small things:

So again thanks and you have green light to submit a PR :)

devmindyg commented 4 years ago

Hi @iSnow sorry for the formatting changes, my editor was set to auto format code :( i'll be more careful regarding formatting. I changed newly added assert statements to use jupiter api and submitted a PR to the integration branch.

devmindyg commented 4 years ago

Hello again, @iSnow i started working on jackson refactoring in tableschema-java. There are some explicit type checks for org.json.JSONArray instances, I replaced the type checks for NodeArray from jackson, but as you stated here https://github.com/frictionlessdata/datapackage-java/issues/26#issuecomment-547765201 . It may be a good idea to check for generic ArrayList (or may be another collection?). We can still rely on jackson NodeArray or NodeObject to do the marshalings that we need and easy convert to the desired ArrayList. What do you think? Also, is it ok to keep the communication centralized here?

iSnow commented 4 years ago

Yes, let's keep this the main communication channel.

I'd rather we completely parse the JSON to Java collections/objects and then use type-safe methods to access their properties. Currently, things like Schema have fromJson() methods like initFromSchemaJson(String json) that does a lot of heavy-lifting. Ideally, it could be replaced by methods in ObjectMapper like Schema schema = mapper.readValue(jsonString, Schema.class) - conversely the Schema serialization via getJson() should be replaced by String jsonString = mapper.writeValueAsString(schema);

I am not 100% sure how to replace the fromJson() method in Schema though, because the matching subclass for each field is decided by the forType(String type, String name)-method in Field.java which does some reflection magic. Maybe it would be possible to serialize the field-array via Jackson MixIns. If that fails, using the StdDeserializer should work.

What I'd like to ask you is to either research this and come up with a better idea or to create a branch and to prototypically implement the Schema de-serialization with the help of MixIns or custom Deserializers so that we can see whether that's a route we should take considering effort and code size. The foreign keys aren't the important part, what is essential is the List<Field> fields needs to be the same as for the current de-serialization.

Oh, and please pull in the changes I made in the last two days.

wetneb commented 4 years ago

I haven't had a close look but the forType method sounds like something that could be replaced by a Jackson TypeIdResolver. Its use is explained here: https://www.baeldung.com/jackson-advanced-annotations#jsontypeidresolver

Or if the list of subclasses is fixed (and not meant to be extensible by users), then annotations could do it too: https://www.baeldung.com/jackson-inheritance#2-per-class-annotations

iSnow commented 4 years ago

Thanks, @wetneb, very helpful. Yes, the list of subclasses is fixed and is not user-extensible as the list of subclasses is tied to the schema definition by frictionlessdata.

rufuspollock commented 4 years ago

@wetneb @iSnow @lwinfree any update on progress here? What's needed to get this finished? It would be great to this in ... 😄

iSnow commented 4 years ago

@rufuspollock due to COVID-19, I am currently busy

Amraneze commented 4 years ago

@iSnow, @rufuspollock do you want me to start on tableschema-java ?

wetneb commented 4 years ago

It might be worth checking with @devmindyg first, to see if they got anywhere near completing this.

iSnow commented 4 years ago

@Amraneze You'll have to wait for @devmindyg to declare whether they are still working on it.

rufuspollock commented 4 years ago

@devmindyg could you let us know in the next few days if you are still working on this? If not we are planning to see if someone else could take it on.

rufuspollock commented 4 years ago

@lwinfree could you state in the description of the issue the details of the bounty as the bountysource link is giving an error atm.

jdbranham commented 4 years ago

Sounds like an interesting task. I recently lost some work due to covid, so looking to pick up some side jobs. I'd like to take a shot at it - if @devmindyg is not done.

rufuspollock commented 4 years ago

@jdbranham please go ahead - we have not heard from @devmindyg for a while. /cc @lwinfree @lauragift21

lwinfree commented 4 years ago

Thanks for your interest @jdbranham! @devmindyg thanks for your work on this, but we will need to move ahead here.

jdbranham commented 4 years ago

There were no existing conflicts, so I merged master -> integration and working on the changes in my fork of the integration branch. Thanks team!

jdbranham commented 4 years ago

@iSnow @wetneb - I've got a working prototype using Jackson's @JsonTypeInfo that should simplify the deserialization process in general. I'll check my progress in later this evening.

jdbranham commented 4 years ago

@iSnow @rufuspollock @lwinfree I've got [de]serialization working, but have a question about schema validation.

According to the everit library -

if you use Jackson to handle JSON in Java code, then java-json-tools/json-schema-validator is obviously a better choice, since it uses Jackson if you want to use the org.json API then this library is the better choice

Should I implement schema validation with this library? https://github.com/java-json-tools/json-schema-validator

Is the license acceptable ASL2.0 + LGPL?

jdbranham commented 4 years ago

Actually this one is active, Apache 2.0, and lightweight - https://github.com/networknt/json-schema-validator

I'll go ahead with this, but I've abstracted the schema validation, so it will be easier to switch if necessary.

iSnow commented 4 years ago

Sounds great, @jdbranham. It's clear that we have to move away from the everit library, as this is too tightly bound to the org.json API. I don't have a strong opinion on which JSON-schema lib to use, the one you found seems like a very good choice.

From what I read, you are on a good path using the Jackson features to read/write JSON, so please carry on :)

jdbranham commented 4 years ago

Thanks @iSnow - I ran into a snag with the CSV read/write. The jackson library doesn't support complex objects =\

Workaround for the deserialization process - when the table row is parsed, we can replace the 'speech' quotes the previous library used, with regular double quotes. So it should be backwards compatible.

For serialization, it was a little tricker. Basically we change the Json ArrayNode into a list of JsonNode objects, and for the ObjectNode sub-class, we just serialize the object as a string. All the tests pass, so I guess it's good =) References: https://github.com/FasterXML/jackson-dataformat-csv/issues/9 https://github.com/FasterXML/jackson-dataformats-text/pull/97

Question about one test - when we infer the schema, one of the tests is expecting the field title to be present, but I don't see it set anywhere. I could make the field title match the name during deserialization, but I wanted to make sure this didn't cause any unwanted side-effects.

Let me know what you guys think about this - if it's cool, then I'll fix it, and start on the datapackage-java updates.

https://github.com/frictionlessdata/tableschema-java/compare/master...savantly-net:integration

jdbranham commented 4 years ago

Actually the title isn't considered in equals - it was the constraint initialization. Problem fixed.

https://travis-ci.org/github/frictionlessdata/tableschema-java/builds/679147569

jdbranham commented 4 years ago

PR opened for datapackage-java I set the dependency for tableschema-java to work with local development, so we'll need to update the artifact coordinates to point to the correct version [once the tableschema-java PR is merged].

datapackage-java PR - https://github.com/frictionlessdata/datapackage-java/pull/35 tableschema-java PR - https://github.com/frictionlessdata/tableschema-java/pull/55

lwinfree commented 4 years ago

Hi @iSnow + @jdbranham! Thanks for all the work here! Has https://github.com/frictionlessdata/tableschema-java/pull/55 closed this issue? Or are there other parts remaining?

iSnow commented 4 years ago

hi @lwinfree - not fully, I need to review the PR for datapackage-java, which I will do in the coming days, then it should be good to close the issue and pay the bounty.

jdbranham commented 4 years ago

Do I need to do anything else? =)