arsarabi / jsonvectorizer

Tools for extracting vector representations of JSON documents
MIT License

Learning nested JSON #1

Open FullPint opened 5 years ago

FullPint commented 5 years ago

Currently, when I run the code, the only schema returned is the one for "root", even though I have over 100,000 documents with many nested attributes.

Currently in vectorizers there are the following:

    basevectorizer.py
    boolvectorizer.py
    numbervectorizer.py
    stringvectorizer.py
    timestampvectorizer.py

Is there something I'm not quite understanding when it comes to "learning" JSON deeper than 'root'?

arsarabi commented 5 years ago

The code should automatically learn the schema of nested documents. There was a bug in the sample code, which I just fixed, that might have caused the issue. Use vectorizer.extend(docs) for learning the schema, where docs is a list of JSON documents, or use vectorizer.extend([doc]) when learning the schema incrementally.
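To illustrate the list-versus-single-document distinction above, here is a minimal standalone sketch of a schema learner whose extend method takes a list of documents and recurses into nested objects. ToySchemaLearner and merge_schema are hypothetical names invented for this example; this is not jsonvectorizer's implementation, only a picture of the behavior described.

```python
from typing import Any, Dict, Iterable


def merge_schema(schema: Dict[str, Any], value: Any) -> Dict[str, Any]:
    """Merge one JSON value into a schema node, recursing into objects and arrays."""
    schema.setdefault("types", set()).add(type(value).__name__)
    if isinstance(value, dict):
        props = schema.setdefault("properties", {})
        for key, child in value.items():
            merge_schema(props.setdefault(key, {}), child)
    elif isinstance(value, list):
        items = schema.setdefault("items", {})
        for child in value:
            merge_schema(items, child)
    return schema


class ToySchemaLearner:
    """Hypothetical stand-in for illustration, NOT jsonvectorizer's code."""

    def __init__(self) -> None:
        self.schema: Dict[str, Any] = {}

    def extend(self, docs: Iterable[Any]) -> None:
        # Expects an iterable of documents, so a single document
        # has to be wrapped in a list: extend([doc]).
        for doc in docs:
            merge_schema(self.schema, doc)


learner = ToySchemaLearner()
learner.extend([{"a": {"b": 1}}])                   # incremental: one document
learner.extend([{"a": {"c": "x"}}, {"d": True}])    # batch: several documents
nested = learner.schema["properties"]["a"]["properties"]
print(sorted(nested))  # ['b', 'c']
```

In this toy version, passing a bare dict to extend would iterate over its keys rather than treat it as a document, which is one reason the wrapping in a list matters here. Whether the library behaves the same way is an assumption.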

jvmk commented 2 years ago

Hello arsarabi,

Thank you for making your code available.

I've also had no luck learning nested attributes. Do I need to define a vectorizer of type "object" to be able to learn nested JSON objects?

Suppose I have a set of documents that match the following schema:

{
  "nestedobject": {
    "stringattr1": "some string",
    "numberattr1": 42,
    "stringattr2": "another string"
  },
  "stringattr3": "a third string",
  "booleanattr1": true
}

...do I need to define additional vectorizers beyond those you provide in the sample code?

If I (only) use the vectorizers provided in the sample code, the only learned features are:

0: root has "booleanattr1"
1: root has "stringattr3"
2: root has "nestedobject"
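For comparison, here is a small standalone sketch that lists the "has" features one might expect if the nested object were traversed. The list_has_features helper and the dotted naming (root.nestedobject) are assumptions made for this example; they are not the library's output format.

```python
import json
from typing import Any, List


def list_has_features(value: Any, name: str = "root") -> List[str]:
    """Return one 'X has "key"' entry per key, descending into nested objects."""
    features: List[str] = []
    if isinstance(value, dict):
        for key in sorted(value):
            features.append(f'{name} has "{key}"')
            # Recurse so attributes of nested objects are listed as well.
            features.extend(list_has_features(value[key], f"{name}.{key}"))
    return features


doc = json.loads("""
{
  "nestedobject": {
    "stringattr1": "some string",
    "numberattr1": 42,
    "stringattr2": "another string"
  },
  "stringattr3": "a third string",
  "booleanattr1": true
}
""")

features = list_has_features(doc)
for index, feature in enumerate(features):
    print(f"{index}: {feature}")
```

With this traversal the example document yields six entries, three of them under root.nestedobject, whereas the output above shows only the three root-level ones.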

Thank you in advance for answering this (very basic) usage question :).

arsarabi commented 2 years ago

Hello,

It has been a while since I worked on this, but I believe it should work with nested JSON out of the box if you follow the usage steps. Could you provide sample code that recreates the issue? Thanks!