dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Reader for JSON format files #1345

Closed ggiavelli closed 1 year ago

ggiavelli commented 5 years ago

There is an XML reader. It would be useful to have a reader that supports common JSON formats, with parameters to specify which field holds the ID and which field holds the source text. Possibly also a parameter for the title field, plus a boolean that controls whether the title is included in processing.

reckart commented 5 years ago

Nowadays, instead of pointing people to the XML reader, I would advise them to implement their own reader for their specific XML (or JSON or whatever) format, and then point them to existing readers for similar formats which can be used as templates. Generic readers like the XML reader are very limited in the kind of data they can process. There is a learning curve to implementing one's first reader, of course, but I believe that once that hurdle has been taken, implementing new readers is pretty straightforward. Readers which read one file as one document are the easiest. Readers which read multiple documents from multiple files (i.e. a file can include multiple documents and there can be more than one file) require some thinking, but there are existing readers in DKPro Core which can be taken as role models.
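The "multiple documents from multiple files" case mostly comes down to buffering: open the next file only when the documents of the current one are used up, and skip files that turn out to be empty. The following is a minimal sketch of that control flow in plain Java, not DKPro Core code. The class name `MultiDocCursor` is hypothetical, and each file is modelled as a plain list of document strings so that the example runs without UIMA on the classpath.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper: walks several files, each holding zero or more documents. */
final class MultiDocCursor implements Iterator<String> {
    // Stand-in for the remaining input files; a real reader would iterate resources.
    private final Iterator<List<String>> files;
    // Documents of the current file that have not been handed out yet.
    private final Deque<String> pending = new ArrayDeque<>();

    MultiDocCursor(List<List<String>> filesWithDocs) {
        this.files = filesWithDocs.iterator();
    }

    @Override
    public boolean hasNext() {
        // Keep opening files until a document is buffered or no files remain,
        // so that empty files do not end the iteration early.
        while (pending.isEmpty() && files.hasNext()) {
            pending.addAll(files.next());
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException("No more documents");
        }
        return pending.removeFirst();
    }
}
```

In an actual reader, the same loop would sit behind the reader's own hasNext() and getNext(..) methods, with the file list replaced by whatever the base class provides for stepping through resources.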

ggiavelli commented 5 years ago

Yes, I was looking at the XmlReader as the template. I basically just need to set the ID and source in the JCas object, right?

reckart commented 5 years ago

Ideally, readers extend ResourceCollectionReaderBase or JCasResourceCollectionReader_ImplBase, which already take care of an awful lot of things. They then need to implement getNext(..), whose structure looks like this:

    // The parameter type is CAS when extending ResourceCollectionReaderBase
    // and JCas when extending JCasResourceCollectionReader_ImplBase.
    @Override
    public void getNext(CAS aCas)
        throws IOException, CollectionException
    {
        // Fetch the next input file from the base class
        Resource res = nextFile();

        // This takes care of setting up the DocumentMetaData annotation
        // in a way that DKPro Core writers can work with it. Alternatively
        // you can create the DocumentMetaData annotation yourself. Look
        // at the sources of initCas() to see how.
        initCas(aCas, res);

        // Open the file, transparently handling compressed input
        try (InputStream is = CompressionUtils.getInputStream(res.getLocation(), res.getInputStream())) {
            // Read annotations from stream and add them to the CAS
            // Add document text to the CAS
        }
    }

reckart commented 5 years ago

The XmlReader is ancient and doesn't inherit from ResourceCollectionReaderBase or JCasResourceCollectionReader_ImplBase - it is not a terribly good template I'm afraid. The XmlTextReader would be a better one.

ggiavelli commented 5 years ago

Thanks! If I get the JSON reader working, I'll touch base on how to contribute.

reckart commented 5 years ago

The LxfReader is reading a JSON-based format and might also provide some insight.
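
To make the parameters from the original request concrete (ID field, source text field, title field, and a boolean for including the title), here is a rough sketch of just the field-mapping step in plain Java. It is not part of DKPro Core: the class `JsonFieldMapper` and its method names are hypothetical, and it assumes the JSON record has already been parsed into a `Map` by whatever JSON library the reader uses.

```java
import java.util.Map;

/** Hypothetical mapping step for a JSON reader; names are illustrative, not DKPro Core API. */
final class JsonFieldMapper {
    private final String idField;
    private final String textField;
    private final String titleField;   // may be null if the format has no title
    private final boolean includeTitle;

    JsonFieldMapper(String idField, String textField, String titleField, boolean includeTitle) {
        this.idField = idField;
        this.textField = textField;
        this.titleField = titleField;
        this.includeTitle = includeTitle;
    }

    /** Returns the document ID, failing loudly when the configured field is absent. */
    String documentId(Map<String, ?> record) {
        return required(record, idField);
    }

    /** Returns the text to process, optionally prefixed with the title and a blank line. */
    String documentText(Map<String, ?> record) {
        String text = required(record, textField);
        if (includeTitle && titleField != null) {
            Object title = record.get(titleField);
            if (title != null) {
                return title + "\n\n" + text;
            }
        }
        return text;
    }

    private static String required(Map<String, ?> record, String field) {
        Object value = record.get(field);
        if (value == null) {
            throw new IllegalArgumentException("Missing field: " + field);
        }
        return value.toString();
    }
}
```

Inside getNext(..), the result of documentText(..) would presumably become the document text of the CAS, and the result of documentId(..) would go into the DocumentMetaData annotation after initCas(..) has created it. Keeping the mapping separate from the reader makes it easy to unit-test without a UIMA pipeline.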