Closed ggiavelli closed 1 year ago
Nowadays, instead of pointing people to the XML reader, I would advise them to implement their own reader for their specific XML (or JSON or whatever) format - and then point them to existing readers for similar formats which could be used as templates. Generic readers like the XML reader are very limited in the kind of data they can process. There is a learning curve of course to implementing one's first reader, but I believe once that hurdle has been taken, implementing new readers is pretty straightforward. Readers which read one file as one document are the easiest. Readers which read multiple documents from multiple files (i.e. a file can include multiple documents and there can be more than one file) require some thinking, but there are existing readers in DKPro Core which can be taken as role models.
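To make the "multiple documents from multiple files" case a bit more concrete, here is a minimal sketch in plain Java of the bookkeeping such a reader needs. It is not DKPro Core code: the class name MultiDocumentCursor and the one-document-per-line file layout are assumptions made purely for illustration. The point is that hasNext() has to look ahead and move on to the next file whenever the current one runs out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper: walks over several files, treating each non-empty line as one document. */
final class MultiDocumentCursor implements AutoCloseable {
    private final Deque<Path> pendingFiles; // files not opened yet
    private BufferedReader current;         // file currently being consumed, or null
    private String nextDocument;            // look-ahead buffer, null if nothing is buffered

    MultiDocumentCursor(List<Path> files) {
        this.pendingFiles = new ArrayDeque<>(files);
    }

    /** Returns true while any remaining file still contains a document. */
    boolean hasNext() throws IOException {
        while (nextDocument == null) {
            if (current == null) {
                if (pendingFiles.isEmpty()) {
                    return false; // nothing open and nothing left to open
                }
                current = Files.newBufferedReader(pendingFiles.removeFirst(),
                        StandardCharsets.UTF_8);
            }
            String line = current.readLine();
            if (line == null) {
                // Current file is exhausted: close it and try the next one.
                current.close();
                current = null;
            } else if (!line.trim().isEmpty()) {
                nextDocument = line;
            }
        }
        return true;
    }

    /** Returns the buffered document and clears the look-ahead buffer. */
    String next() throws IOException {
        if (!hasNext()) {
            throw new NoSuchElementException("No more documents");
        }
        String document = nextDocument;
        nextDocument = null;
        return document;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}
```

In an actual reader, the look-ahead in hasNext() would sit in the reader's own hasNext(), and getNext(..) would hand each buffered document to the CAS. Empty files are skipped without any special casing because the loop simply moves on to the next file.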
Yes, I was looking at the XmlReader as the template. I basically just need to set the ID and source in the JCas object, right?
Gianna Giavelli skype: Gia.Giavelli
Optimally, readers extend ResourceCollectionReaderBase or JCasResourceCollectionReader_ImplBase, which already take care of an awful lot of things. They then need to implement getNext(..), and its structure looks like this:
    @Override
    // The parameter is a JCas when extending JCasResourceCollectionReader_ImplBase;
    // use CAS instead when extending ResourceCollectionReaderBase.
    public void getNext(JCas aJCas)
        throws IOException, CollectionException
    {
        Resource res = nextFile();

        // This takes care of setting up the DocumentMetaData annotation
        // in a way that DKPro Core writers can work with it. Alternatively
        // you can create the DocumentMetaData annotation yourself. Look
        // at the sources of initCas() to see how.
        initCas(aJCas, res);

        // The stream is closed automatically at the end of the try block.
        try (InputStream is = CompressionUtils.getInputStream(res.getLocation(),
                res.getInputStream())) {
            // Read annotations from stream and add them to the CAS
            // Add document text to the CAS
        }
    }
The XmlReader is ancient and doesn't inherit from ResourceCollectionReaderBase or JCasResourceCollectionReader_ImplBase - it is not a terribly good template I'm afraid. The XmlTextReader would be a better one.
Thanks! If I get the JSON reader working, I'll touch base on how to contribute.
The LxfReader reads a JSON-based format and might also provide some insight.
There is an XML reader. It would be useful to have a reader that supports the common JSON format, with parameters to specify which field is the ID and which is the source (and perhaps which field is the title, plus a boolean for whether to include the title in processing?).
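As a rough illustration of the parameters asked for here, the sketch below shows one way the field mapping could work in plain Java. Everything in it is hypothetical: the class JsonFieldMapping, its field names, and the extra textField parameter (the request only names ID, source and title) are assumptions, and the JSON object is assumed to have already been parsed into a Map by whatever JSON library the reader uses.

```java
import java.util.Map;

/** Hypothetical settings object: which JSON keys hold the ID, source, text and title. */
final class JsonFieldMapping {
    private final String idField;
    private final String sourceField;
    private final String textField;   // assumption: the request does not name a text field
    private final String titleField;
    private final boolean includeTitle;

    JsonFieldMapping(String idField, String sourceField, String textField,
            String titleField, boolean includeTitle) {
        this.idField = idField;
        this.sourceField = sourceField;
        this.textField = textField;
        this.titleField = titleField;
        this.includeTitle = includeTitle;
    }

    /** Value to use as the document ID. */
    String documentId(Map<String, ?> record) {
        return required(record, idField);
    }

    /** Value to use as the document source, e.g. a URI. */
    String documentSource(Map<String, ?> record) {
        return required(record, sourceField);
    }

    /** Text to process: the body, optionally preceded by the title and a blank line. */
    String documentText(Map<String, ?> record) {
        String body = required(record, textField);
        Object title = record.get(titleField);
        if (!includeTitle || title == null) {
            return body;
        }
        return title + "\n\n" + body;
    }

    // Fails loudly on a missing key instead of silently producing "null" strings.
    private static String required(Map<String, ?> record, String key) {
        Object value = record.get(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing field: " + key);
        }
        return value.toString();
    }
}
```

In a real reader these five settings would presumably be exposed as configuration parameters, and the three accessor results would feed the document ID, the document URI and the document text respectively.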