dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

FileSetCollectionReaderBase sets collectionId wrong? #52

Closed reckart closed 9 years ago

reckart commented 9 years ago
The documentBaseUri and collectionId seem to be out of sync in FileSetCollectionReaderBase:

            // Set the document metadata
            DocumentMetaData docMetaData = new DocumentMetaData(aCas.getJCas());
            File file = aFile.getFile();
            docMetaData.setDocumentTitle(file.getName());
            docMetaData.setDocumentUri(file.toURI().toString());
            docMetaData.setDocumentId(aFile.getName());
            if (aFile.getBaseDir() != null) {
                docMetaData.setDocumentBaseUri(path.toURI().toString());
                docMetaData.setCollectionId(aFile.getBaseDir().getPath());
            }

I suppose the collectionId should resemble the documentBaseUri here.
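
A minimal sketch of what "in sync" could look like: both values are derived from one base computed a single time. `FileMetaSketch` and its `of` method are hypothetical names for illustration only; they are not part of DKPro Core and do not reflect how `DocumentMetaData` is implemented.

```java
import java.io.File;

// Hypothetical stand-in for the metadata fields discussed above.
// This is NOT the DKPro Core DocumentMetaData class.
final class FileMetaSketch {
    final String documentUri;
    final String documentBaseUri;
    final String collectionId;

    private FileMetaSketch(String documentUri, String documentBaseUri, String collectionId) {
        this.documentUri = documentUri;
        this.documentBaseUri = documentBaseUri;
        this.collectionId = collectionId;
    }

    // Compute the base once and reuse it, so the base URI and the
    // collection ID cannot drift apart.
    static FileMetaSketch of(File baseDir, File file) {
        String base = baseDir.toURI().toString();
        return new FileMetaSketch(file.toURI().toString(), base, base);
    }

    public static void main(String[] args) {
        File baseDir = new File("corpus");
        FileMetaSketch meta = of(baseDir, new File(baseDir, "doc1.txt"));
        System.out.println("base URI:      " + meta.documentBaseUri);
        System.out.println("collection ID: " + meta.collectionId);
    }
}
```

Whether the shared value should be a URI or a plain path is a separate decision; the sketch only shows both fields being fed from the same source.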

Original issue reported on code.google.com by richard.eckart on 2012-04-12 10:16:50

reckart commented 9 years ago
This is how ResourceCollectionReaderBase does it.

        String qualifier = aQualifier != null ? "#"+aQualifier : "";
        // Set the document metadata
        DocumentMetaData docMetaData = new DocumentMetaData(aCas.getJCas());
        docMetaData.setDocumentTitle(new File(aResource.getPath()).getName());
        docMetaData.setDocumentUri(aResource.getResolvedUri().toString()+qualifier);
        docMetaData.setDocumentId(aResource.getPath()+qualifier);
        if (aResource.getBase() != null) {
            docMetaData.setDocumentBaseUri(aResource.getResolvedBase());
            docMetaData.setCollectionId(aResource.getResolvedBase()+qualifier);
        }

It also looks strange that the qualifier is appended to the collectionId as well
as to the documentId. I think it should only be appended to the documentId.
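
As a rough illustration of that suggestion, the sketch below keeps the qualifier on the per-document fields and leaves the collection-level fields untouched. `ResourceMetaSketch` is a hypothetical class, and treating the resolved base and path as plain strings (with the document URI formed by concatenating them) is an assumption made for the example, not the actual ResourceCollectionReaderBase code.

```java
// Hypothetical stand-in for the metadata fields discussed above.
// This is NOT the DKPro Core DocumentMetaData class.
final class ResourceMetaSketch {
    final String documentUri;
    final String documentId;
    final String documentBaseUri;
    final String collectionId;

    private ResourceMetaSketch(String documentUri, String documentId,
            String documentBaseUri, String collectionId) {
        this.documentUri = documentUri;
        this.documentId = documentId;
        this.documentBaseUri = documentBaseUri;
        this.collectionId = collectionId;
    }

    // Assumption: resolvedBase and path are plain strings and the document
    // URI is simply their concatenation.
    static ResourceMetaSketch of(String resolvedBase, String path, String qualifier) {
        String suffix = qualifier != null ? "#" + qualifier : "";
        return new ResourceMetaSketch(
                resolvedBase + path + suffix, // per-document: keeps the qualifier
                path + suffix,                // per-document: keeps the qualifier
                resolvedBase,                 // collection-level: no qualifier
                resolvedBase);                // collection-level: no qualifier
    }

    public static void main(String[] args) {
        ResourceMetaSketch meta = of("file:/corpus/", "a.xml", "sec1");
        System.out.println(meta.documentId);   // a.xml#sec1
        System.out.println(meta.collectionId); // file:/corpus/
    }
}
```

With this shape, two documents split out of the same resource by different qualifiers still report the same collectionId.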

FileSetCollectionReaderBase should also be changed to use "path" as the collectionId,
I suppose.

Changing this could break existing user code though.

Original issue reported on code.google.com by richard.eckart on 2012-04-12 10:20:03

reckart commented 9 years ago
Removed qualifier from collectionId.
---
Committed revision 648.

Original issue reported on code.google.com by richard.eckart on 2012-05-12 12:56:26