dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core
Other
196 stars 67 forks source link

Unescape %20 in file names in JCasFileWriter_ImplBase #1394

Closed alaindesilets closed 5 years ago

alaindesilets commented 5 years ago

This is my proposed fix for this issue:

https://github.com/dkpro/dkpro-core/issues/1392

ukp-svc-jenkins commented 5 years ago

Can one of the admins verify this patch?

reckart commented 5 years ago

The title of the PR said that the idea is to unescape the URL-encoded filename. But the PR code actually seems to encode the URL-encoded filename once more using HTML/XML ampersand encoding. Am I misunderstanding the change?

Would you mind setting up a unit test to check the change actually does what is intended?

alaindesilets commented 5 years ago

On Tue, Jul 16, 2019 at 3:44 PM Richard Eckart de Castilho < notifications@github.com> wrote:

The title of the PR said that the idea is to unescape the URL-encoded filename. But the PR code actually seems to encode the URL-encoded filename once more using HTML/XML ampersand encoding. Am I misunderstanding the change?

Oups... it shoudl have been StringEscapeUtils.unescapeHtml4().

I will change push a correction to the branch.

Would you mind setting up a unit test to check the change actually does what is intended?

Will do.

alaindesilets commented 5 years ago

On Wed, Jul 17, 2019 at 3:53 PM Richard Eckart de Castilho < notifications@github.com> wrote:

@reckart commented on this pull request.


In dkpro-core-api-io-asl/src/main/java/org/dkpro/core/api/io/JCasFileWriter_ImplBase.java https://github.com/dkpro/dkpro-core/pull/1394#discussion_r304612906:

@@ -295,20 +294,22 @@ protected String getRelativePath(JCas aJCas)

         if (stripExtension) {

             relativeDocumentPath = FilenameUtils.removeExtension(relativeDocumentPath);

         }

-

  • if (escapeDocumentId) {

  • try {

  • relativeDocumentPath = URLEncoder.encode(relativeDocumentPath, "UTF-8");

  • }

  • catch (UnsupportedEncodingException e) {

  • // UTF-8 must be supported on all Java platforms per specification. This should

  • // not happen.

  • throw new IllegalStateException(e);

  • }

  • }

  • try {

  • if (escapeFilename) {

  • relativeDocumentPath = URLEncoder.encode(relativeDocumentPath, "UTF-8");

I think we need the encode case only for the branch where documentID is used and the decode case only for the branch where baseURI/documentURI is used. If we leave it out here, then having PARAM_ESCAPE_FILENAME enabled would double-encode in the baseURI/documentURI case, right?

I don't really understand the difference between the document ID versus document URI cases to be sure. But from what you explained to me, it sounds like the document URI case kicks in when we are recursively scanning a directory and processing every file it contains, while the document ID case kicks in when we only feed a single file.

Given that, I would argue that the escape/unescape behaviour should be the same no matter which of the two cases is being run. For example, if I have a pipeline that ends with a JCasFileWriter_ImplBase, then the escaping/unescaping behaviour should be the same no matter if I invoke the pipe:

Alain

You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/pull/1394?email_source=notifications&email_token=AAIMA4FSARDREN42XSDFGT3P752CBA5CNFSM4ID7XKBKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB6YVMEQ#pullrequestreview-263280146, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4GQW4JN6HKCUQYHR6DP752CBANCNFSM4ID7XKBA .

reckart commented 5 years ago

The URI case is the most common case - the documentID case is the odd case.

Most reader components (in any case those based on ResourceCollectionReaderBase) set the baseURI/documentURI fields. But there are some cases where no URI is available and where it may not be sensible to construct one, e.g. when reading data from a database. For these cases, we have collectionID/documentID.

So you see, it is not directly related to processing single file or multiple files. That said, processing folders recursively is only supported when using the URIs, not when using collectionID/documentID.

alaindesilets commented 5 years ago

Let me try to wrap my head around the two use cases.

The URI case is the most common case - the documentID case is the odd case.

Most reader components (in any case those based on ResourceCollectionReaderBase set the baseURI/documentURI fields.

That's what I thought. And from what I have seen, in those cases, the URI will have been URI encoded, right? So in this case, it makes sense to always decode the file name in the writer.

In that particular use case, would there be any call for setting a parameter to ask for the filename to be URI encoded? I don't think so.

But there are some cases where no URI is available and where it may not be sensible to construct one, e.g. when reading data from a database.

I get that. And I presume that in this case, the document ID will not be URI encoded, so there is no need to decode it.

Moreover, given that the docID may have been created by code that didn't know it would eventually be used to create a file name, it could be that the id contains characters that are illegal in a file name.

Given that, it seems to me that we should ALWAYS escape illegal file characters from the docID in order to create a file name.

So here also, I am not sure why we need an option to say when to encode the docID, because we will always be encoding it in the docID case, and never encoding it in the URI case.

Also, it seems to me that URI encoding the file name in the docID is not quite what we want, because spaces are valid file name characters even if they are not valid URI characters.

So here is what I propose:

Does that make sense?

reckart commented 5 years ago

Jenkins, can you test this please?

ukp-svc-jenkins commented 5 years ago

69% (-4.79%) vs master 73%