BratWriter does not deal properly with filenames that contain spaces

alaindesilets commented 5 years ago

If you have a bunch of Brat files (.txt and .ann) names contain some spaces and you do the followingL

Read them with BratReader
Write them with BratWriter

then the BratWriter will end up saving the .ann files with names that contain %20 instead of the spaces.

The solution would be to modify JCasFileWriter_ImplBase.getRelativePath() so that it URL decodes the file name before returning it.

reckart commented 5 years ago

If this decoding happens unconditionally, wouldn't it be a problem if the original file name contained %20 (and the expectation would be for that to be passed through verbatim)?

alaindesilets commented 5 years ago

Yes, that would be a problem. However I think this is a much less common scenario than the scenario where the file name was URI encoded before being passed to the writer. For one thing, any pipeline of of the following form will encounter that problem, if the file contains spaces in its name:

Read a file with a reader
Annotate it
Write it with a writer

On Tue, Jul 16, 2019 at 2:01 AM Richard Eckart de Castilho < notifications@github.com> wrote:

If this decoding happens unconditionally, wouldn't it be a problem if the original file name contained %20 (and the expectation would be for that to be passed through verbatim)?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1392?email_source=notifications&email_token=AAIMA4CRBEOHC6RRBN4LFV3P7VP5NA5CNFSM4IDYPS42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ7ZJ7Y#issuecomment-511677695, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4AQUON3HYL2M2I5P3LP7VP5NANCNFSM4IDYPS4Q .

alaindesilets commented 5 years ago

On Tue, Jul 16, 2019 at 6:39 AM Alain Désilets alaindesilets0@gmail.com wrote:

Yes, that would be a problem. However I think this is a much less common scenario than the scenario where the file name was URI encoded before being passed to the writer. For one thing, any pipeline of of the following form will encounter that problem, if the file contains spaces in its name:

Read a file with a reader

Annotate it

Write it with a writer

BTW.... with a pipeline of this sort, if a file originally has an actual %20 in it (say, a file called "get_20%_discout.txt") then there is no problem because the reader with transform the file name into "get_20%25_discount.txt" which the writer will convert back to "get_20%_discount.txt". The only situation where a problem could arise is (I think) if a pipeline creates a JCas without going through a reader, and sets the file name without URI escaping it.

On Tue, Jul 16, 2019 at 2:01 AM Richard Eckart de Castilho < notifications@github.com> wrote:

If this decoding happens unconditionally, wouldn't it be a problem if the original file name contained %20 (and the expectation would be for that to be passed through verbatim)?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1392?email_source=notifications&email_token=AAIMA4CRBEOHC6RRBN4LFV3P7VP5NA5CNFSM4IDYPS42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ7ZJ7Y#issuecomment-511677695, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4AQUON3HYL2M2I5P3LP7VP5NANCNFSM4IDYPS4Q .

reckart commented 5 years ago

Right. I thought we'd conditionally URL-encode the filename, but it looks like we always URL-encode, so unconditionally decoding it should be reasonable.

alaindesilets commented 5 years ago

As I am looking at getRelativePath() some more, I see that there is an if branch based on the value of useDocumentId.

The bottom branch (useDocumentId = true) does URLEncoder.encode(). But from what I have seen, the BaseURI and DocURI are already URI encoded. I don't know enough to tell if the result of getDocumentId() is already encoded or not. If it is already encoded, then you will end up with doubly encoded file name.

I commented out the encode line and ran the test and all pass. But I am not confident that the tests actually cover situations where this line of code would be needed.

Based on your knowledge of that code, should this line be left there?

On Tue, Jul 16, 2019 at 3:50 PM Richard Eckart de Castilho < notifications@github.com> wrote:

Right. I thought we'd conditionally URL-encode the filename, but it looks like we always URL-encode, so unconditionally decoding it should be reasonable.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1392?email_source=notifications&email_token=AAIMA4GSZAL7CTSJ3VZ7SWLP7YRA3A5CNFSM4IDYPS42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2B6KMY#issuecomment-511960371, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4FLVTT5GNAGUXF3UELP7YRA3ANCNFSM4IDYPS4Q .

reckart commented 5 years ago

The JCasFileWriter_ImplBase used the document ID to construct the file name if baseUri/docUri are not available or if force via PARAM_USE_DOCUMENT_ID.

The document ID is by default URL-encoded, but this can be turned off via PARAM_ESCAPE_DOCUMENT_ID.

So for scenarios where one doesn't recursively process directories, there is currently already the option to set PARAM_USE_DOCUMENT_ID to true and PARAM_ESCAPE_DOCUMENT_ID to false and this should cause spaces in original filenames to be preserved in output filenames.

But for recursive processing (using baseURI / docURI), we observe the problem you mention, namely that spaces (and other characters) are unconditionally escaped. So we could unconditionally unescape the URIs in the recursive processing case only. We could also add another parameter like PARAM_UNESCAPE_URI to control unescaping in this case. Or we could rename the PARAM_ESCAPE_DOCUMENT_ID to PARAM_ESCAPE_FILENAME and consider it for both, the recursive processing case as well as the document ID case.

alaindesilets commented 5 years ago

Given that this is a FILE writer (as opposed to something that might write to a remote URI), I wonder what the point of URI encoding the file names in any circumstances.

But assuming there are use cases where we want the file names to be URI encoded, then I guess it should be done uniformly whether we are recursively scanning a directory or just processing a single file.

So I vote for your last suggestion, namely:

rename the PARAM_ESCAPE_DOCUMENT_ID to PARAM_ESCAPE_FILENAME and consider it for both, the recursive processing case as well as the document ID case.

Additionally, I vote for making the default value of this parameter be false, because I suspect the more common use case will be one where the dev wants to keep the file names as they were originaly, i.e. not URI encoded.

Alain

On Wed, Jul 17, 2019 at 7:08 AM Richard Eckart de Castilho < notifications@github.com> wrote:

The JCasFileWriter_ImplBase used the document ID to construct the file name if baseUri/docUri are not available or if force via PARAM_USE_DOCUMENT_ID.

The document ID is by default URL-encoded, but this can be turned off via PARAM_ESCAPE_DOCUMENT_ID.

So for scenarios where one doesn't recursively process directories, there is currently already the option to set PARAM_USE_DOCUMENT_ID to true and PARAM_ESCAPE_DOCUMENT_ID to false and this should cause spaces in original filenames to be preserved in output filenames.

But for recursive processing (using baseURI / docURI), we observe the problem you mention, namely that spaces (and other characters) are unconditionally escaped. So we could unconditionally unescape the URIs in the recursive processing case only. We could also add another parameter like PARAM_UNESCAPE_URI to control unescaping in this case. Or we could rename the PARAM_ESCAPE_DOCUMENT_ID to PARAM_ESCAPE_FILENAME and consider it for both, the recursive processing case as well as the document ID case.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1392?email_source=notifications&email_token=AAIMA4EQPSI5577XHUX4RTDP734UDA5CNFSM4IDYPS42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2D3BUQ#issuecomment-512209106, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4BJMEDX27IMI2P3ELDP734UDANCNFSM4IDYPS4Q .

alaindesilets commented 5 years ago

I pushed changes to the pull request branch that implements that approach. If you do not like it, let me know and I will change it again.

On Wed, Jul 17, 2019 at 11:15 AM Alain Désilets alaindesilets0@gmail.com wrote:

Given that this is a FILE writer (as opposed to something that might write to a remote URI), I wonder what the point of URI encoding the file names in any circumstances.

But assuming there are use cases where we want the file names to be URI encoded, then I guess it should be done uniformly whether we are recursively scanning a directory or just processing a single file.

So I vote for your last suggestion, namely:

rename the PARAM_ESCAPE_DOCUMENT_ID to PARAM_ESCAPE_FILENAME and consider it for both, the recursive processing case as well as the document ID case.

Additionally, I vote for making the default value of this parameter be false, because I suspect the more common use case will be one where the dev wants to keep the file names as they were originaly, i.e. not URI encoded.

Alain

On Wed, Jul 17, 2019 at 7:08 AM Richard Eckart de Castilho < notifications@github.com> wrote:

The JCasFileWriter_ImplBase used the document ID to construct the file name if baseUri/docUri are not available or if force via PARAM_USE_DOCUMENT_ID.

The document ID is by default URL-encoded, but this can be turned off via PARAM_ESCAPE_DOCUMENT_ID.

So for scenarios where one doesn't recursively process directories, there is currently already the option to set PARAM_USE_DOCUMENT_ID to true and PARAM_ESCAPE_DOCUMENT_ID to false and this should cause spaces in original filenames to be preserved in output filenames.

But for recursive processing (using baseURI / docURI), we observe the problem you mention, namely that spaces (and other characters) are unconditionally escaped. So we could unconditionally unescape the URIs in the recursive processing case only. We could also add another parameter like PARAM_UNESCAPE_URI to control unescaping in this case. Or we could rename the PARAM_ESCAPE_DOCUMENT_ID to PARAM_ESCAPE_FILENAME and consider it for both, the recursive processing case as well as the document ID case.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1392?email_source=notifications&email_token=AAIMA4EQPSI5577XHUX4RTDP734UDA5CNFSM4IDYPS42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2D3BUQ#issuecomment-512209106, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4BJMEDX27IMI2P3ELDP734UDANCNFSM4IDYPS4Q .

reckart commented 5 years ago

Given that this is a FILE writer (as opposed to something that might write to a remote URI), I wonder what the point of URI encoding the file names in any circumstances.

Right, it is a file writer, but the origin of the data is not always a file. Anyway, let's make the change and if somebody stumbles over it, we can reconsider. I would agree that for most people, unescaping is more likely to be the expected scenario.

dkpro / dkpro-core

BratWriter does not deal properly with filenames that contain spaces #1392