OpenClinica / enketo-oc

OpenClinica's fork of the Enketo web forms monorepo
Apache License 2.0
0 stars 5 forks source link

Fixed upload file/image with character # in filename #205

Closed gushil closed 2 months ago

gushil commented 6 months ago

Closes #204

I have verified this PR works with

What else has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Do we need any specific form for testing your changes? If so, please attach one.

gushil commented 5 months ago

What is the actual bug (how is the filename with # treated?), and what causes it (in Enketo's code)?

The issue is existed when we are editing the form.

Uploading the file (Submitted successfully) image

Close the form and load the form again image

The problem is because the file with # char is not saved into instanceAttachments correctly.

We can see here the part after # is cutted off image

MartijnR commented 5 months ago

We can see here the part after # is cutted off

Where in Enketo is it cut off?

gushil commented 5 months ago

We can see here the part after # is cutted off

Where in Enketo is it cut off?

instanceAttachments keys should be full file name with postfix. Because it is cut off, when the form loaded and trying to get the right key, we can't get it and we can't get the right file from instanceAttachments.

Thanks.

MartijnR commented 5 months ago

Because it is cut off, when the form loaded

So data-loaded-file-name is set incorrectly somewhere? Where? Is it here or somewhere else? Why is it cut off? Is value not correct? If so, why?

gushil commented 5 months ago

Hi @MartijnR

I finally can attach the file attachment through curl and found the problem is caused by enketo-transformer escapeURLPath method that is called by enketo-express media createMediaURL and escapeFileName methods in getMediaMapmethod.

Should we modify escapeURLPath or is it better to modify either createMediaURL and escapeFileName to not use escapeURLPath ?

Thanks.

MartijnR commented 5 months ago

Great find! That made me suspect this issue may also be present in enketo/enketo, and I quickly confirmed that using kobotoolbox.org online. I filed the issue here: https://github.com/enketo/enketo/issues/1324

It would be great if you make your PR there and then we merge it here (or if we cannot wait for approval create PRs for both repos).

Should we modify escapeURLPath or is it better to modify either createMediaURL and escapeFileName to not use escapeURLPath

I'm not sure. What is the escapeURLPath method turning the URL with# into?

gushil commented 5 months ago

Great find! That made me suspect this issue may also be present in enketo/enketo, and I quickly confirmed that using kobotoolbox.org online. I filed the issue here: enketo#1324

It would be great if you make your PR there and then we merge it here (or if we cannot wait for approval create PRs for both repos).

Should we modify escapeURLPath or is it better to modify either createMediaURL and escapeFileName to not use escapeURLPath

I'm not sure. What is the escapeURLPath method turning the URL with# into?

@MartijnR

With mediaPath = /media/get/1/a/ECG#1.png, transformer.escapeURLPath(mediaPath) turned mediaPath into /media/get/1/a/ECG

Thanks.

MartijnR commented 5 months ago

Probably best to figure out what the intention of the escapeURLPath function is (maybe initially intended for form media?), and decide based on that.

MartijnR commented 5 months ago

@gushil, are you sure we are passing a correct URL in the API call without a fragment identifier/hash component?

I'm suddenly wondering if the bug is not in Enketo but in OpenClinica and KoBoToolbox due to not encoding special URL characters in the filenames when making the API call to Enketo.....

incorrect:

Screenshot 2024-06-17 at 5 38 20 PM

correct:

Screenshot 2024-06-17 at 5 39 19 PM
gushil commented 5 months ago

Hi @MartijnR

I've tested the issue with enketo-oc and centro in my local setup, and the behaviour is similar with the one deployed in the server.

Thanks.

MartijnR commented 5 months ago

@gushil, so if I understand you correctly, you are already url-encoding the filename part of the URL you are sending to Enketo in the instance_attachments part of your curl request when reproducing this with Centro?

gushil commented 5 months ago

No. I just send the file name as it is like what happened when it is manually uploaded with the enketo ui because when I debug uploading the file in enketo ui, no url-encoding happened.

Am I wrong doing that?

Really appreciated your suggestion.

Thanks.

On Wed, Jun 19, 2024, 18:19 Martijn van de Rijdt @.***> wrote:

@gushil https://github.com/gushil, so if I understand you correctly, you are already url-encoding the filename part of the URL you are sending to Enketo in the instance_attachments part of your curl request when reproducing this with Centro?

— Reply to this email directly, view it on GitHub https://github.com/OpenClinica/enketo-oc/pull/205#issuecomment-2178427616, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAOX627R6BSWNJKGBMB2JTZIFSLPAVCNFSM6AAAAABIWXG7WGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZYGQZDONRRGY . You are receiving this because you were mentioned.Message ID: @.***>

MartijnR commented 5 months ago

The ‘#’ is a reserved character (for a fragment identifier ) in a URI, so it needs to be encoded when it is part of a URL (e.g. in the API request as the value for an instance_attachment item). Enketo doesn’t send it as a URL during submission so doesn’t have to encode it. Did you look closely at the 2 screenshots I posted?

If this is not clear please send the complete curl snippet you are using and I will suggest a change.

gushil commented 5 months ago

Hi @MartijnR

Yes, I saw your screenshots. but I wonder if we upload the file to the form, is it uploading with full path (like https://openclinica.com/media/get/1/a/ECG#1.png) ?

Thanks.

MartijnR commented 5 months ago

but I wonder if we upload the file to the form, is it uploading with full path (like https://openclinica.com/media/get/1/a/ECG#1.png) ?

That URL is not crafted by Enketo. The URL is provided by OpenClinica when it makes the API call (using instance_attachments) Enketo just proxies that URL via the Enketo server (adding the media/get etc).

Enketo only submits the filename. So it's up to OpenClinica to provide a correct URL (which we're now realizing should have a url-encoded filename). OpenClinica decides to make the filename part of the URL. Enketo doesn't require that. It could be any URL (but it makes sense to make the filename part of that URL).

I recommend you first spent 1 minute testing this in your cURL snippet (replacing # with %23 in the instance_attachments URL), before continuing this discussion.... Then at least we know we are on the right track!! (I don't have a properly crafted cURL snippet handy myself that reproduces the issue).

MartijnR commented 5 months ago

Oh @gushil, I think I finally see what you mean! That code has completely changed, and that media/get URL is actually created by Enketo! Is the file stored in the redis database correctly if the URL provided in the API request is wrong (which it still is)?

gushil commented 5 months ago

Hi @MartijnR

With this form (centro/storage/forms/comment_repeat.xml) <?xml version="1.0"?><h:html xmlns="http://www.w3.org/2002/xforms" xmlns:OpenClinica="http://openclinica.com/odm" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:jr="http://openrosa.org/javarosa" xmlns:oc="http://openclinica.org/xforms" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><h:head><h:title>Comment repeat</h:title><model odk:xforms-version="1.0.0"><instance><data id="comment_repeat" version="version-1"><group_yt6qb39 jr:template=""><comments/><comments_001/></group_yt6qb39><group_yt6qb39><comments/><comments_001/></group_yt6qb39><upload_image/><meta><instanceID/></meta></data></instance><bind nodeset="/data/group_yt6qb39/comments" oc:itemgroup="RG1" type="string"/><bind nodeset="/data/group_yt6qb39/comments_001" oc:itemgroup="RG1" type="string"/><bind nodeset="/data/upload_image" oc:itemgroup="RG2" type="binary"/><bind jr:preload="uid" nodeset="/data/meta/instanceID" readonly="true()" type="string"/></model></h:head><h:body><group ref="/data/group_yt6qb39"><label></label><repeat nodeset="/data/group_yt6qb39"><input ref="/data/group_yt6qb39/comments"><label>Comments 1:</label></input><input ref="/data/group_yt6qb39/comments_001"><label>Comments 2:</label></input></repeat></group><upload mediatype="image/*" ref="/data/upload_image"><label>Upload image</label></upload></h:body></h:html>

and this image url: image

I'm using this cURL snippet:

curl --user enketorules: -d "server_url=http://localhost:3000&form_id=comment_repeat&ecid=1&instance_id=a&instance=\
    <data xmlns:OpenClinica=\"http://openclinica.com/odm\" xmlns:enk=\"http://enketo.org/xforms\" xmlns:jr=\"http://openrosa.org/javarosa\" xmlns:oc=\"http://openclinica.org/xforms\" xmlns:orx=\"http://openrosa.org/xforms\" id=\"edit_comment_repeat\" version=\"test\">\
    <group_zt8fu31_002>\
        <Patient_Signature_002_001>ECG.png</Patient_Signature_002_001>\
    </group_zt8fu31_002>\
    <meta>\
        <instanceID>uuid:376611b3-1506-4c11-b21a-3a246f6574f2</instanceID>\
    </meta>\
    </data>&instance_attachments[ECG#1.png]=http://localhost:8000/ECG%231.png" http://localhost:8005/oc/api/v1/instance/edit/c

that returns "url": "http://localhost:8005/edit/fs/c/i/05244e567c9854acdd6a22819a3fa573?ecid=1&instance_id=a"

When loading the form edit url http://localhost:8005/edit/fs/c/i/05244e567c9854acdd6a22819a3fa573?ecid=1&instance_id=a

I got this:

image

gushil commented 4 months ago

@MartijnR

You are right. Let me look into media lib now.

Thanks for the suggestion.

gushil commented 4 months ago

Hi @MartijnR

After a couple of trial, testing and contemplating. I think this is the simplest solution that works and doesn't need a lot of changes.

These are the changes I made:

It doesn't need us to handle encoding of instanceAttachments because the filename in it is always URL-safe.

Please share your thought about it.

Thanks.

gushil commented 4 months ago

Hi @MartijnR

Thanks for your detailed answer.

Could you please help me what steps should I do to thouroghly make sure that all the cases are resolved? (I'm sorry for assumming that if all unit tests are passed, those mean OK)

Something like:

New records

Old records

Other case 1

Other case 2

Thanks!

MartijnR commented 4 months ago

I think something like these at least:

  1. open empty form with a regular file upload question and an annotate question (that URL is always the same you can request it for the first time with the /oc/api/v1/survey/collectAPI endpoint).
  2. upload files with a # in the filename into the form.
  3. check the submission to see that the filename of the file that is sent is the same as the filename in the XML node.
  4. load the record that was submitted using a curl API request to generate the temporary edit URL
  5. Make sure the frontend/page displays the previously uploaded images, and that the filenames are the ones that were submitted in the 3rd step.

With respect to loading old records that contain images with unsafe characters in their filenames, best to check with your colleagues. I think it would be great if your fix worked for them keeping the original filenames but since this issue has so far only been reported by OpenClinica and is not absolutely critical, it may be acceptable to not cover that case.

The instance and the instance_attachment URLs provided in the /instance API request are stored in the redis cache for just 30 seconds. They have the prefix in:. What I do is immediately after making the API request, I use keys in:* to find the key in redis-cli. There will normally just be one result when testing locally. You can then inspect the entry with e.g. hgetall in:b (if in:b was the key you found, i.e. b was the instanceID)). You could of course also change to a longer expiration temporarily during development (here).

I don't think there are any tests for this behavior unfortunately. If you end up tweaking the media library or the transformer, I am hoping there will be helpful tests that will fail and show you the intended behavior that ends up stripping the hash parts of the keys and URLs. If you end up tweaking the urlencoded app middleware, perhaps some helpful tests may flag issues too, but am not sure.

gushil commented 4 months ago

P.S. 1 If you decide to go back to fixing the various decoding issues in Enketo, I'm wondering if you concluded that passing a custom qs (decoded) function to urlencoded is not feasible? It seems like it might be good if any URL passed in the API is not decoded at all (this would not be a solution for media.js/transformer issues, but just for the redis URL storage issue)

@MartijnR

I read the documentation and the source code of urlencoded (https://github.com/expressjs/body-parser/blob/master/lib/types/urlencoded.js) and found no way to pass a custom qs function.

Anything I missed?

gushil commented 4 months ago

I think something like these at least:

  1. open empty form with a regular file upload question and an annotate question (that URL is always the same you can request it for the first time with the /oc/api/v1/survey/collectAPI endpoint).
  2. upload files with a # in the filename into the form.
  3. check the submission to see that the filename of the file that is sent is the same as the filename in the XML node.
  4. load the record that was submitted using a curl API request to generate the temporary edit URL
  5. Make sure the frontend/page displays the previously uploaded images, and that the filenames are the ones that were submitted in the 3rd step.

With respect to loading old records that contain images with unsafe characters in their filenames, best to check with your colleagues. I think it would be great if your fix worked for them keeping the original filenames but since this issue has so far only been reported by OpenClinica and is not absolutely critical, it may be acceptable to not cover that case.

The instance and the instance_attachment URLs provided in the /instance API request are stored in the redis cache for just 30 seconds. They have the prefix in:. What I do is immediately after making the API request, I use keys in:* to find the key in redis-cli. There will normally just be one result when testing locally. You can then inspect the entry with e.g. hgetall in:b (if in:b was the key you found, i.e. b was the instanceID)). You could of course also change to a longer expiration temporarily during development (here).

I don't think there are any tests for this behavior unfortunately. If you end up tweaking the media library or the transformer, I am hoping there will be helpful tests that will fail and show you the intended behavior that ends up stripping the hash parts of the keys and URLs. If you end up tweaking the urlencoded app middleware, perhaps some helpful tests may flag issues too, but am not sure.

I finally can understand and replicate what you mean to test the submission and editing form. I found that my solution still not fixing all the problems. Still working on it.

Thanks.

MartijnR commented 4 months ago

I read the documentation and the source code of urlencoded (https://github.com/expressjs/body-parser/blob/master/lib/types/urlencoded.js) and found no way to pass a custom qs function.

Anything I missed?

Oh, maybe I misread... I thought passing extended: true would allow you to do that. Sorry!

gushil commented 4 months ago

Hi @MartijnR

I've tried both ways:

  1. Keep the widget intact, but modifying instanceAttachments and media for form edit handling. It made a lot of broken tests and involve editing a lot of files.
  2. Modifying both widget and instanceAttachments form editing handling.

I choose the second way because it involve editing less files and all tests are passed.

Currently testing annotate widget. Meanwhile, could you please review it?

Thanks.

gushil commented 4 months ago

Still got this error on annotate widget, eventhough instanceAttacnhments and instance is correct.

image

gushil commented 4 months ago

Update. Annotate widget is working.

With this curl command for editing

curl --user enketorules: -d "server_url=http://localhost:3000&form_id=annotation_widget&ecid=11&instance_id=a1&instance=\
    <data xmlns:OpenClinica=\"http://openclinica.com/odm\" xmlns:jr=\"http://openrosa.org/javarosa\" xmlns:oc=\"http://openclinica.org/xforms\" xmlns:orx=\"http://openrosa.org/xforms\" id=\"annotation_widget\" version=\"1\" xmlns:enk=\"http://enketo.org/xforms\">\
    <image_annotation type=\"file\">annotation-20_43_14.png</image_annotation>\
    </data>&instance_attachments[annotation-20_43_14.png]=http://localhost:8000/annotation-20_43_14.png" http://localhost:8005/oc/api/v1/instance/edit/c

I got this

image

gushil commented 2 months ago

Closed this PR because new work has been created on PR #212