apache / incubator-answer

A Q&A platform software for teams at any scales. Whether it's a community forum, help center, or knowledge management platform, you can always count on Apache Answer.
https://answer.apache.org
Apache License 2.0
13k stars 1.01k forks source link

Uploaded images become inaccessible after the domain was changed. #861

Open Octobug opened 8 months ago

Octobug commented 8 months ago

Describe the bug

Uploaded images become inaccessible after the domain was changed. I only tested it in development env.

To Reproduce

Steps to reproduce the behavior:

  1. Under domain.old, you uploaded your custom avatar.
  2. You updated domain.old to domain.new, and the uploaded avatar became inaccessible.
    • Domain changes are very common.

Expected behavior

Uploaded files should serve well no matter what the domain is.

Screenshots

image image

Platform

LinkinStars commented 8 months ago

Thanks for the feedback. It seems that the storing avatars relies on absolute paths. Modifying it to a new domain could potentially introduce new issues. Saving them as relative paths might be a better option. 🤔

Octobug commented 8 months ago

Thanks for the feedback. It seems that the storing avatars relies on absolute paths. Modifying it to a new domain could potentially introduce new issues. Saving them as relative paths might be a better option. 🤔

@LinkinStars Yeah, uploaded files should be stored as relative paths. And, for those already deployed in production, the fixing should implement a mechanism to automatically update existing data.

Octobug commented 8 months ago

@LinkinStars I found that all images in a question, answer or avatar have this issue. I've changed the issue title.

Octobug commented 8 months ago

I would like to solve this issue once the solution is settled. 👀

calmdev commented 8 months ago

Fwiw, I ran into this issue with logo and favicon as well under the branding settings when originally testing how to go about setting up answer.

LinkinStars commented 8 months ago

Thanks for the feedback. It seems that the storing avatars relies on absolute paths. Modifying it to a new domain could potentially introduce new issues. Saving them as relative paths might be a better option. 🤔

@LinkinStars Yeah, uploaded files should be stored as relative paths. And, for those already deployed in production, the fixing should implement a mechanism to automatically update existing data.

  • This could be done by an independent command line tool. or
  • During the installation. or
  • In the backend code, auto rewrite those stored paths when fetching them.

@Octobug Your suggestion is great. We have carefully discussed this issue for some time and believe that providing a command-line solution would be better.

For example, we can offer a functionality like this:

./answer fix image from A to B

This command can modify the domain name in existing data.

The reasons for this are as follows:

  1. Existing users may not prefer using relative paths to access images.
  2. Answer supports storage plugins, allowing users to upload images to other locations, such like S3 or OSS. In such cases, the saved image may have a different absolute domain path, such as https://s3.east-1.xxx.com/. Of course, this is just a possibility. By providing this command-line tool, we can assist users in making the switch when they want to change their storage and update this part of the address.

Additionally, there is another requirement. We believe it would be helpful to provide an environment variable, ABSOLUTE_IMAGE_PATH, which users can set to true. In this way, Answer will default to using absolute paths for storing images. This consideration is made for existing users who may copy the content to other places where the images won't display if relative paths are used.

Once again, thank you very much for suggesting this feature because we believe it could pose a little big challenge. What do you think? If there are any unclear points, we can continue the discussion.

Octobug commented 8 months ago

@LinkinStars Thank you for your detailed response!

⚠️ I haven't read much of the relevant code yet so I might miss some background knowledge of Answer. I'll make a summary to see if I have understood correctly.

I am also in favor of building this functionality as a ./answer command option.

And let's split a URL into 2 major parts for the convenience in describing:

I think the ABSOLUTE_IMAGE_PATH could be renamed to IMAGE_PATH_PREFIX (if there will be other types of files in the future, maybe it could be UPLOADS_PATH_PREFIX).

There are 2 major types of uploaded files:

  1. For any external storage provided by 3rd-party services (S3, OSS, ...), files are stored as PROTOCOL://HOSTNAME:PORT + PATH in DB with no doubt.
  2. For files stored locally, application codes can directly use IMAGE_PATH_PREFIX + RELATIVE_PATH, only RELATIVE_PATH is stored in DB. This usage of IMAGE_PATH_PREFIX can be delivered by documentation:
    • 2.1. By default, the value of IMAGE_PATH_PREFIX is /
    • 2.2. For files need to be moved to a new directory, the value of IMAGE_PATH_PREFIX is BASE_PATH (e.g. new/dirname/)
    • 2.3. For files need to be moved to another self-hosted place different from the Answer instance, the value of IMAGE_PATH_PREFIX is PROTOCOL://HOSTNAME:PORT + BASE_PATH (e.g. https://statics.self-hosted.com/uploads/). Update: Uploading would be a problem. Better not to support it.

Pseudocode:

// When fetching existing data to rewrite:
getImageURL(image) {
    if (isLocalFile(image)) { // an image has a property to be distinguished. e.g. !startsWith('http')
        return IMAGE_PATH_PREFIX + image.path;
    } else {
        return image.path;
    }
}

As for the ./answer fix image from a to b command, I think the from a and to b args are not needed. Because what are stored in DB are always RELATIVE_PATHs.

Update: for files stored in external places, these args are needed.

- Or, are the two args used for auto-migrating uploaded files (mv SRC DST)? - The description above assumes that an admin move uploaded files manually.

Update:

Octobug commented 8 months ago

@LinkinStars

This task seems heavy for a Go & Answer beginner (exactly me 🤣). If the coding job is real complicated under your estimation, I think it's better be done by your team.

LinkinStars commented 8 months ago

@Octobug

I think the ABSOLUTE_IMAGE_PATH could be renamed to IMAGE_PATH_PREFIX (if there will be other types of files in the future, maybe it could be UPLOADS_PATH_PREFIX).

I agree. IMAGE_PATH_PREFIX would be better.

Update:

IMAGE_PATH_PREFIX is used to fix the existing problem: After this problem is fixed, domain changes don't need any extra operations. ./answer migrate image from a to b is used to provide a migration feature: When you change external storage type, migration is inevitable.

That's a good summary. You got it.

This task seems heavy for a Go & Answer beginner (exactly me 🤣). If the coding job is real complicated under your estimation, I think it's better be done by your team.

Take it easy. You can try your best, and we will help you solve any problems together. 💪

Octobug commented 8 months ago

Take it easy. You can try your best, and we will help you solve any problems together. 💪

@LinkinStars Okay then, I'll give it another try.

Octobug commented 8 months ago

@LinkinStars

I have another thought on external storage type...

If the uploads dataset is huge, the migration could be very time-consuming because it needs to query & filter out all urls of avatars and urls in texts of questions & answers, and then write them back to DB.

What if Answer saves all types of uploads with relative paths with a prefix marker? Like: :NUM:/uploads/post/57pxxvykQFj.jpeg

:NUM: is a storage url prefix version number. And the version number maps to a URL prefix string, which is used to replace the :NUM: marker in a relative path stored in DB.

The default value of NUM is 0 mapping to "/", it deals with local uploads type. When an Answer admin decides to move the uploads directory to another path, the admin just needs to change 0 to map to "/new/path" after files are settled down.

If the admin adds an external storage, since there is a visit_url_prefix config field for every storage plugin, Answer knows if it is a new prefix. If it is, it increases NUM, and set it as active so that any subsequent uploads use the new active NUM.

Let's say there is a new type of site_info called uploads, it looks like:

{
    "url_prefix_versions": {
        "0": {
            // ":0:/uploads/post/57pxxvykQFj.jpeg"
            "active": true,
            "value": "/"
        },
        "1": {
            // ":1:/uploads/post/57pxxvykQFj.jpeg"
            "active": false,
            "value": "https://s3.east-1.xxx.com/"
        },
        "2": {
            // ":2:/uploads/post/57pxxvykQFj.jpeg"
            "active": true,
            "value": "https://abc.oss-cn-hangzhou.aliyuncs.com/"
        },
        ...
    },

    // other properties...
}

With this solution, those relative paths hardly need to be changed.

The basic migrate operation process is:

  1. The admin copies their uploads to another place.
  2. Then the admin updates the corresponding NUM's mapping.
  3. The admin now can delete uploads in the old place.

And still, ./answer migrate uploads from_a to_b could be treated as the last resort to rewrite paths in DB. And if the version number solution is adopted, this migrate functionality is better to support both command line and the admin UI. It can provide migration status like progress, affected number of urls, etc.

However, migration of large dataset may not be common. The consideration here is whether it is worth implementing such a feature to address a rare scenario.

If this idea is rejected, I will try to implement the previously discussed solution.

LinkinStars commented 8 months ago

@Octobug First of all, thank you very much for presenting the new proposal. We have had a detailed discussion regarding your proposal. It is certain that using external storage for migration would make it somewhat easier.

However, we still have some considerations:

  1. For regular users, this additional configuration would require extra understanding, thus increasing the learning curve for them.
  2. There would be an additional number in the path, which might appear slightly odd to users.
  3. As you also mentioned, this migration is a very infrequent operation and might only be performed once. This point convinced us.

Therefore, based on the reasons stated above, we would prefer to keep the process simple, and thus we suggest sticking with the previous solution would be better.

Octobug commented 8 months ago

@LinkinStars Okay, it was just a thought. 👀

I would like to confirm some details:

The env var

Does this var need to be supported both in Config File and Environment Variable?

It actually affects all kinds of uploads, so I think upload is more proper. And it may include PROTOCOL://HOSTNAME:PORT, so UPLOAD_URL_PREFIX is better than UPLOAD_PATH_PREFIX.

The command

./answer fix upload PREFIX_SRC PREFIX_DST
LinkinStars commented 8 months ago

@LinkinStars Okay, it was just a thought. 👀

I would like to confirm some details:

The env var

Does this var need to be supported both in Config File and Environment Variable?

  • In config file: service_config.upload_url_prefix

    • Used when site is running.
  • Environment Variable: UPLOAD_URL_PREFIX

    • Used for installation
  • Default value: "/"

It actually affects all kinds of uploads, so I think upload is more proper. And it may include PROTOCOL://HOSTNAME:PORT, so UPLOAD_URL_PREFIX is better than UPLOAD_PATH_PREFIX.

The command

./answer fix upload PREFIX_SRC PREFIX_DST

I can't agree more. 👍 Yes, supported both in config file and env would be better. So UPLOAD_URL_PREFIX is better than UPLOAD_PATH_PREFIX.

Octobug commented 8 months ago

Additionally, there is another requirement. We believe it would be helpful to provide an environment variable, ABSOLUTE_IMAGE_PATH, which users can set to true. In this way, Answer will default to using absolute paths for storing images. This consideration is made for existing users who may copy the content to other places where the images won't display if relative paths are used.

@LinkinStars I made some attempts on the var part. The consideration you mentioned before (I made it bold & italic) seemd hard to achieve: The var (UPLOAD_URL_PREFIX) is closely associated with service_config.upload_path.

If using a symlink pointing to the locally moved uploaded files is acceptable (tell admins to do so in the documentation), the var seems to be unnecessary at all. Answer is able to distinguish these two types of uploads without an env var:

LinkinStars commented 8 months ago

@Octobug 🤔 I got it. So, for now, we can consider only three scenarios:

This consideration is made for existing users who may copy the content to other places where the images won't display if relative paths are used. That would simplify the problem. What do you think?

Octobug commented 8 months ago

@Octobug 🤔 I got it. So, for now, we can consider only three scenarios:

  1. Using relative paths: UPLOAD_URL_PREFIX defaults to "/"
  2. Using absolute paths with the website domain: UPLOAD_URL_PREFIX is "http://answer.cn:8080/"
  3. Using absolute paths with a third-party service: UPLOAD_URL_PREFIX is "https://s3.east-1.xxx.com/" (Of course, this scene is already handled by the plugin. Just an example.)

~This consideration is made for existing users who may copy the content to other places where the images won't display if relative paths are used.~ That would simplify the problem. What do you think?

@LinkinStars

Right, there are three scenarios for now. But scenario 2. seems weird 🤔️. What benefits does storing absolute URLs bring (we don't have disagreement on 3.)? The only benefit I can think of is that admins can use a reverse proxy server (like Nginx) to route the requests of these absolute URLs to where those uploaded files are (maybe another host different from Answer's host).

LinkinStars commented 8 months ago

@Octobug 🤔 I got it. So, for now, we can consider only three scenarios:

  1. Using relative paths: UPLOAD_URL_PREFIX defaults to "/"
  2. Using absolute paths with the website domain: UPLOAD_URL_PREFIX is "http://answer.cn:8080/"
  3. Using absolute paths with a third-party service: UPLOAD_URL_PREFIX is "https://s3.east-1.xxx.com/" (Of course, this scene is already handled by the plugin. Just an example.)

~This consideration is made for existing users who may copy the content to other places where the images won't display if relative paths are used.~ That would simplify the problem. What do you think?

@LinkinStars

Right, there are three scenarios for now. But scenario 2. seems weird 🤔️. What benefits does storing absolute URLs bring (we don't have disagreement on 3.)? The only benefit I can think of is that admins can use a reverse proxy server (like Nginx) to route the requests of these absolute URLs to where those uploaded files are (maybe another host different from Answer's host).

@Octobug 👍 What you said is absolutely correct. We are indeed concerned about the existence of such users. As far as I know, there are users who deploy their frontend and backend separately, and as you mentioned, images are accessed through nginx proxy. Of course, this is only do for compatibility reasons. This is also why it is treated as an environment variable rather than a configuration item placed in the admin page.

Octobug commented 8 months ago

@LinkinStars Okay, thank you for your patience! In this case, it seems that your first solution is better. :D That is, make the var a boolean. Although the name may need to be changed from ABSOLUTE_IMAGE_PATH to:

LinkinStars commented 8 months ago
  • UPLOAD_USE_ABSOLUTE (I prefer this one)

@Octobug 😄 Maybe we can use UPLOAD_USE_ABSOLUTE_URL. Thanks for your patience as well. This whole discussion has made the whole issue more clear.

Octobug commented 8 months ago
  • UPLOAD_USE_ABSOLUTE (I prefer this one)

@Octobug 😄 Maybe we can use UPLOAD_USE_ABSOLUTE_URL. Thanks for your patience as well. This whole discussion has made the whole issue more clear.

@LinkinStars Aha, okay. I'll submit a PR a few days later together with the fix command.

Octobug commented 6 months ago

@LinkinStars Here's my version of the fix command, I would like to know your opinion. 👀

answer fix upload SRC_PREFIX DST_PREFIX [--continue] [--offline|online] [--dry-run] [--limit NUM]

When the fix command starts, it checks if there is a tmp file: .fix_upload_interrupted.

  1. If there isn't, it generates one.
  2. If there is, it tells the user that the previous fix was interrupted, and the user needs to remove the file if they are to start a fresh fix, or uses the --continue switch to continue.
  3. When there isn't any data left that matches the SRC_PREFIX, it removes .fix_upload_interrupted then exit.

Switches:


Updated at 2024-07-17 22:40

It makes little sense to lock a data row no matter it's an online or offline fix. Because you cannot make sure that there aren't any web UI editors opened with old data loaded in it. Even after an offline fix is finished, there might still be a user going to submit the old version data.

So, neither way is perfect. This corner case can only be addressed by the documentation work I guess.

LinkinStars commented 6 months ago

@Octobug Very thoughtful consideration. I have only one suggestion. I recommend that this command be run offline, but it's not mandatory. It would be advisable to include instructions in the documentation, suggesting users to run it offline. Otherwise, it would require handling various unexpected concurrency situations, such as table locking, which would make the implementation more complex. 🤔️

Octobug commented 6 months ago

@Octobug Very thoughtful consideration. I have only one suggestion. I recommend that this command be run offline, but it's not mandatory. It would be advisable to include instructions in the documentation, suggesting users to run it offline. Otherwise, it would require handling various unexpected concurrency situations, such as table locking, which would make the implementation more complex. 🤔️

Okay. I have the feeling that I will give up the --online switch. :D