azkaban / azkaban-plugins

Plugins for Azkaban.
https://azkaban.github.io
Apache License 2.0
130 stars 178 forks source link

Add additional documentation for HtmlViewer plugin #288

Closed qil026 closed 6 years ago

qil026 commented 6 years ago

In HtmlFileViewer.java, added documentation to describe the intended usage and support for this plugin, alerted users that this is simply an experimental feature, not intended for all types of html files.

In hdfs-file-js.vm template file, added documentation to describe why iframe content has to be injected in that particular way.

HappyRay commented 6 years ago

Thanks.

HappyRay commented 6 years ago

Is your master branch update to date? It appears that some of the commits have been merged.

HappyRay commented 6 years ago

to introduce a different behavior means there would be additional if/else logic in AZ code which reduces the elegance and readability of the code.

Some refactoring may be needed. Sure the new code will have this new feature and some additional complexity may be unavoidable. I am not convinced that the result will be less readable necessarily.

2nd, if we get both the web skeleton and html content together (and put html content in srcdoc like you said), for large html files, user would see a blank page for a few seconds before he could see the entire page, and they may think the server has dead while waiting. Following the current AZ framework is more user friendly since it shows the skeleton page and a progress bar to indicate data is being downloaded, so user know the server is responsive and nothing is broken.

Do you have performance measurement to support this claim? I am not sure it is better for the existing code to rely on AJAX either.

3rd, it's not trivial to simply put content into the "srcdoc" attribute. If the html content is this:

Hello " World

, then putting it inside srcdoc will result in error: <iframe srcdoc="

Hello " World

". The quote in the h3 field will mess things up

I would expect an encoding library exists to make such task easy.

and allow sql injection. We can certainly escape such chars, but there's no guarantee that we can do it right or even cover everything we need.

Could you elaborate how this can result in SQL injection?

Overall I think fetching with an async call and write it into the iframe is the better option.

I am not convinced.

qil026 commented 6 years ago

more complexity is somewhat equivalent to less readable, at least that how I think.. I am less concerned about this actually. My major concern is the site speed.

You can open up any large avro file on azkaban today, you will see the progress bar showing for 10+ seconds before the content shows up. Similar thing for large html files. My current approach will show you that progress bar. If we switch to writing content into srcdoc directly in the original request without doing an ajax call, then the page will be completely blank for 10+ seconds (as the browser downloads data) before user can see anything.

I understand that writing into srcdoc can be beneficial such that we can completely sandbox the iframe without opening any access at all. I'm open to changing the logic to writing html source into srcdoc instead of writing it to the iframe's dom. Though I still think that fetching the source through a separate ajax call is better both user experience wise and for easier code maintenance.

What do you think?

HappyRay commented 6 years ago

Displaying large files this way is problematic. I would like to see different approaches such as limiting the amount of content that can be viewed this way or paging.

For your HTML use case, I doubt the file will be that big. It can be, but I am not sure if that is worth optimizing for and we should limit the server's expose to such risks. Plus the current design only supports basic HTML content. There is no need to transform the content. Displaying Avro files does.

I can understand if you don't want to take on this task.

I don't know if the alternative will work or not. Earlier you said it didn't work. A prototype with the alternative approach will be nice. I don't feel the design space has been properly explored enough.

qil026 commented 6 years ago

There is already a limit on the amount of content to be downloaded, currently set to 10MB I believe. What advantage would it bring by fetching everything in one shot instead of using current framework?

I can help take on this task, though i'm not convinced that this is advantageous or worth the extra work. If we agree on a solution, then yeah I will definitely revise it.

Let me experiment with the srcdoc attr, I've tried it locally and it seems working fine, but still need verification on holdem ei to see how it holds up.

HappyRay commented 6 years ago

Thanks for confirming that srcdoc attr worked. I don't expect the result to be different when connecting to HDFS.

Given that result, should your comment be updated?

Have you figured out how encoding/escaping works?

qil026 commented 6 years ago

After some testing, it looks like we don't have to worry about encoding/escaping, by writing the content into srcdoc directly using javascript, they are automatically "escaped". I've tried various test cases to break it, none worked, so currently it is pretty robust. On the other hand, I do find that writing content into srcdoc is functionally the same as writing content into iframe's dom structure directly.

Originally I thought writing into srcdoc can help remove the "allow-same-origin" sandbox exception policy, but after testing it out, I realized that we may still need it. The thing is, writing content into that attribute is fine. But afterwards, we need to resize the iframe to fit the content inside, this requires us to peek into the iframe's dom and figure out how tall and wide it is, without "allow-same-origin" policy, we can't do this. The next option is to make iframe fixed width and height with scroll bar in case of overflow. Width is easy, but height is a bit tricky as we don't know how high is appropriate for the content. Overall I think it might still make sense to include the "allow-same-origin" policy so we can resize the iframe, but then writing doc into srcdoc would be no different from current implementation. Ideas?

HappyRay commented 6 years ago

Thanks.

What I suggested was to compare server generated iframe content vs js injected content. Earlier you said server generated content didn't work. That's why I suggested trying the srcdoc attribute.

qil026 commented 6 years ago

I see, in that case I will leave this unchanged and update the description, please let me know if you still prefer to generate everything by the server without the async data fetch. Thanks!

HappyRay commented 6 years ago

We are thinking about rewriting the HDFS viewer. Given that, I am not going to insist that you explore better designs if you think the current is good enough.

Could you finish your documentation changes and ask @chengren311 to review again? Thanks