galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org
Other
1.39k stars 1k forks source link

uploading xml files with HTML content yields "The uploaded file contains invalid HTML content" #7121

Open fbergmann opened 5 years ago

fbergmann commented 5 years ago

Hello,

in trying to bring our software (COPASI, http://copasi.org) so we can use it from Galaxy I've encountered the issue, that uploading our supported file formats fails with:

The uploaded file contains invalid HTML content

these are either SBML files (http://sbml.org), or COPASI files. Each files that are pretty much XML, in which elements can be annotated using HTML notes.

I've tried to use the galaxy sbml type for SBML, for COPASI both the 'xml' and the 'cps' datatypes. I've noticed this does happen not just in my own local instance, but also on the usegalaxy.eu server.

This seems to happen in galaxy.datatypes.sniff:

https://docs.galaxyproject.org/en/latest/_modules/galaxy/datatypes/sniff.html

Any help would be appreciated

nsoranzo commented 5 years ago

@fbergmann If you can provide example SBML and COPASI files that would be helpful.

bgruening commented 5 years ago

@fbergmann does this mean that if you upload files and select sbml that you get this error as well?

It seems we do not have an example file in our repo. Can you provide us with a tiny valid example file, please?

fbergmann commented 5 years ago

Hello, of course, i should have attached some in the first place: in our repository under:

https://github.com/copasi/COPASI/tree/develop/TestSuite/distribution

are test files (brusselator-model.xml SBML, brusselator.cps COPASI) that i would have expected to load without issues.

@bgruening and yes, i manually selected cps and sbml and got this error.

nsoranzo commented 5 years ago

For both files the issue seems to be the line <meta name="qrichtext" content="1"/> which is flagged by https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/util/checkers.py#L36

fbergmann commented 5 years ago

for both formats, that tag (as well as the href which would have flagged) are valid.

nsoranzo commented 5 years ago

@fbergmann Sure, at the moment I'm just investigating where this error is coming from and documenting it here.

The original code for check_html() is from commit 3f5da01132f5ed507e2f8197a5e761087ec0f6c6 , it may be time for a rethink.

fbergmann commented 5 years ago

I've upgraded to the latest dev version, and I am still having this issue.

fbergmann commented 5 years ago

Ideally, i would have an option to declare that certain data types would not need to be checked, for now I've modified my local instance to not perform this test on contents for my XML dialects, but that seems like a hack:

--- a/lib/galaxy/util/checkers.py
+++ b/lib/galaxy/util/checkers.py
@@ -30,6 +30,7 @@ def check_html(file_path, chunk=None):
         temp = chunk.splitlines()
     else:
         temp = chunk
+
     regexp1 = re.compile(r"<A\s+[^>]*HREF[^>]+>", re.I)
     regexp2 = re.compile(r"<IFRAME[^>]*>", re.I)
     regexp3 = re.compile(r"<FRAMESET[^>]*>", re.I)
@@ -41,6 +42,12 @@ def check_html(file_path, chunk=None):
     for line in temp:
         line = util.unicodify(line)
         lineno += 1
+
+        if '<COPASI xmlns=' in line or '<sbml ' in line or '<sbml:sbml ' in line:
+          if chunk is None:
+            temp.close()
+          return False
+
         matches = regexp1.search(line) or regexp2.search(line) or regexp3.search(line) or regexp4.search(line) or regexp5.search(line)
         if matches:
             if chunk is None:
fbergmann commented 4 years ago

@bgruening as per our discussion today, i just ping you to tell you that this is still an issue today

bgruening commented 4 years ago

@fbergmann Danke vielmals!

bgruening commented 4 years ago

@fbergmann I'm not sure your approach will work. It means that if someone wants to upload html/malicious code just need to prefix the file with '<sbml:sbml ', isn't it?

Do you know if meta name="qrichtext" content="1"/> is always the same, or can it vary? I guess if that is the only exception we could work around it like I tried it here: https://github.com/galaxyproject/galaxy/pull/10144

With that patch all examples from https://github.com/copasi/COPASI/blob/develop/TestSuite/distribution/ do work, but one.

https://github.com/copasi/COPASI/blob/develop/TestSuite/distribution/Genetic-2000Elo.xml#L34 is actually failing because it has an embedded link.

How common is that to paste arbitrary links/html in SBML files? Those are not DOIs? What about allowing compressed SBML files, like foo.xml.gz? Or is viewing and exposing the content to the browser a use-case we need to consider?

bgruening commented 4 years ago

@natefoo any hints from your side?

The problem here seem to be XML files that can embed HTML elements (which we block).