Closed jmchilton closed 5 years ago
The problem is that we do the newline / space-to-tab conversion for text datatypes, but that fails with non-UTF-8 characters. So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail. Automatic recoding isn't perfect either. Any ideas here, @nsoranzo?
Given the timing I assumed it was a new issue, sorry for not searching. Hmm...
I think this is the file and I think it is readily reproducible on test.galaxyproject.org.
> So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail.
If we want to reject files that contain non-UTF-8 data, that is fine, but we should do that explicitly and not as part of newline conversion, right? I'm not sure how to figure out where the newlines are in rb mode though right... I feel like we should just give up on convert_newlines if there is a UnicodeDecodeError.
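To make the "just give up on convert_newlines" idea concrete, here is a minimal sketch, assuming Python 3. The helper name `convert_newlines_or_skip` is hypothetical and is not Galaxy's actual `convert_newlines`; it only illustrates leaving the file untouched when decoding fails, so that rejecting non-UTF-8 data stays a separate, explicit decision.

```python
import os
import tempfile


def convert_newlines_or_skip(path):
    """Hypothetical helper: normalize newlines, or leave the file alone.

    Returns True if the file was rewritten with LF newlines, and False if
    it could not be decoded as UTF-8 (the file is then left unchanged).
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        # newline=None on read turns CR and CRLF into LF (universal newlines);
        # newline="\n" on write keeps LF as-is on every platform.
        with open(path, "r", encoding="utf-8", newline=None) as src, \
                os.fdopen(tmp_fd, "w", encoding="utf-8", newline="\n") as dst:
            for line in src:
                dst.write(line)
    except UnicodeDecodeError:
        # Give up on the conversion instead of failing the whole upload.
        os.remove(tmp_path)
        return False
    os.replace(tmp_path, path)
    return True
```

Writing to a temporary file first means a decode error halfway through never leaves a half-converted file behind.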
> I'm not sure how to figure out where the newlines are in rb mode though right.
There may be some caveats I'm not thinking about, but this seems to work (this file is in ISO-8859-1, which raises the same error):
wget https://raw.githubusercontent.com/galaxyproteomics/tools-galaxyp/master/tools/cardinal/test-data/Example_Processed.imzML
unix2dos Example_Processed.imzML
and then
with open('Example_Processed.imzML', mode='rb') as src, open('test', 'wb') as fp:
    for line in src:
        # Strip any trailing CR/LF bytes and write a single LF instead.
        fp.write(b"%s\n" % line.rstrip(b"\r\n"))
seems to work and convert the newlines properly
It'll still not be UTF-8, but I agree with you that failing the newline conversion shouldn't be the way we reject non-UTF-8 files.
Looks like this might be causing the following,
https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-492653885
======================================================================
FAIL: ( venn_list ) > Test-1
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 99, in test_tool
    self.do_it(tool_version=tool_version, test_index=test_index)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 36, in do_it
    verify_tool(tool_id, self.galaxy_interactor, resource_parameters=resource_parameters, test_index=test_index, tool_version=tool_version, register_job_data=register_job_data)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 779, in verify_tool
    raise e
JobOutputsError: 'utf8' codec can't decode byte 0xac in position 10: invalid start byte
In this case the first line of the PDF generated by matplotlib in the test is likely causing this, making me suspect the sniffer code:
$ hexdump -C example.pdf | head -n 1
00000000 25 50 44 46 2d 31 2e 34 0a 25 ac dc 20 ab ba 0a |%PDF-1.4.%.. ...|
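For reference, the 16 bytes from the hexdump above are enough to reproduce the decode error outside Galaxy. This is a small Python 3 sketch; the variable names are only illustrative.

```python
# The first 16 bytes of the PDF, copied from the hexdump above.
header = bytes.fromhex("255044462d312e340a25acdc20abba0a")

try:
    header.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Offset, offending byte, and reason reported by the codec.
print(error.start, hex(header[error.start]), error.reason)
# prints: 10 0xac invalid start byte
```

Offset 10 is the first byte after the `%` that opens the second line, which matches the position reported in the JobOutputsError.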
That's another good point: we should probably do https://github.com/galaxyproject/galaxy/issues/7957#issuecomment-492377768 and not worry about whether this is a valid downstream format. I can PR this.
@mvdbeek I was working on https://github.com/jmchilton/galaxy/commit/f8fb89006b3471d457c89e41f4503f449d33e90e - I think it will conflict badly. Any chance you can start from there? I understand if you'd rather just fix this bug though.
Or I can try to pour some more time into this this afternoon and see if I can integrate your suggestion. I just don't understand how that works, but I believe you that it does.
If you can give it a try that'd be great.
I can't get rb mode to recognize carriage return as a newline at all.
>>> from six import BytesIO
>>> BytesIO("1 2\r3 4").readline()
'1 2\r3 4'
>>> for line in BytesIO("1 2\r3 4"):
...     print line.rstrip(b"\r\n")
...
3 4
I ran the unit tests on your proposed solution above and my memory-limited variant and the same thing happened both times.
We could try reading the file byte by byte instead?
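One possible middle ground between line iteration and going byte by byte is to read fixed-size chunks and replace the newline bytes directly. The following is a rough Python 3 sketch, not what either linked commit does; `convert_newlines_binary` and its `chunk_size` parameter are hypothetical names. It assumes an ASCII-compatible encoding (UTF-8, ISO-8859-1 and similar) and data that is meant to be text; it would be wrong for UTF-16 and would corrupt genuinely binary files.

```python
import io


def convert_newlines_binary(src, dst, chunk_size=65536):
    """Copy src to dst, rewriting CRLF and bare CR as LF, without decoding."""
    pending_cr = False  # True if the previous chunk ended with a CR
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if pending_cr and chunk.startswith(b"\n"):
            # The CR that ended the previous chunk was the first half of a
            # CRLF pair and has already been written as LF; drop this LF.
            chunk = chunk[1:]
        pending_cr = chunk.endswith(b"\r")
        dst.write(chunk.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))


# Usage: the bare CR that readline() ignored above is converted too.
src = io.BytesIO(b"1 2\r3 4\r\n5 6\n")
dst = io.BytesIO()
convert_newlines_binary(src, dst)
print(dst.getvalue())  # b'1 2\n3 4\n5 6\n'
```

The `pending_cr` flag is what keeps a CRLF pair that straddles two chunks from turning into two newlines.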
https://github.com/jmchilton/galaxy/commit/778b1c293c51197a2db5c5c1c31bc37b43c7a340 fails the unit tests 😢
This should have been fixed on the dev branch, right? I'm still seeing the problem with a matplotlib-generated PDF file: https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-493954485
Should I upload a small PDF file as a test case?
Thanks @peterjc, this is because the test interactor assumes UTF-8 encoded text data when using contains. https://github.com/galaxyproject/galaxy/pull/8010 is going to fix that.
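As an illustration of the general idea only (this is not necessarily how the linked PR implements it), a contains-style check can compare bytes instead of decoding the whole output. The function name `output_contains` is hypothetical.

```python
def output_contains(output, expected):
    """Hypothetical check: is `expected` present in the raw output bytes?"""
    if isinstance(expected, str):
        # Encode the expected text rather than decoding the output, so
        # non-UTF-8 bytes elsewhere in the output cannot raise an error.
        expected = expected.encode("utf-8")
    return expected in output


pdf_start = b"%PDF-1.4\n%\xac\xdc \xab\xba\n"
print(output_contains(pdf_start, "%PDF-1.4"))  # True
```

Because only the expected string is encoded, the 0xac byte in the PDF header never goes through the UTF-8 codec.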
From internal bug list: