galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org
Other
1.41k stars 1.01k forks source link

sniffer problem in 19.05? #7957

Closed jmchilton closed 5 years ago

jmchilton commented 5 years ago

From internal bug list:

> Attempt to upload GTF from pseudomonas.com (genome resource) failed with
> the error above. Not sure how to proceed:
> Fatal error: Exit code 1 ()
> Traceback (most recent call last):
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 326, in <module>
>     __main__()
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 319, in __main__
>     metadata.append(add_file(dataset, registry, output_path))
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 128, in add_file
>     convert_spaces_to_tabs=dataset.space_to_tab,
>   File "/cvmfs/
> main.galaxyproject.org/galaxy/lib/galaxy/datatypes/upload_util.py", line
> 54, in handle_upload
>     convert_spaces_to_tabs=convert_spaces_to_tabs,
>   File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/sniff.py",
> line 770, in handle_uploaded_dataset_file_internal
>     line_count, _converted_path = convert_fxn(converted_path,
> in_place=in_place, tmp_dir=tmp_dir, tmp_prefix=tmp_prefix)
>   File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/sniff.py",
> line 122, in convert_newlines
>     for i, line in enumerate(io.open(fname, mode="U", encoding='utf-8')):
>   File "/cvmfs/main.galaxyproject.org/venv/lib/python2.7/codecs.py", line
> 314, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 690:
> invalid start byte
mvdbeek commented 5 years ago

The problem is that we do the newline / space to tab conversion for text datatypes, but that fails with non-utf8 characters. So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail. Automatic recoding isn't perfect either. Any ideas here @nsoranzo ?

mvdbeek commented 5 years ago

xref https://github.com/galaxyproject/galaxy/issues/5250 / https://github.com/galaxyproject/galaxy/issues/7412#issuecomment-469631452

jmchilton commented 5 years ago

Given the timing I assumed it was a new issue, sorry for not searching. Hmm...

jmchilton commented 5 years ago

I think this is the file and I think it is readily reproducible on test.galaxyproject.org.

http://www.pseudomonas.com/downloads/pseudomonas/pgd_r_17_2/Pseudomonas_aeruginosa_PAO1_107/Pseudomonas_aeruginosa_PAO1_107_genes_and_intergenic.gtf

jmchilton commented 5 years ago

So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail.

If we want to reject files that contain non-UTF encoding that is fine but we should do that explicitly and not as part of newline conversion right? I'm not sure how to figure out where the newlines are in rb mode though right... I feel like we should just give up on convert_newlines if there is a UnicodeDecodeError error.

mvdbeek commented 5 years ago

I'm not sure how to figure out where the newlines are in rb mode though right.

There may be some caveats I'm not thinking about, but this seems to work (this file is in ISO-8859-1, which raises the same error):

wget https://raw.githubusercontent.com/galaxyproteomics/tools-galaxyp/master/tools/cardinal/test-data/Example_Processed.imzML
unix2dos Example_Processed.imzML

and then

with open('test', 'wb') as fp:
    for i, line in enumerate(open('Example_Processed.imzML', mode='rb')):
        fp.write(b"%s\n" % line.rstrip(b"\r\n"))

seems to work and convert the newlines properly

mvdbeek commented 5 years ago

It'll still not be utf-8, but I agree with you that failing the newline conversion shouldn't be the way we reject non utf8 files.

peterjc commented 5 years ago

Looks like this might be causing the following,

https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-492653885

======================================================================
FAIL: ( venn_list ) > Test-1
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 99, in test_tool
    self.do_it(tool_version=tool_version, test_index=test_index)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 36, in do_it
    verify_tool(tool_id, self.galaxy_interactor, resource_parameters=resource_parameters, test_index=test_index, tool_version=tool_version, register_job_data=register_job_data)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 779, in verify_tool
    raise e
JobOutputsError: 'utf8' codec can't decode byte 0xac in position 10: invalid start byte

In this case the first line of the PDF generated by matplotlib in the test is likely causing this, making me suspect the sniffer code:

$ hexdump -C example.pdf | head -n 1
00000000  25 50 44 46 2d 31 2e 34  0a 25 ac dc 20 ab ba 0a  |%PDF-1.4.%.. ...|
mvdbeek commented 5 years ago

That's another good point that we should probably do https://github.com/galaxyproject/galaxy/issues/7957#issuecomment-492377768 and not worry aobut whether this is a valid downstream format, I can PR this

jmchilton commented 5 years ago

@mvdbeek I was working on https://github.com/jmchilton/galaxy/commit/f8fb89006b3471d457c89e41f4503f449d33e90e - I think it will conflict badly. Any chance you can start from there, I understand if you'd rather just fix this bug though.

jmchilton commented 5 years ago

Or I can try to pour some more time into this this afternoon and see if I can integrate your suggestion. I just don't understand how that works, but I believe you that it does.

mvdbeek commented 5 years ago

If you can give it a try that'd be great.

jmchilton commented 5 years ago

I can't get rb mode to recognize carriage return as a newline at all.

>>> from six import BytesIO
>>> BytesIO("1 2\r3 4").readline()
'1 2\r3 4'
>>> for line in BytesIO("1 2\r3 4"):
...     print line.rstrip(b"\r\n")
...
3 4

I ran the unit tests on your proposed solution above and my memory-limited variant and the same thing happened both times.

We could try reading the file byte by byte instead?

jmchilton commented 5 years ago

https://github.com/jmchilton/galaxy/commit/778b1c293c51197a2db5c5c1c31bc37b43c7a340 fails the unit tests 😢

mvdbeek commented 5 years ago

Fixed in https://github.com/galaxyproject/galaxy/pull/7995

peterjc commented 5 years ago

This should have been fixed on the dev branch, right? I'm still seeing the problem with a matplotlib generated PDF file https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-493954485

Should I upload a small PDF file as a test case?

mvdbeek commented 5 years ago

Thanks @peterjc, this is because the test interactor assumes utf-8 encoded text data when using contains. https://github.com/galaxyproject/galaxy/pull/8010 is going to fix that