Duke-GCB / DukeDSClient

Command line program to allow uploading, downloading, and managing projects in the duke-data-service.
MIT License

can't encode character u'\xa0' in position xxx: ordinal not in range(128)/bin/sh: ddsclient: command not found #68

Closed jonturneratduke closed 8 years ago

jonturneratduke commented 8 years ago

While attempting to upload a hierarchical folder structure today, we encountered the following error:

Progress: 1% - sending /Volumes/all_staff/Robert Bastidas/Manuscripts/CSHL : Bacterial pathogenesis 2012/Tables/._Table 1.xlsx 'ascii' codec can't encode character u'\xa0' in position 120: ordinal not in range(128)/bin/sh: ddsclient: command not found

Is the problem with the filename or the contents?

johnbradley commented 8 years ago

You can add the debug: True flag to your ~/.ddsclient file to see where the error is. I will see if I can reproduce later tonight.

johnbradley commented 8 years ago

I'm willing to bet it is a filename or folder name. I don't see any reason it would be translating the bytes within a file. I tried reproducing the error by creating a file with '\n', but I started getting 500 errors back. I'll try again tomorrow.

jonturneratduke commented 8 years ago

thx. will know more later today... update: Ran the tool today and it crashed before uploading the first file:

$ tail /tmp/dds-upload.log
    remote_id = file_content_sender.upload(self.project_id, parent.kind, parent.remote_id)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 57, in upload
    chunk_processor.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 182, in run
    wait_for_processes(processes, num_chunks, progress_queue, self.watcher, self.local_file)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 175, in wait_for_processes
    watcher.transferring_item(item, increment_amt=chunk_size)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 55, in transferring_item
    sys.stdout.write(message.ljust(self.max_width))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 120: ordinal not in range(128)

Oddly, we immediately re-ran the command and it started uploading.

jonturneratduke commented 8 years ago

more results. It looks like two different problems to me: the filename issue, and a dropped connection. The filename looks reasonable to me:

01-CTL2M522 and 687 backcrossing and linkage analysis.docx

Uploading 0 projects, 1427 folders, 29116 files.

Progress: 0% - sending /Volumes/all_staff/Robert Bastidas/Electronic notebook/.DS_Store
Progress: 0% - sending /Volumes/all_staff/Robert Bastidas/Electronic notebook/Essential gene set project LGV L2/04-Sequencing CTL2M921-CTL2M1712 with 24 bp barcoded seq library.docx
Progress: 0% - sending /Volumes/all_staff/Robert Bastidas/Electronic notebook/NFKB inducing mutants/CTL2M687/01-CTL2M522 and 687 backcrossing and linkage analysis.docx              Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/ddsclient", line 11, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/__main__.py", line 15, in main
    client.run_command(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/ddsclient.py", line 33, in run_command
    parser.run_command(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/cmdparser.py", line 309, in run_command
    parsed_args.func(parsed_args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/ddsclient.py", line 59, in <lambda>
    return lambda args: self._run_command(command_constructor, args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/ddsclient.py", line 68, in _run_command
    command.run(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/ddsclient.py", line 97, in run
    project_upload.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/upload.py", line 58, in run
    sender.walk_project(self.local_project)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/upload.py", line 287, in walk_project
    ProjectWalker.walk_project(project, self)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 80, in walk_project
    ProjectWalker._visit_content(project, None, visitor)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 98, in _visit_content
    ProjectWalker._visit_content(child, item, visitor)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 98, in _visit_content
    ProjectWalker._visit_content(child, item, visitor)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 98, in _visit_content
    ProjectWalker._visit_content(child, item, visitor)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 98, in _visit_content
    ProjectWalker._visit_content(child, item, visitor)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 95, in _visit_content
    visitor.visit_file(item, parent)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/upload.py", line 320, in visit_file
    remote_id = file_content_sender.upload(self.project_id, parent.kind, parent.remote_id)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 57, in upload
    chunk_processor.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 182, in run
    wait_for_processes(processes, num_chunks, progress_queue, self.watcher, self.local_file)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 175, in wait_for_processes
    watcher.transferring_item(item, increment_amt=chunk_size)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/util.py", line 55, in transferring_item
    sys.stdout.write(message.ljust(self.max_width))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 120: ordinal not in range(128)

and from the re-run on Friday: 
`...7% - sending /Volumes/all_staff/Robert Bastidas/Project DATA/Erk activation/S.cerevisiae C.t ORF screens/Caffeine screen/Results/Cell Wall Integrity screen/HeLa CProgress: 7% - sending /Volumes/all_staff/Robert Bastidas/Project DATA/Erk activation/S.cerevisiae C.t ORF screens/Caffeine screen/Results/Cell Wall Integrity screen/HeLa CProgress: 7% - sending /Volumes/all_staff/Robert Bastidas/Project DATA/Erk activation/S.cerevisiae C.t ORF screens/Caffeine screen/Results/Cell Wall Integrity screen/HeLa CProgress: 7% - sending /Volumes/all_staff/Robert Bastidas/Project DATA/Erk activation/S.cerevisiae C.t ORF screens/Caffeine screen/Results/Cell Wall Integrity screen/HeLa C
Progress: 7% - sending /Volumes/all_staff/Robert Bastidas/Project DATA/Erk activation/S.cerevisiae C.t ORF screens/Caffeine screen/Results/Cell Wall Integrity screen/HeLa CProgress: 7% - sending /Volumes/all_staff/Robert Bastidas/Project DATA/Erk activation/S.cerevisiae C.t ORF screens/Caffeine screen/Results/Cell Wall Integrity screen/HeLa CT621 synchronization/CT621_EGFP HeLa synchronization images/9.5 hours post release/._T9.5 HeLa DIC.tif         Process Process-1917:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 249, in upload_async
    error_msg = sender.send()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 291, in send
    error_msg = self._send_chunk(chunk, chunk_num)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 310, in _send_chunk
    FileUploader.send_file_external(self.data_service, resp.json(), chunk)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/fileuploader.py", line 80, in send_file_external
    resp = data_service.send_external(http_verb, host, url, http_headers, chunk)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ddsc/core/ddsapi.py", line 405, in send_external
    return requests.put(host + url, data=chunk, headers=http_headers)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 124, in put
    return request('put', url, data=data, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/adapters.py", line 453, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(22, 'Invalid argument'))`
johnbradley commented 8 years ago

UnicodeEncodeError: This error occurs when we try to print the progress update with the filename. '01-CTL2M522 and 687 backcrossing and linkage analysis.docx' is the file before the bad one. Could you look in that directory to see if there are any oddly named files? This might help:

find . -name $'*\n*'

You could try uploading just one directory or one file at a time in the bad directory to figure out where it fails. Note that 0xa0 is the non-breaking space (U+00A0), not the newline character (which is 0x0a), so the find command above, which matches newlines, would miss it. I tried uploading a project with a newline character in a filename but got back a 500 error from the DDS command that completes the upload.
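The character reported in the traceback, u'\xa0', is U+00A0 (NO-BREAK SPACE), so a search for any non-ASCII character in file and folder names may be more useful than a search for newlines. The following is a minimal sketch of such a search, not part of ddsclient; the function names `describe_non_ascii` and `find_suspect_names` are hypothetical.

```python
import os
import unicodedata


def describe_non_ascii(name):
    """Return (index, code point, Unicode name) for each non-ASCII character in name."""
    return [
        (index, "U+%04X" % ord(char), unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(name)
        if ord(char) > 127
    ]


def find_suspect_names(root):
    """Yield (path, details) for each file or folder under root whose name has non-ASCII characters."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            details = describe_non_ascii(name)
            if details:
                yield os.path.join(dirpath, name), details
```

Calling `list(find_suspect_names("."))` from the top of the upload folder would list every candidate name along with the position and identity of each offending character.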

ConnectionError: I think this is probably related to the first item. This is the point at which we send the data to Swift. We send the bytes (which, as I understand it, can't be invalid) to the URL that DDS returns when we request an upload URL. So perhaps we are getting back an error or something unexpected there.

Changes: I will change the progress printer to catch the UnicodeEncodeError and strip non-printables out. I still feel like it should fail, but perhaps even earlier, unless you want to support non-printables in DDS. I will also change the ConnectionError handling to print out the URL and HTTP headers.
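A rough sketch of the progress-printer idea, assuming the progress message is written straight to a text stream as in the traceback above. This is not the actual ddsclient change; `write_progress` is a hypothetical name, and the sketch replaces unencodable characters with "?" rather than stripping them, so the gap stays visible.

```python
import sys


def write_progress(message, stream=None):
    """Write message to stream, degrading gracefully if the stream cannot encode it."""
    stream = stream if stream is not None else sys.stdout
    try:
        stream.write(message)
    except UnicodeEncodeError:
        # Assumption: fall back to ASCII when the stream does not report an encoding.
        encoding = getattr(stream, "encoding", None) or "ascii"
        # Swap each unencodable character for "?" and try again.
        safe_message = message.encode(encoding, errors="replace").decode(encoding)
        stream.write(safe_message)
```

The trade-off is that the printed name no longer matches the name on disk exactly, which matters if the output is used as a record of what was uploaded.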

jonturneratduke commented 8 years ago

Thanks John. Interesting. I'll follow up this afternoon.

jonturneratduke commented 8 years ago

No files found with newline in either folder.

rjb-macbook-pro:CTL2M522 Robert_Bastidas$ pwd
/Volumes/all_staff/Robert Bastidas/Electronic notebook/NFKB inducing mutants/CTL2M522
rjb-macbook-pro:CTL2M522 Robert_Bastidas$ ls -la
total 1992
drwx------ 1 Robert_Bastidas staff 16384 Jul 12 11:31 .
drwx------ 1 Robert_Bastidas staff 16384 May 24 12:04 ..
-rwx------@ 1 Robert_Bastidas staff 758970 May 26 11:23 01-CTL2M522 and 687 backcrossing and linkage analysis.docx
-rwx------@ 1 Robert_Bastidas staff 116787 Jul 12 11:28 02_Generating cdu1 (ChlaDUB1) complementation constructs.docx
-rwx------@ 1 Robert_Bastidas staff 108775 Jul 12 11:31 03_Generating cdu1 (ChlaDUB1) TargeTron vectors.docx
rjb-macbook-pro:CTL2M522 Robert_Bastidas$

rjb-macbook-pro:S.cerevisiae C.t ORF screens Robert_Bastidas$ pwd
/Volumes/all_staff/Robert Bastidas/Project DATA/Erk activation/S.cerevisiae C.t ORF screens
rjb-macbook-pro:S.cerevisiae C.t ORF screens Robert_Bastidas$ ls -l
total 656
-rwx------ 1 Robert_Bastidas staff 60257 May 17 2011 CT ORF's with phenotypes in w303.xlsx
-rwx------ 1 Robert_Bastidas staff 133120 Jan 19 2012 C_trachomatis predicted effectors.xls
drwx------ 1 Robert_Bastidas staff 16384 Dec 16 2011 Caffeine screen
-rwx------ 1 Robert_Bastidas staff 39379 Dec 28 2011 DLY strains.xlsx
-rwx------ 1 Robert_Bastidas staff 52946 Jul 28 2010 Ideas for yeast screen.docx
drwx------ 1 Robert_Bastidas staff 16384 Jul 17 2012 Lethal ORFs
drwx------ 1 Robert_Bastidas staff 16384 Dec 16 2011 Published and Ine's results
rjb-macbook-pro:S.cerevisiae C.t ORF screens Robert_Bastidas$

johnbradley commented 8 years ago

I have a branch with some extra logging enabled we can try to run to see what filename it doesn't like. https://github.com/Duke-GCB/DukeDSClient/tree/weird_filename

@jonturneratduke Why don't we take a look at this together to see what is going on with this directory?

johnbradley commented 8 years ago

Looked at this issue. I was able to upload the folders that failed. We are uploading from a mounted volume. I think the issue is related to this. Perhaps we are losing our mount and we get back invalid filenames?

Unfortunately I haven't been able to reproduce it, as we are getting 503 errors for /folder/id/children. Jon was able to see that Heroku had just restarted DukeDS. We have apparently exhausted the allocated memory.

Saw an issue where a .DS_Store file changed while being uploaded, and this produced an error from the backend. Perhaps a config setting to skip hidden files might help with this particular instance. I think this was because I had a Finder window open on the folder.

I will try to reproduce the issue by testing with a thumb drive and seeing if the same issue happens when the volume disappears. I will also test re-uploading large directory trees.

I was able to see that it takes a really long time to pull all the files down from DukeDS for large directory trees.

johnbradley commented 8 years ago

Got message that we might be able to improve efficiency of get children by not passing the name_contains with an empty string. This might help with the 503 errors.

johnbradley commented 8 years ago

Avoid Sub-Projects: While looking at this yesterday I noticed that we were uploading folders containing projects into one DukeDSProject. For example, CSHL : Bacterial pathogenesis 2012 seems like it should be its own DukeDSProject. This will make sharing just one project with another user possible.

The initial check to determine what files to upload was taking forever. This is due to the time it takes to request folder children and file hashes from DukeDS. Listing the contents of a folder takes between 0.25 and 1.5 seconds, while getting the hash for each file takes 0.25 to 0.8 seconds.
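To put those timings in perspective, here is a back-of-the-envelope estimate for the upload reported earlier in this thread (1427 folders, 29116 files). It assumes one listing request per folder and one hash request per file, issued one after another; both are assumptions, not measured behavior of ddsclient. Under those assumptions the initial check alone comes to roughly two to seven hours.

```python
def estimate_check_hours(num_folders, num_files, secs_per_folder, secs_per_file):
    """Estimate the initial check time in hours, assuming strictly sequential requests."""
    total_seconds = num_folders * secs_per_folder + num_files * secs_per_file
    return total_seconds / 3600.0


# Counts come from the "Uploading 0 projects, 1427 folders, 29116 files." line above.
best_case = estimate_check_hours(1427, 29116, 0.25, 0.25)
worst_case = estimate_check_hours(1427, 29116, 1.5, 0.8)
print("Initial check: about %.1f to %.1f hours" % (best_case, worst_case))
```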

You should create a script to upload each project folder into its own DukeDSProject. This will reduce the time spent asking DukeDS for directory contents and file hashes.
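One possible shape for such a script is sketched below. It assumes each immediate subfolder of a parent directory is one project and reuses the folder name as the project name (an assumption; pick whatever naming fits). The `ddsclient upload -p <project> <folder>` form is the one used elsewhere in this thread; the function names are hypothetical.

```python
import os
import subprocess


def build_upload_commands(parent_dir):
    """Return one ddsclient upload command per visible subfolder of parent_dir."""
    commands = []
    for name in sorted(os.listdir(parent_dir)):
        path = os.path.join(parent_dir, name)
        # Skip plain files and hidden folders; only folders become projects.
        if os.path.isdir(path) and not name.startswith("."):
            commands.append(["ddsclient", "upload", "-p", name, path])
    return commands


def upload_all(parent_dir, dry_run=True):
    """Print the commands when dry_run is True, otherwise run them one at a time."""
    for command in build_upload_commands(parent_dir):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.check_call(command)
```

Running with `dry_run=True` first makes it easy to review the project names before anything is sent.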

DukeDSClient is not Dropbox: The use case you are trying to achieve (as I understand it) is to function similarly to Dropbox. This is out of the scope of the initial development done on DukeDSClient. There has been no work to support users modifying files while they are being uploaded. There is no mechanism for keeping track of which files have been uploaded. All of the commercial implementations of this I have seen run as a service. ddsclient (as of now) is meant to run from the command line and exit. You are more than welcome to contribute changes to meet this need.

Forthcoming Changes: I am going to make changes to skip unnecessary OS X files (#70). The original error occurs while printing out the progress bar. I will make changes to strip out non-printables (such as '\xa0' above) when printing the progress bar, but if the issue is related to instability with the remote mounted folder it may not help.

johnbradley commented 8 years ago

@jonturneratduke could you try changing Robert's Terminal encoding to UTF-8? If you click Terminal -> Preferences you can adjust this setting.

I was unable to reproduce this from the command line until @dleehr told me about this setting.

This will tell you a terminal's current settings:

echo $LANG

You want it to look like this:

en_US.UTF-8
johnbradley commented 8 years ago

How to reproduce the original error, with ddsclient installed and running on top of Python 2.7:

mkdir /tmp/issue68
python -c "open(u'/tmp/issue68/test\xa0.txt','w')"
export LANG=en_US.US-ASCII
ddsclient upload -p issue68 /tmp/issue68

Results in

...
    sys.stdout.write(message.ljust(self.max_width))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 43: ordinal not in range(128)

I made a change to fix the progress bar to strip non-printable characters but then the problem just moved to the printed report of files that were uploaded. I really don't like the idea of having the report of files uploaded being inaccurate.

Since unicode support is an expected part of DukeDS, I think the best solution is to error out immediately if the user is using the ASCII encoding for their terminal. The error message will contain a link to instructions for setting the encoding to a proper value.
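A minimal sketch of what such an up-front check could look like. This is an illustration under stated assumptions, not the code that went into ddsclient: the function names and the error text are hypothetical, and a stream that reports no encoding is treated as ASCII here because that is what Python 2 falls back to.

```python
import codecs
import sys


def is_ascii_encoding(encoding_name):
    """Return True if encoding_name is missing or is an alias of ASCII."""
    if not encoding_name:
        # Assumption: no reported encoding is treated as ASCII (the Python 2 fallback).
        return True
    try:
        return codecs.lookup(encoding_name).name == "ascii"
    except LookupError:
        return False


def check_terminal_encoding(stream=None):
    """Exit with an explanatory message if the output stream can only encode ASCII."""
    stream = stream if stream is not None else sys.stdout
    if is_ascii_encoding(getattr(stream, "encoding", None)):
        raise SystemExit(
            "Terminal encoding is ASCII; set a UTF-8 locale "
            "(for example LANG=en_US.UTF-8) and try again."
        )
```

Checking once at startup means the failure happens before any files are sent, rather than partway through a long upload.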

hlapp commented 8 years ago

Since unicode support is an expected part of DukeDS, I think the best solution is to error out immediately if the user is using the ASCII encoding for their terminal.

👍

dleehr commented 8 years ago

Is this fixed by #71 ?

johnbradley commented 8 years ago

#71 fixes the original error at the top.

#70 fixes uploading the '.DS_Store' and other unnecessary files.

That still leaves how long it takes to request the hashes for projects with many files while uploading. That will have to wait for this DukeDS API change: https://github.com/Duke-Translational-Bioinformatics/duke-data-service/issues/671 Marking this closed and will open new issue if/when that change is implemented.

shauryayellam commented 6 years ago

Hi John,

I am trying to write data to an HDFS location and get the exception below. Could you please help me with it? Thank you.

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/data/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/data/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/data/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/data/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1494, in func
  File "/apps/code/sqoop/script/bad/scripts/db2_hive_may25.py", line 510, in
    db2_data_df.rdd.map(lambda record: reformatRow(record,numberOfColumns)).saveAsTextFile("/data/user/hive/warehouse/SparkExtractionTextFile/"+table_name.lower())
  File "/apps/code/sqoop/script/bad/scripts/db2_hive_may25.py", line 460, in reformatRow
    formattedRow+=str(record[i])+'~' #record[i].encode("utf-8")+"~"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 70: ordinal not in range(128)

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:129)
    at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:125)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
dleehr commented 6 years ago

@shauryayellam Thanks for your interest, I realize that Python encoding issues are challenging, but this repo (DukeDSClient) is not the place to get help on Spark/HDFS issues.