ebi-ait / hca-to-scea-tools

Helpers and tools to assist in the conversion of HCA datasets into SCEA

Conversion hangs after clicking on 'Process' on web interface #23

Closed rays22 closed 3 years ago

rays22 commented 3 years ago

Description of the problem:

The conversion never seems to complete after clicking 'Process' on the web interface (http://127.0.0.1:5000/). (Screenshot: hca-to-scea-tool) This happens at step 3.d. in the SOP https://ebi-ait.github.io/hca-ebi-wrangler-central/SOPs/hca_to_scea_tools_SOP.html. There are csv files (and an xlsx file) in a newly created directory: ebi-ait/hca-to-scea-tools/hca2scea-backend/spreadsheets/GSE111976-endometrium_MC-10x

hca2scea-backend@1.0.0 start /home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend
node backend.js
App is listening on port 5000
extracting CSV sheets from GSE111976-endometrium_MC-10x.xlsx
sheets in that XLS: [ 'Project', 'Project - Contributors', 'Project - Publications', 'Project - Funders', 'Donor organism', 'Specimen from organism', 'Cell suspension', 'Supplementary file', 'Sequence file', 'Collection protocol', 'Dissociation protocol', 'Enrichment protocol', 'Library preparation protocol', 'Sequencing protocol' ]
CSV Extracted
projectDetails {}
newProjectDetails { accession: '999', curators: [ 'RS' ] }
Info JSON saved
converting HCA spreadsheet from spreadsheets/GSE111976-endometrium_MC-10x
[HCA-TO-SCEA SH LAUNCHER] ACTIVATING VIRTUALENV
[HCA-TO-SCEA SH LAUNCHER] LAUNCHING SCRIPT
working at ./spreadsheets/GSE111976-endometrium_MC-10x
14 spreadsheets loaded
Index([], dtype='object')
['collection_protocol', 'dissociation_protocol', 'enrichment_protocol', 'library_preparation_protocol', 'sequencing_protocol']
collection_protocol ['collection_protocol']
dissociation_protocol ['dissociation_protocol']
enrichment_protocol ['enrichment_protocol_1']
library_preparation_protocol ['10x_library_protocol']
sequencing_protocol ['GPL24676']
./hca-to-scea.sh: line 4: ./venv/bin/activate: No such file or directory
Sheets converted

ami-day commented 3 years ago

Thanks @rays22 , can you please send me the spreadsheet you used as input so I can recapitulate locally?

ami-day commented 3 years ago

Oh, never mind! just saw you already provided a link to it!

ami-day commented 3 years ago

@rays22 Ok, it wasn't working for this dataset because:

  1. the technology map hasn't yet been updated in the master branch (it's part of ongoing update work). I just updated the map in the master branch to recognise 10X 3' v3 sequencing, so if you pull the latest, it should work now. You can then check the barcode values in the browser, and edit them if they don't look right.

  2. the script doesn't seem to progress when you click 'process!' when there is an enrichment protocol tab. I removed this tab from the sheet as a test, and the script worked as expected. For now, can you please delete this tab? I will record this as a bug for a sprint where we continue our work on the scea tool updates. I can show you how to add the enrichment protocol manually, which on its own is a small task.

I should also guide you through some extra fields which need to be manually curated. The updates we're working on are meant to further automate these steps, but they are not yet ready to be pushed to master.

ami-day commented 3 years ago

Added the output idf and sdrf files I get in this folder (no manual curation): https://drive.google.com/drive/folders/1kxtAeVMp7HDJ6HyGFsQ3PRRra2R81wuf

ami-day commented 3 years ago

Closing as this is solved.

rays22 commented 3 years ago

Thank you for looking into it.

node backend.js
App is listening on port 5000
fs.js:885
return binding.mkdir(pathModule._makeLong(path),
Error: ENOENT: no such file or directory, mkdir 'spreadsheets/GSE111976-endometrium_MC-10x_no_enrichment'
    at Object.fs.mkdirSync (fs.js:885:18)
    at DiskStorage.destination [as getDestination] (/home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend/backend.js:25:8)
    at DiskStorage._handleFile (/home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend/node_modules/multer/storage/disk.js:31:8)
    at /home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend/node_modules/multer/lib/make-middleware.js:144:17
    at allowAll (/home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend/node_modules/multer/index.js:8:3)
    at wrappedFileFilter (/home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend/node_modules/multer/index.js:44:7)
    at Busboy. (/home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend/node_modules/multer/lib/make-middleware.js:114:7)
    at emitMany (events.js:147:13)
    at Busboy.emit (events.js:224:7)
    at Busboy.emit (/home/ray/ebi-ait/hca-to-scea-tools/hca2scea-backend/node_modules/busboy/lib/main.js:38:33)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! hca2scea-backend@1.0.0 start: node backend.js
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the hca2scea-backend@1.0.0 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR! /home/ray/.npm/_logs/2021-02-25T12_16_04_757Z-debug.log

rays22 commented 3 years ago

  1. Error: ENOENT: no such file or directory, mkdir 'spreadsheets/GSE111976-endometrium_MC-10x_no_enrichment'

  2. Error: EEXIST: file already exists, mkdir 'spreadsheets/GSE111976-endometrium_MC-10x_no_enrichment'

  3. ./install.sh: /home/ray/.local/bin/virtualenv: /home/ray/ebi-ait/hca-to-scea-tools/venv/bin/python3: bad interpreter: No such file or directory
     ./install.sh: line 6: ./venv/bin/activate: No such file or directory

I have fixed bugs 1-3 manually and in a local branch and the errors have disappeared, but the conversion still hangs with spreadsheet GSE111976-endometrium_MC-10x_no_enrichment.xlsx
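For reference, the ENOENT and EEXIST errors are the two classic ways a plain, non-recursive mkdir fails: the parent directory is missing, or the target already exists. The backend itself is Node, so the following is only a minimal Python sketch of the same pattern, not the fix applied in the local branch; `ensure_upload_dir` and `demo` are hypothetical names used for illustration.

```python
import os
import tempfile


def ensure_upload_dir(path):
    """Create `path` with any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


def demo():
    """Reproduce both failure modes in a throwaway directory, then avoid them."""
    seen = []
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(
            root, "spreadsheets", "GSE111976-endometrium_MC-10x_no_enrichment"
        )
        try:
            os.mkdir(target)  # parent "spreadsheets" is missing
        except FileNotFoundError:
            seen.append("ENOENT")
        ensure_upload_dir(target)  # creates the parent and the target
        try:
            os.mkdir(target)  # target now exists
        except FileExistsError:
            seen.append("EEXIST")
        ensure_upload_dir(target)  # safe to call again
    return seen
```

Calling `demo()` collects one entry for each failure mode, while `ensure_upload_dir` succeeds in both situations. Newer Node releases offer a comparable `recursive` option on `fs.mkdirSync`.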

I have tried to convert another spreadsheet, and that gets (at least) to the Force a Project UUID stage. I will resume testing and debugging on another day.

ami-day commented 3 years ago

Thanks @rays22. Weirdly, points 1 and 2 are not a problem when I run it in my local repo - I tried to recapitulate both (deleted the spreadsheets folder, ran the same spreadsheet through the tool twice), and there was no error message, and the required files were generated. As for the 3rd point, I'm not sure. I think it'd be worth asking @yusra-haider about it, since she has been working on this repo recently. She probably has a better understanding of the install script and npm. For now, would it be possible to use the output files I generated and added to the shared folder linked above? I will get back to solving this the next time the tool updates are prioritised in a sprint.

ami-day commented 3 years ago

Yusra has installed an all-user version on EC2. We have also installed our upgraded version, which is not interactive. Both will be available. I'm testing it now, and will add the newly generated files to the above shared google folder. Will then update SOP with both versions.

yusra-haider commented 3 years ago

just leaving this here for documentation purposes:

steps taken for installing hca-to-scea tool on EC2 :

  1. install nvm and nodejs following this guide https://www.digitalocean.com/community/tutorials/how-to-install-node-js-on-ubuntu-16-04#how-to-install-using-nvm and taking the nvm version from here: https://github.com/nvm-sh/nvm/releases
  2. installed npm using sudo apt install aptitude followed by sudo apt install npm (as per this answer here: https://askubuntu.com/a/978353, because I was running into dependency issues)
  3. cd /data/tools and git clone and set up hca-to-scea tools there.

One current issue is running npm start to start the application, which regular users can't execute because I did the installation as the ubuntu user :pensive: I'm thinking of just running the app as a linux service to get around this, so the wranglers won't even have to run the npm start command -- any feedback on this @ebi-ait/hca-dev?

ami-day commented 3 years ago

thanks for setting this up so quickly!

ami-day commented 3 years ago

Recording example command:

python3 script.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -c RS -tt 10Xv3_3 -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12

Error:

spreadsheets = extract_csv_from_spreadsheet(work_dir, args.spreadsheet)
raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

Seems like xlrd no longer supports .xlsx. Probably better to use openpyxl which is compatible.
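A minimal sketch of what the switch could look like, assuming the sheets are loaded with pandas (the later traceback shows pandas in the virtualenv, but how `extract_csv_from_spreadsheet` actually reads the file is not shown here). `pick_excel_engine` and `load_sheets` are hypothetical helper names; `engine` and `sheet_name=None` are real `pandas.read_excel` parameters.

```python
from pathlib import Path


def pick_excel_engine(path):
    """Choose a pandas.read_excel engine from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xls":
        return "xlrd"  # xlrd 2.x only reads the legacy .xls format
    if suffix in {".xlsx", ".xlsm"}:
        return "openpyxl"
    raise ValueError(f"unsupported spreadsheet type: {suffix!r}")


def load_sheets(path):
    """Return a dict mapping each sheet name to a DataFrame."""
    import pandas as pd  # imported here so the helper above has no dependencies

    # sheet_name=None asks pandas to read every sheet in the workbook
    return pd.read_excel(path, sheet_name=None, engine=pick_excel_engine(path))
```

With this, `load_sheets("/home/aday/GSE111976-endometrium_MC_SCEA.xlsx")` would go through openpyxl, provided openpyxl is installed in the virtualenv.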

yusra-haider commented 3 years ago

@ami-day I've fixed the issue above, so it should work now:

(venv) ubuntu@ip-172-31-71-222:/data/tools/hca-to-scea-tools/hca2scea-backend$ python script.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -c RS -tt 10Xv3_3 -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12
Converting sheets in excel file to dataframes...
14 sheets converted to dataframes
saving script_spreadsheets/GSE111976-endometrium_MC_SCEA/E-HCAD-28.idf.txt
saving script_spreadsheets/GSE111976-endometrium_MC_SCEA/E-HCAD-28.sdrf.txt

also, python3 doesn't need to be specified anymore. I've set up the virtualenv with python 3.6.13, so just python will work now

rays22 commented 3 years ago

@yusra-haider @ami-day The spreadsheet conversion still does not work for me on the EC2. I hit a write permission problem on the EC2:

(venv) /data/tools/hca-to-scea-tools/hca2scea-backend$ python script.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -c RS -tt 10Xv3_3 -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12
Converting sheets in excel file to dataframes...
14 sheets converted to dataframes
Traceback (most recent call last):
  File "script.py", line 592, in <module>
    main()
  File "script.py", line 587, in main
    project_details = prepare_protocol_map(work_dir, spreadsheets, project_info, tracking_sheet, args)
  File "script.py", line 156, in prepare_protocol_map
    big_table = create_big_table(work_dir, spreadsheets)
  File "script.py", line 145, in create_big_table
    big_table.to_csv(f"{work_dir}/big_table.csv", index=False, sep=";")
  File "/data/tools/hca-to-scea-tools/hca2scea-backend/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 3170, in to_csv
    formatter.save()
  File "/data/tools/hca-to-scea-tools/hca2scea-backend/venv/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 190, in save
    compression=dict(self.compression_args, method=self.compression),
  File "/data/tools/hca-to-scea-tools/hca2scea-backend/venv/lib/python3.6/site-packages/pandas/io/common.py", line 493, in get_handle
    f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
PermissionError: [Errno 13] Permission denied: 'script_spreadsheets/GSE111976-endometrium_MC_SCEA/big_table.csv'

mshadbolt commented 3 years ago

I have put some instructions on running the tool in the EC2 into our documentation here: https://github.com/ebi-ait/hca-ebi-wrangler-central/blob/master/docs/SOPs/hca_to_scea_tools_SOP.md#on-the-wrangler-ec2

I did a test run using the ui and was able to generate the idf and sdrf files. However, when you click the 'this looks alright' button, it isn't very obvious that anything has happened. After I realised this, I checked the /data/tools/hca-to-scea-tools/hca2scea-backend/spreadsheets/ directory and saw the files had been created.

I then used cyberduck to download the files I wanted to my local machine. (I couldn't get scp/rsync to work, not really sure why)

I do have a question: What happens if two wranglers try to use the ui running on the ec2 at the same time? Would that cause any issues?

ami-day commented 3 years ago

Example command (works for Ami):

python3 script.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -c RS -tt 10Xv3_3 -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12

ami-day commented 3 years ago

(Above -s command needs to point to where this example spreadsheet is on EC2)

ami-day commented 3 years ago

@rays22 Did you try this command with python3?

ami-day commented 3 years ago

Good point about two wranglers using the ui at the same time. I'm not sure. I think @yusra-haider will have a better idea. This should not be an issue for the python command-line version. @yusra-haider we have decided to keep both the npm start and command-line versions, so wranglers can try both and see what they prefer for now. They know that the command-line version has the upgrades and the npm version doesn't.

lauraclarke commented 3 years ago

@ami-day I thought we decided that one was better than two so long as it was functional and documented?

ami-day commented 3 years ago

@lauraclarke I understood we agreed both, as @rays22 wanted to try using the interactive version too, to see what they are both like. However, I am happy to keep only the command-line option. That is what I intend to always use until we get it integrated with the UI.

ami-day commented 3 years ago

Working on the documentation for command-line version later today.

rays22 commented 3 years ago

@ami-day: Please note that I do not prefer the interactive version to the command-line one. Furthermore, I would prefer a version that is maintainable and works on the EC2. My understanding from the meeting was that it is the command-line version that is maintainable, so my vote goes to that one. The issue I have is still the same as above: https://github.com/ebi-ait/hca-to-scea-tools/issues/23#issuecomment-788845131

I can also confirm that I am getting the same error on the EC2 as yesterday. The python command invokes python3 for me on the EC2, so it does not make any difference:

python --version
Python 3.6.13

amnonkhen commented 3 years ago

I noticed that when a Wrangler runs this tool the artifacts created are owned by the Wrangler's user account and their own group.

ubuntu@ip-172-31-71-222:/data/tools/hca-to-scea-tools/hca2scea-backend/script_spreadsheets/GSE111976-endometrium_MC_SCEA$ ls -ld
drwxrwxr-x 2 aday aday 6144 Mar  1 14:47 .

ubuntu@ip-172-31-71-222:/data/tools/hca-to-scea-tools/hca2scea-backend/script_spreadsheets/GSE111976-endometrium_MC_SCEA$ ls -l
total 76
-rw-rw-r-- 1 aday aday 52702 Mar  1 14:31 big_table.csv
-rw-rw-r-- 1 aday aday  6745 Mar  1 14:31 E-HCAD-28.idf.txt
-rw-rw-r-- 1 aday aday 13666 Mar  1 14:31 E-HCAD-28.sdrf.txt

Consequently, if a different Wrangler tries to run the same command, they would get a write permission error, because the files are not writeable.
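To make the reasoning concrete, here is a small sketch of the standard owner/group/other check applied to the modes in the listing above. It is a simplification that ignores root and ACLs, `can_write` is a hypothetical helper, and the numeric uid/gid values in the usage note are made up for illustration.

```python
import stat


def can_write(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a user (uid, set of gids) may write, per POSIX mode bits.

    Only the first matching class applies: owner, then group, then other.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# -rw-rw-r-- corresponds to 0o664, drwxrwxr-x to 0o775
FILE_MODE = 0o664
DIR_MODE = 0o775
```

For example, with the files owned by uid 1001 and gid 1001, `can_write(FILE_MODE, 1001, 1001, 1002, {1002})` is False for a second wrangler outside that group, while the same call with `{1001, 1002}` as the gids is True. The same holds for `DIR_MODE`, so a wrangler outside the owning group cannot create new files in the directory either.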

A possible fix would be:

ami-day commented 3 years ago

That sounds good @amnonkhen !

@rays22 I have had a first go at the new SCEA documentation, in the hca-to-scea tools README. Curating after the automated conversion is quite an intricate process - I still get things not quite right from time to time, and Anja, Nancy or Silvie review and make minor edits. Would it be possible to try it out, following the guide I wrote, and send me your curated files? I can then look and see where I need to improve my documentation.

ami-day commented 3 years ago

I wonder if it's worth adding Anja, Nancy, Silvie to the EC2 as users for this tool, if they decide they want to curate more datasets from HCA? @lauraclarke @clairerye what do you think? The tool scans the tracker sheet for existing E-HCAD-ids so hopefully it wouldn't result in confusion between ids.
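As a rough illustration of the id scan, the sketch below picks the next free accession from a list of tracker cell values. This is an assumption about the general approach, not the tool's actual code: `next_accession` is a hypothetical name, and choosing "highest existing number plus one" is a guess at the intended behaviour.

```python
import re

# Matches accessions such as E-HCAD-28 and captures the numeric part
ACCESSION_RE = re.compile(r"E-HCAD-(\d+)")


def next_accession(cells):
    """Return the accession one above the highest E-HCAD number found in `cells`."""
    numbers = [
        int(match.group(1))
        for cell in cells
        for match in ACCESSION_RE.finditer(str(cell))
    ]
    # With no existing ids, numbering starts at 1
    return f"E-HCAD-{max(numbers, default=0) + 1}"
```

With tracker values such as `["E-HCAD-27", "E-HCAD-28", "n/a"]` this returns `"E-HCAD-29"`. Two people running the tool before either updates the tracker would still get the same id, so the tracker needs to be kept current.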

rays22 commented 3 years ago

Thanks for updating the document, @ami-day. I will follow the new SCEA documentation and send you the curated file for review.

ami-day commented 3 years ago

I'm trying to add example files using git add and I am getting this error:

(venv) aday@ip-172-31-71-222:/data/tools/hca-to-scea-tools/hca2scea-backend$ git add ../examples/*
error: insufficient permission for adding an object to repository database .git/objects
error: examples/HCAD_E-HCAD-23_E-HCAD-23.idf.txt: failed to insert into database
error: unable to index file examples/HCAD_E-HCAD-23_E-HCAD-23.idf.txt
fatal: adding files failed
(venv) aday@ip-172-31-71-222:/data/tools/hca-to-scea-tools/hca2scea-backend$

I think I don't have the right permissions?

ami-day commented 3 years ago

Since we are deprecating the flask version of the tool, I am closing this issue. Interactive features will gradually be added to the UI version of this tool, which is yet to be integrated with ingest.