Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

Conda dropseq #41

Closed cgirardot closed 6 years ago

cgirardot commented 6 years ago

Hi @Hoohm , I am using your nice pipeline to process dropseq and decided to contribute on one of your TODO. I have added drop-seq-tools in conda and adapted your pipeline accordingly. The modifications are as follows:

  1. created a dropseq_tools.yaml env
  2. moved drop-seq-tools-wrapper.sh in scripts dir and adapted it to directly use drop-seq-tools exposed by conda. The tmp_dir variable is now injected in the java command line using _JAVA_OPTIONS
  3. changed all relevant .smk to get drop-seq_tools from conda and call drop-seq-tools-wrapper.sh ; config file also adapted accordingly to remove the now useless param

I also changed the following :

  1. Improved the output and error cluster file names in cluster.yaml
  2. STAR indexing now runs remote

I hope you'll like the changes

Hoohm commented 6 years ago

Hello @cgirardot ! Thank you so much for this. I was actually looking at it lately but didn't finish it! This is amazing! Now the pipeline should be snakemake "workflow" approved!

Sadly the travis build failed. I'm not sure why though. It can't find the bash script's path. I see that you changed the path to the wrapper once already. I guess that the problem is linked to this. I'll also test it locally and try to figure it out.

Again, I am very grateful for your help on this!

cgirardot commented 6 years ago

this is weird. I initially had the reference to the script as dropseq_wrapper='../scripts/drop-seq-tools-wrapper.sh' but this gave me the error we now see on failed build. I therefore changed it to dropseq_wrapper='scripts/drop-seq-tools-wrapper.sh' and the pipeline then worked like a charm (it is actually currently running) :-/

cgirardot commented 6 years ago

For example, I had :

Activating conda environment: /scratch/girardot/dropSeqPipe/dropSeqPipeTest/.snakemake/conda/8a6a88f2
/usr/bin/bash: ../scripts/drop-seq-tools-wrapper.sh: No such file or directory
    (...) skipped for clarity

After changing ../scripts/drop-seq-tools-wrapper.sh to scripts/drop-seq-tools-wrapper.sh :

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 100
Job counts:
    count   jobs
    1   create_intervals
    1   meta
    2

localrule create_intervals:
    input: /scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.dict, /scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.reduced.gtf
    output: /scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.rRNA.intervals
    jobid: 5

Activating conda environment: /scratch/girardot/dropSeqPipe/dropSeqPipeTest/.snakemake/conda/8a6a88f2
+ export '_JAVA_OPTIONS=-Djava.io.tmpdir=/tmp '
+ _JAVA_OPTIONS='-Djava.io.tmpdir=/tmp '
+ CreateIntervalsFiles REDUCED_GTF=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.reduced.gtf SEQUENCE_DICTIONARY=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.dict O=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen PREFIX=HUMAN_ZsGreen
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp 
[Thu Aug 02 14:46:27 CEST 2018] org.broadinstitute.dropseqrna.annotation.CreateIntervalsFiles SEQUENCE_DICTIONARY=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.dict REDUCED_GTF=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.reduced.gtf OUTPUT=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen PREFIX=HUMAN_ZsGreen    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Aug 02 14:46:27 CEST 2018] Executing as girardot@spinoza.embl.de on Linux 3.10.0-693.5.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Thu Aug 02 14:47:44 CEST 2018] org.broadinstitute.dropseqrna.annotation.CreateIntervalsFiles done. Elapsed time: 1.31 minutes.

I am a bit confused with the use of ../ or omitting it. I see it is always present in your plot rules where a R script located in the ''script'' dir is called (which is why I first thought I'd need it) or when rules refer to conda envs. But it is not present when defining input/output files . I am not such a snakemake expert; if you could enlighten me as when to use it or not, maybe I could fix the issue.

Best

Hoohm commented 6 years ago

The paths in snakemake can be a bit confusing. Inputs and outputs are always relative to the main Snakefile.

The other paths, such as shell/scripts calls are relative to the snakefile they are called from. Since all the rules are located in the rules folder, all the calls have to come up one folder, hence the ../ for the calls/envs.

I just tried your branch locally and I get the same error whether with or without the ../

On Thu, Aug 2, 2018 at 10:37 AM, cgirardot notifications@github.com wrote:

For example, I had :

Activating conda environment: /scratch/girardot/dropSeqPipe/dropSeqPipeTest/.snakemake/conda/8a6a88f2 /usr/bin/bash: ../scripts/drop-seq-tools-wrapper.sh: No such file or directory (...) skipped for clarity

After changing ../scripts/drop-seq-tools-wrapper.sh to scripts/drop-seq-tools-wrapper.sh :

Building DAG of jobs... Using shell: /usr/bin/bash Provided cluster nodes: 100 Job counts: count jobs 1 create_intervals 1 meta 2

localrule create_intervals: input: /scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.dict, /scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.reduced.gtf output: /scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.rRNA.intervals jobid: 5

Activating conda environment: /scratch/girardot/dropSeqPipe/dropSeqPipeTest/.snakemake/conda/8a6a88f2

  • export '_JAVA_OPTIONS=-Djava.io.tmpdir=/tmp '
  • _JAVA_OPTIONS='-Djava.io.tmpdir=/tmp '
  • CreateIntervalsFiles REDUCED_GTF=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.reduced.gtf SEQUENCE_DICTIONARY=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.dict O=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen PREFIX=HUMAN_ZsGreen Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp [Thu Aug 02 14:46:27 CEST 2018] org.broadinstitute.dropseqrna.annotation.CreateIntervalsFiles SEQUENCE_DICTIONARY=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.dict REDUCED_GTF=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen/HUMAN_ZsGreen.reduced.gtf OUTPUT=/scratch/girardot/dropSeqPipe/genome/Homo_sapiens.GRCh38.90.ZsGreen PREFIX=HUMAN_ZsGreen VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json [Thu Aug 02 14:46:27 CEST 2018] Executing as girardot@spinoza.embl.de on Linux 3.10.0-693.5.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033) [Thu Aug 02 14:47:44 CEST 2018] org.broadinstitute.dropseqrna.annotation.CreateIntervalsFiles done. Elapsed time: 1.31 minutes.

I am a bit confused with the use of ../ or omitting it. I see it is always present in your plot rules where a R script located in the ''script'' dir is called (which is why I first thought I'd need it) or when rules refer to conda envs. But it is not present when defining input/output files . I am not such a snakemake expert; if you could enlighten me as when to use it or not, maybe I could fix the issue.

Best

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Hoohm/dropSeqPipe/pull/41#issuecomment-409949058, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNXaO60AGy36oLk1T5q00VFcEmaX3kVks5uMw64gaJpZM4VsRD_ .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland

https://github.com/Hoohm https://github.com/Hoohm

Hoohm commented 6 years ago

Could you run the same thing but using -p? This would output the command line that is actually called. It might help

Hoohm commented 6 years ago

I think I understand what's happening. Using shell to run the script will run from the "working dir". This means that if you clone the repo, add the data to this same folder and run it without anything else, it will find the path properly.

If you use the --directory argument and point towards another folder containing only the data, it won't find the script because they are not there. (This is what's happening on travis-ci with the test data)

One "easy" fix would be to add the bash script to the environment's paths. I guess this could be done in the build.sh from the conda repo?

cgirardot commented 6 years ago

If I well understand, you are suggesting to pack your script in the dropseq-tools conda package. Correct? If so, I would not go down this path as the conda recipe uses dropseq tools as packaged from authors and one does not want to maintain another packaging process. I have another idea. I can get rid of your script altogether. This should be quite easy. I actually did this first but changed to the current solution. I 'll do this tomorrow and submit another PR.

Hoohm commented 6 years ago

I understand what you mean. The only reason for the bash wrapper was to pass down memory and temp paths. As long as this can be done differently, the wrapper is not required.

cgirardot commented 6 years ago

memory adjustment is already supported by the dropseq wrappers and tmp path can be injected using the _JAVA_OPTION. It should be straightforward. Another approach is to use absolute path to your wrapper script if snakemake defines some variable but I am not snakemake fluent enough. 

Hoohm commented 6 years ago

It crashed but it's because of a conflict between some R packages. I'm fixing it right now.

cgirardot commented 6 years ago

my pipeline now successfully completed using this code. I did not see the crash. I actually did not know every commit would be picked up directly !

Hoohm commented 6 years ago

You would not have encountered the conflict because the env for the plotting didn't change in the envs/plots.yaml file. Here it crashed because of some update on the conda package side. I haven't fixed all the packages versions yet. That's on me.

You didn't rerun the meta_generation, so you missed a few temp parameters. I've fixed those as well.

cgirardot commented 6 years ago

is there a way to run the pipeline with test data locally ? It is indeed taking ages the way I do it and I therefore avoid re-running it all. I also did not realise in the first place that my commits would be picked up directly : I develop on my laptop and push so I can pull from the server and run the tests (but in the meantime travis seems to wake up). Sorry I need to understand the PR lifecycle better.

Hoohm commented 6 years ago

You can definitely use the test data from this https://github.com/Hoohm/scngs-test-data/tree/fa6f142da9da740b68f27fceb2dde942ef43d696 git repo to test.

I'm gonna update this repo as well since I don't need the wrapper anymore.

The test data is on a git submodule in the .test folder.

Hoohm commented 6 years ago

The merged version passed also! Thanks a lot for that. This will help a lot in the future. If you let me know if you have any other general comments about dropSeqPipe that you would want to talk about.

cgirardot commented 6 years ago

thanks! It is a great pipeline!

On 3. Aug 2018, at 19:31, Patrick Roelli notifications@github.com wrote:

The merged version passed also! Thanks a lot for that. This will help a lot in the future. If you let me know if you have any other general comments about dropSeqPipe that you would want to talk about.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

===================================== Charles Girardot Head of Genome Biology Computational Support (GBCS) and Senior Bioinformatician in the Furlong Lab European Molecular Biology Laboratory Tel: +49 6221 387 -8585 Fax: +49-(0)6221-387-8166 Email: charles.girardot@embl.de Skype: charles_girardot Web : http://gbcs.embl.de Room V205 Meyerhofstraße 1, 69117 Heidelberg, Germany

Hoohm commented 6 years ago

Did you take the time to announce the conda package availability to James Nemesh?

If not, I think it would be a good idea. I can do it if you wish. There is a dropseq mailing list as well that might be interested.

cgirardot commented 6 years ago

no I haven’t. This is indeed a good idea. I can t find James Nemesh email on the Mc Carroll lab. If you could let him know and also the ML you r mentioning, this woudl be great. Thx

On 6. Aug 2018, at 21:37, Patrick Roelli notifications@github.com wrote:

Did you take the time to announce the conda package availability toJames Nemesh?

If not, I think it would be a good idea. I can do it if you wish. There is a dropseq mailing list as well that might be interested.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

===================================== Charles Girardot Head of Genome Biology Computational Support (GBCS) and Senior Bioinformatician in the Furlong Lab European Molecular Biology Laboratory Tel: +49 6221 387 -8585 Fax: +49-(0)6221-387-8166 Email: charles.girardot@embl.de Skype: charles_girardot Web : http://gbcs.embl.de Room V205 Meyerhofstraße 1, 69117 Heidelberg, Germany