cgat-developers / cgat-flow

cgat-flow repository
MIT License

Update runFeatureCounts in PipelineRnaseq.py #20

Closed sebastian-luna-valero closed 6 years ago

sebastian-luna-valero commented 6 years ago

Pipelines using cgat-core can now rely on TMPDIR being correctly set per job so it can be directly used to work with temporary folders. xref: https://github.com/cgat-developers/cgat-core/pull/30

Ping @nickilott: this PR adds the changes discussed in https://github.com/CGATOxford/CGATPipelines/pull/418 . I would be grateful if you could have a look. Would it be possible to run a test on your end to verify that it works as expected?

nickilott commented 6 years ago

Hi @sebastian-luna-valero, I will try and have a look at this tomorrow and get back to you

sebastian-luna-valero commented 6 years ago

Thanks Nick.

Please let me know if you need help installing the new code.

nickilott commented 6 years ago

Hi Sebastian,

What's the best way to install the new code without interfering with my current CGATOxford installation?

Thanks

Nick

sebastian-luna-valero commented 6 years ago

Hi Nick,

Sure. First of all, clean up your environment:

# deactivate conda and check with "which conda"
source deactivate

# get rid of loaded modules and check with "module list"
module purge

# empty PYTHONPATH and check with "printenv PYTHONPATH"
unset PYTHONPATH

I plan to automate the checks above within the installer at some point.
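
The three manual checks above could be automated roughly as follows. This is only a sketch, not the installer's actual code: the function name `find_environment_problems` is hypothetical, and it assumes that an active conda environment exports `CONDA_PREFIX` and that Environment Modules records loaded modules in `LOADEDMODULES`.

```python
import os


def find_environment_problems(env=None):
    """Return a list of reasons why the environment is not clean.

    Hypothetical helper; 'env' defaults to the current process environment.
    """
    env = os.environ if env is None else env
    problems = []
    # Assumption: an active conda environment exports CONDA_PREFIX.
    if env.get("CONDA_PREFIX"):
        problems.append("conda environment still active: " + env["CONDA_PREFIX"])
    # Assumption: Environment Modules lists loaded modules in LOADEDMODULES.
    if env.get("LOADEDMODULES"):
        problems.append("modules still loaded: " + env["LOADEDMODULES"])
    # PYTHONPATH should be unset or empty before installing.
    if env.get("PYTHONPATH"):
        problems.append("PYTHONPATH is set: " + env["PYTHONPATH"])
    return problems


if __name__ == "__main__":
    for problem in find_environment_problems():
        print("WARNING:", problem)
```

An empty list means none of the three conditions was detected.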

Then run the installer:

# get the installer
curl -O https://raw.githubusercontent.com/cgat-developers/cgat-flow/master/install-CGAT-tools.sh

# install everything
bash install-CGAT-tools.sh --devel --git-ssh --location <your-path>/cgat-developers-v0

Please make sure that you have at least 15GB of disk available in <your-path>
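
The disk space requirement can be checked before starting the installer. A minimal sketch, assuming Python 3 is available; the helper name `has_enough_space` is made up for illustration and the 15 GB threshold is simply the figure quoted above:

```python
import shutil

REQUIRED_GB = 15  # figure quoted above


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding 'path' has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    # "." stands in for <your-path>; replace it with the real location.
    print("enough space:", has_enough_space("."))
```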

If everything goes smoothly, you should get:

 The code successfully installed!

 To activate the CGAT environment type: 
 $ source <your-path>/cgat-developers-v0/conda-install/etc/profile.d/conda.sh
 $ conda activate base
 $ conda activate cgat-f

 To deactivate the environment, use:
 $ conda deactivate

If that's the case, then please go to this branch by doing:

# go to repo
cd <your-path>/cgat-flow

# checkout branch
git checkout --track origin/migration-418

Finally, please run the pipeline with:

cgatflow rnaseqdiffexpression make full

Best regards, Sebastian

nickilott commented 6 years ago

Hi Sebastian,

I got the following error:

An error occurred in:

Do I have to specify --location . if installing into the current directory?

Thanks!

Nick

sebastian-luna-valero commented 6 years ago

Hi Nick,

I recommend using --location /full/path/to/installation/folder

Also, the installer should have printed out a list of environment variables at the bottom after Debugging. If so, could you please paste those as well?

Best regards, Sebastian

nickilott commented 6 years ago

Will give it a go with the full path. I had to move my computer, so I don't have a copy of the environment variables. I will paste them if it fails this time.

Thanks

Nick

nickilott commented 6 years ago

Hi Sebastian,

The installation failed with the following:

##########################################################

An error occurred in:

Thanks for the help

Nick

nickilott commented 6 years ago

Hi Sebastian,

This was also in the output:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://repo.anaconda.com/pkgs/main/linux-64/mkl-2018.0.3-1.tar.bz2 Elapsed: -

An HTTP error occurred when trying to retrieve this URL. HTTP errors are often intermittent, and a simple retry will get you on your way.

sebastian-luna-valero commented 6 years ago

Thanks, Nick.

Please see:

 HTTP errors are often intermittent, and a simple retry will get you on your way.

That's annoying, but the only solution is to remove the folder /gfs/devel/nilott/cgat-developers-v0 and run the installer again.
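
The remove-and-rerun cycle can be wrapped in a small retry loop. This is a sketch under assumptions, not part of the installer: `run_with_clean_retries` and its `install` callable are hypothetical names, and the number of attempts is arbitrary.

```python
import shutil


def run_with_clean_retries(install, location, attempts=3):
    """Call install(location); on failure remove 'location' and try again.

    'install' is any callable that raises an exception when the
    installation fails (hypothetical interface).
    """
    for attempt in range(1, attempts + 1):
        try:
            return install(location)
        except Exception as error:
            # Remove the partial installation so the next attempt starts clean.
            shutil.rmtree(location, ignore_errors=True)
            if attempt == attempts:
                raise
            print("attempt %d failed (%s); retrying" % (attempt, error))
```

Here `install` could, for example, wrap `subprocess.run(["bash", "install-CGAT-tools.sh", "--devel", "--git-ssh", "--location", location], check=True)`, which raises if the installer exits with a non-zero status.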

nickilott commented 6 years ago

Hi Sebastian,

So the code installed! I have attempted to run: cgatflow rnaseqdiffexpression make full

with the following error:

Task = def pipeline_rnaseqdiffexpression.runFeatureCounts(...):
Job = [[biopsy-HEALTHY-R1.bam, geneset_all.gtf.gz] -> [featurecounts.dir/biopsy-HEALTHY-R1/transcripts.tsv.gz, featurecounts.dir/biopsy-HEALTHY-R1/genes.tsv.gz]]

Traceback (most recent call last):
  File "/gfs/devel/nilott/cgat-developers-v0/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/gfs/devel/nilott/cgat-developers-v0/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/gfs/devel/nilott/cgat-developers-v0/cgat-flow/CGATPipelines/pipeline_rnaseqdiffexpression.py", line 690, in runFeatureCounts
    Quantifier.run_all()
  File "/gfs/devel/nilott/cgat-developers-v0/cgat-flow/CGATPipelines/PipelineRnaseq.py", line 241, in run_all
    self.run_transcript()
  File "/gfs/devel/nilott/cgat-developers-v0/cgat-flow/CGATPipelines/PipelineRnaseq.py", line 314, in run_transcript
    self.run_featurecounts(level="transcript_id")
  File "/gfs/devel/nilott/cgat-developers-v0/cgat-flow/CGATPipelines/PipelineRnaseq.py", line 306, in run_featurecounts
    P.run(statement)
  File "/gfs/devel/nilott/cgat-developers-v0/cgat-core/CGATCore/Pipeline/Execution.py", line 1328, in run
    benchmark_data = r.run(statement_list)
  File "/gfs/devel/nilott/cgat-developers-v0/cgat-core/CGATCore/Pipeline/Execution.py", line 932, in run
    job_path)
  File "/gfs/devel/nilott/cgat-developers-v0/cgat-core/CGATCore/Pipeline/Execution.py", line 859, in collect_single_job_from_cluster
    job_id, retval.exitStatus, "".join(stderr), statement))
OSError: ---------------------------------------
Job 1043406 exited with error code 1:
The stderr was:
/etc/bashrc: line 12: PS1: unbound variable
/gfs/work/nilott/proj018/analysis/biopsies/test_cgatflow/ctmpr11n_p1a.sh: line 20: /scratch/slurm_1043406/tmp.omfAohZKRg/geneset.gtf: Not a directory

zcat geneset_all.gtf.gz > $TMPDIR/geneset.gtf; featureCounts -Q 10 -T 4 -s 0 -a $TMPDIR/geneset.gtf -o featurecounts.dir/biopsy-HEALTHY-R1/transcripts.tsv.raw -g transcript_id biopsy-HEALTHY-R1.bam >& featurecounts.dir/biopsy-HEALTHY-R1/transcripts.tsv.gz.log; gzip -f featurecounts.dir/biopsy-HEALTHY-R1/transcripts.tsv.raw
-----------------------------------------

nickilott commented 6 years ago

Hi Sebastian, from what I gather (and correct me if I'm wrong), the way the code creates $TMPDIR is:

TMPDIR=`mktemp -p $SCRATCHDIR`
export TMPDIR

Since mktemp without -d creates a temporary file rather than a directory, that file is enough in the run_featurecounts function to hold the temporary geneset, i.e. the statement should be changed from:

zcat genesetfile.gtf.gz > $TMPDIR/geneset.gtf; featurecounts ...

to

zcat genesetfile.gtf.gz > $TMPDIR; featurecounts ...

Removing the /geneset.gtf from the statement gets rid of the error and the output is as expected (although I do not have a test set for paired-end data and suspect the creation of bam_tmp will suffer from the same issue). I have to admit to finding the naming "TMPDIR" a little confusing in the code if indeed it is creating a file in SCRATCHDIR...

Does that all make sense?
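
The "Not a directory" failure can be reproduced outside the pipeline. The sketch below is an illustration in Python rather than the pipeline's shell statement: `tempfile.mkstemp` plays the role of `mktemp` without `-d` (it creates a file) and `tempfile.mkdtemp` the role of `mktemp -d` (it creates a directory). The helper `write_geneset` is a hypothetical stand-in for the `zcat ... > $TMPDIR/geneset.gtf` step, and the behaviour described assumes Linux.

```python
import os
import shutil
import tempfile


def write_geneset(tmp_path):
    """Try to create geneset.gtf below tmp_path; return the file path."""
    target = os.path.join(tmp_path, "geneset.gtf")
    with open(target, "w") as handle:
        handle.write("# placeholder GTF content\n")
    return target


# mkstemp creates a regular file, like "mktemp" without -d.
fd, tmp_file = tempfile.mkstemp()
os.close(fd)
try:
    write_geneset(tmp_file)
    file_case = "ok"
except NotADirectoryError:
    file_case = "Not a directory"
finally:
    os.remove(tmp_file)

# mkdtemp creates a directory, like "mktemp -d".
tmp_dir = tempfile.mkdtemp()
dir_case = "ok" if os.path.isfile(write_geneset(tmp_dir)) else "failed"
shutil.rmtree(tmp_dir)

print("file as TMPDIR:", file_case)
print("directory as TMPDIR:", dir_case)
```

Writing below a path that is a regular file fails, whereas the same write below a real directory succeeds, which is why `$TMPDIR/geneset.gtf` only works when `$TMPDIR` is a directory.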

nickilott commented 6 years ago

P.S. Sorry it's taken so long to get around to this.

sebastian-luna-valero commented 6 years ago

Hi Nick,

Many thanks for your help with this issue!

I think you need a special configuration for this to work properly on your cluster. Please see: https://github.com/cgat-developers/cgat-core/pull/30

@snsansom might be able to help you to get the cluster tmpdir correctly configured for your cluster.

If so, could you please share the relevant section of a working pipeline.yml so I can add it to the docs?

Best regards, Sebastian

nickilott commented 6 years ago

Thanks Sebastian, I will talk to Steve about this

Nick

nickilott commented 6 years ago

Hi Sebastian,

Steve and I found a bug when cluster_tmpdir is set. I have modified the code. Shall I submit a PR on this branch?

We have a global configuration file that sets cluster_tmpdir in /etc/cgat/pipeline.yml

cluster:
    tempdir: $SCRATCH_DIR

The pipeline runs without error now at my end.

Nick

sebastian-luna-valero commented 6 years ago

Thanks, Nick and Steve.

Nick, I think we could use this PR, so you just need to push the changes to this branch and I will test it at my end as well.

Best regards, Sebastian

nickilott commented 6 years ago

Actually, Sebastian, the bug is in cgat-core and not cgat-flow. Should I create a new branch and push the changes to that?

sebastian-luna-valero commented 6 years ago

Sure, please do!

nickilott commented 6 years ago

Also, the configuration should have been:

cluster:
    tmpdir: $SCRATCH_DIR

sorry!
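
To illustrate what the setting is meant to achieve, here is a sketch of turning a `cluster: tmpdir:` value into a per-job temporary directory. It is not how cgat-core resolves the option (in practice the variable may only be expanded on the compute node); `make_job_tmpdir` is a hypothetical helper and `config` is just the YAML above loaded as a nested dict.

```python
import os
import tempfile


def make_job_tmpdir(config, env=None):
    """Create a per-job temporary directory below the configured location.

    Hypothetical helper: 'config' mirrors the YAML above as a nested dict.
    """
    env = os.environ if env is None else env
    setting = config.get("cluster", {}).get("tmpdir", "")
    # Resolve a "$NAME" reference from the given environment.
    if setting.startswith("$"):
        base = env.get(setting[1:], "")
    else:
        base = setting
    if not base or not os.path.isdir(base):
        raise ValueError("cluster tmpdir is not a directory: %r" % setting)
    # mkdtemp always creates a directory, so "$TMPDIR/geneset.gtf" is a valid path.
    return tempfile.mkdtemp(dir=base)
```

For example, `make_job_tmpdir({"cluster": {"tmpdir": "$SCRATCH_DIR"}})` creates a fresh directory under whatever `SCRATCH_DIR` points to and raises `ValueError` if that variable is unset or does not name a directory.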

sebastian-luna-valero commented 6 years ago

Thanks, Nick.

Could you please confirm which of the following is working well for you after the bug fix in cgat-core:

zcat genesetfile.gtf.gz > $TMPDIR/geneset.gtf; featurecounts ...

or

zcat genesetfile.gtf.gz > $TMPDIR; featurecounts ...

Best regards, Sebastian

nickilott commented 6 years ago

Hi Sebastian,

zcat genesetfile.gtf.gz > $TMPDIR/geneset.gtf; featurecounts ...

is working now, so it should be the same as for the CGAT setup.

sebastian-luna-valero commented 6 years ago

That's great!

Thank you very much!