dieterich-lab / rp-bp

Rp-Bp is a Bayesian approach to predict, at base-pair resolution, ribosome occupancy and translation.
MIT License
7 stars 5 forks source link

STAR bam symlink problems #51

Closed mhassan closed 7 years ago

mhassan commented 7 years ago

While running the process-all-samples pipeline, STAR alignment completes successfully and creates the symlink as expected. However, the immediate steps thereafter can not recognize the bam file pointing to the symlink. How do I make this work?

bmmalone commented 7 years ago

Hi,

Thanks for checking out our software. Could you please run it in debug mode, and upload the log file:

process-all-samples c-elegans-test.yaml --overwrite --num-cpus 2 --logging-level DEBUG --log-file log.txt

Hopefully, the log file (log.txt) will shed some light on what's going on. It is also possible that something is printed to the screen. If you see anything that looks problematic, please also post that. Thanks.

Have a good day, Brandon

mhassan commented 7 years ago

Hi Brandon, I tried this and this is the valid error msg at which point the program aborts. The problem is with the symlink file "extracellular-rep-2.test.bam" even though its created, it's not detected in the next step. Musa Feb 13 21:13:12 ..... started STAR run Feb 13 21:13:12 ..... loading genome Feb 13 21:13:54 ..... processing annotations GTF Feb 13 21:13:55 ..... inserting junctions into the genome indices Feb 13 21:14:24 ..... started mapping Feb 13 21:27:10 ..... started sorting BAM Feb 13 21:27:14 ..... finished successfully DEBUG misc.utils 2017-02-13 21:27:16,005 : Checking file for validity: rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam INFO root 2017-02-13 21:27:16,006 : samtools view -h rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam > /dev/null INFO root 2017-02-13 21:27:16,006 : calling [E::hts_open_format] fail to open file 'rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam' samtools view: failed to open "rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam" for reading: No such file or directory WARNING misc.shell_utils 2017-02-13 21:27:16,016 : The command returned a non-zero return code samtools view -h rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam > /dev/null Return code: 1 CRITICAL misc.utils 2017-02-13 21:27:16,017 : The bam/sam file does not appear to be valid: rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam WARNING misc.utils 2017-02-13 21:27:16,017 : Rename invalid file: rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam to rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam.invalid Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.4/bin/create-base-genome-profile", line 11, in load_entry_point('rpbp', 'console_scripts', 'create-base-genome-profile')() File "/Applications/rp-bp/rpbp/orf_profile_construction/create_base_genome_profile.py", line 183, in main file_checkers=file_checkers, overwrite=args.overwrite, call=call) File "/Applications/rp-bp/src/misc/misc/utils.py", line 703, in call_if_not_exists os.rename(filename, invalid_filename) FileNotFoundError: [Errno 2] No such file or directory: 'rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam' -> 'rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam.invalid' Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.4/bin/create-orf-profiles", line 11, in load_entry_point('rpbp', 'console_scripts', 'create-orf-profiles')() File "/Applications/rp-bp/rpbp/orf_profile_construction/create_orf_profiles.py", line 140, in main overwrite=args.overwrite, call=True) File "/Applications/rp-bp/src/misc/misc/utils.py", line 681, in call_if_not_exists ret_code = check_call(cmd, call=call, raise_on_error=raise_on_error) File "/Applications/rp-bp/src/misc/misc/utils.py", line 539, in check_call return check_call_step(cmd, call=call, raise_on_error=raise_on_error) File "/Applications/rp-bp/src/misc/misc/utils.py", line 523, in check_call_step raise subprocess.CalledProcessError(ret_code, cmd) subprocess.CalledProcessError: Command 'create-base-genome-profile Simon/Ribosome_profile/Digested_intracellular_R1.fq c-elegans-test.yaml extracellular-rep-2 --num-cpus 2 --logging-level DEBUG --stdout-logging-level NOTSET --log-file log.txt --stderr-logging-level NOTSET --file-logging-level NOTSET --star-executable STAR ' returned non-zero exit status 1 Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.4/bin/run-rpbp-pipeline", line 11, in load_entry_point('rpbp', 'console_scripts', 'run-rpbp-pipeline')() File "/Applications/rp-bp/rpbp/run_rpbp_pipeline.py", line 129, in main utils.check_call(cmd) File "/Applications/rp-bp/src/misc/misc/utils.py", line 539, in check_call return check_call_step(cmd, call=call, raise_on_error=raise_on_error) File "/Applications/rp-bp/src/misc/misc/utils.py", line 523, in check_call_step raise subprocess.CalledProcessError(ret_code, cmd) subprocess.CalledProcessError: Command 'create-orf-profiles Simon/Ribosome_profile/Digested_intracellular_R1.fq c-elegans-test.yaml extracellular-rep-2 --num-cpus 2 --logging-level DEBUG --stdout-logging-level NOTSET --file-logging-level NOTSET --log-file log.txt --stderr-logging-level NOTSET --star-executable STAR ' returned non-zero exit status 1 Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.4/bin/process-all-samples", line 11, in load_entry_point('rpbp', 'console_scripts', 'process-all-samples')() File "/Applications/rp-bp/rpbp/process_all_samples.py", line 145, in main job_id = slurm.check_sbatch(cmd, args=args) File "/Applications/rp-bp/src/misc/misc/slurm.py", line 182, in check_sbatch shell_utils.check_call(cmd, call=call) File "/Applications/rp-bp/src/misc/misc/shell_utils.py", line 89, in check_call return check_call_step(cmd, call=call, raise_on_error=raise_on_error) File "/Applications/rp-bp/src/misc/misc/shell_utils.py", line 72, in check_call_step raise subprocess.CalledProcessError(ret_code, cmd) subprocess.CalledProcessError: Command 'run-rpbp-pipeline Simon/Ribosome_profile/Digested_intracellular_R1.fq c-elegans-test.yaml extracellular-rep-2 --num-cpus 2 --file-logging-level NOTSET --logging-level DEBUG --stderr-logging-level NOTSET --log-file log.txt --stdout-logging-level NOTSET --star-executable STAR ' returned non-zero exit status 1

bmmalone commented 7 years ago

Hi Musa,

You are right; It seems that the symlink is not created correctly to rpbp_out/without-rrna-mapping/extracellular-rep-2.test.bam. Could you please check if the standard STAR output files are there? In particular, there should be "rpbp_out/without-rrna-mapping/extracellular-rep-2.testAligned.sortedByCoord.out.bam".

===

Actually, it looks like you are running on OSX. Previously (#35), we found a problem where the "--readFilesCommand" option for STAR needs to be adjusted from ubuntu to OSX. You can read through that issue for details, but, basically, the relevant command is "zcat" in ubuntu and "gzcat" in OSX.

You could try creating a symlink from gzcat to zcat and placing it in $PATH, but I'm not sure how well that would work.

The dev branch allows you to specify it at the command line. You can switch to dev (and re-install just to make sure the updated programs are used) using these commands in the rp-bp main folder (with setup.py):

git checkout dev
pip3 install .

You can then run the pipeline and specify the --star-read-files-command for the correct "read a gzipped file" program; however, it will attempt to guess the name based on the operating system, and the guessing seemed to work for OSX.

If you do this, you will also need to re-run the prepare genomes program; however, in the dev branch, it is now called prepare-rpbp-genome, and the main pipeline script is run-all-rpbp-instances, so the commands should be (with paths updated based on your local filenames):

prepare-rpbp-genome WBcel235.79.chrI.yaml --num-cpus 2 --mem 4G --overwrite --logging-level INFO
run-all-rpbp-instances c-elegans-test.yaml --overwrite --num-cpus 2 --logging-level INFO --merge-replicates --star-read-files-command gzcat --keep-intermediate-files

Could you please give this a go and let me know how it works? we are merging dev into master very soon (hopefully this week), so the documentation, etc., will be updated to correctly reflect these changes.

Have a good day, Brandon

mhassan commented 7 years ago

Hi Brandon, Thanks for the update. I already sorted out the OSX gzcat ->zcat problem and the STAR sorted bam output is there. I'll try somethings and let you know if I find a solution. Regards, Musa

bmmalone commented 7 years ago

Hi Musa,

Okay; after thinking a bit more, there are some other things that should be pretty easy to check...

In the output above, it looks like the raw data file (Simon/Ribosome_profile/Digested_intracellular_R1.fq) is not gzipped. I don't think we ever tested this. Do the file sizes of the other intermediate files (e.g., rpbp_out/without-adapters, rpbp_out/without-rrna) look reasonable?

Does the pipeline work with the example c-elegans data?

Also, could you try specifying the full paths in the config file?

Please just let me know how these look. I will also try running with relative paths and unzipped files and see how that works on our (debian-based) cluster.

Have a good day, Brandon

mhassan commented 7 years ago

Hi Brandon, Thanks for your response and suggestions. I have thus far switched to the 'dev' and re-installed rpbp and the 'prepare-rpbp-genome' runs fine. The STAR command runs fine and generate all the BAM files as expected, plus the symlink. However, something is still wrong with the symlink. I have tried to perform independent commands (unrelated to rpbp) on the symlink, just to see if the problem is within rpbp or with the symlink and am afraid it's with the symlink. It's like it doesn't exist. By the way all these are based on the sample data. I am trying now to manually create the symlink and see if that works. Will let you know. Musa

bmmalone commented 7 years ago

Hi Musa,

I was able to (possibly) reproduce the error. I used a relative path in the config file:

riboseq_data: rpbp_out

Then, I get very similar-looking error messages:

INFO     root     2017-02-14 12:02:07,135 : calling
Feb 14 12:02:07 ..... started STAR run
Feb 14 12:02:07 ..... loading genome
Feb 14 12:02:10 ..... processing annotations GTF
Feb 14 12:02:10 ..... inserting junctions into the genome indices
Feb 14 12:02:35 ..... started mapping
Feb 14 12:02:38 ..... started sorting BAM
Feb 14 12:02:38 ..... finished successfully
INFO     root     2017-02-14 12:02:38,657 : samtools view -h rpbp_out/without-rrna-mapping/c-elegans-rep-1.testAligned.sortedByCoord.out.bam > /dev/null
INFO     root     2017-02-14 12:02:38,657 : calling
INFO     root     2017-02-14 12:02:38,689 : samtools index -b rpbp_out/without-rrna-mapping/c-elegans-rep-1.test.bam
INFO     root     2017-02-14 12:02:38,689 : calling
[E::hts_open_format] fail to open file 'rpbp_out/without-rrna-mapping/c-elegans-rep-1.test.bam'
samtools index: failed to open "rpbp_out/without-rrna-mapping/c-elegans-rep-1.test.bam": No such file or directory

The problem is that the symlink that is created to the STAR alignment file uses this as the target, so the link ends up being broken. Does this look like what is happening in your case? If so, could you please add the full paths to the config file and see how things go? Thanks.

If this is the problem, then I will update the code to handle this case.

Have a good day, Brandon

mhassan commented 7 years ago

Hi Brandon, That solved the symlink problem. Unfortunately, I have raised another error msg when executing run-all-rpbp-instances. Here is the relevant part of the msg

INFO root 2017-02-14 11:53:28,597 : estimate-metagene-profile-bayes-factors /Applications/rp-bp/c-elegans-chrI-example/Test/metagene-profiles/c-elegans-rep-1.test-unique.metagene-profile.csv.gz /Applications/rp-bp/c-elegans-chrI-example/Test/metagene-profiles/c-elegans-rep-1.test-unique.metagene-periodicity-bayes-factors.csv.gz --num-cpus 2 --periodic-models --nonperiodic-models --periodic-offset-start -20 --periodic-offset-end 0 --metagene-profile-length 21 --seed 8675309 --chains 2 --iterations 500 --file-logging-level NOTSET --logging-level INFO --stderr-logging-level NOTSET --stdout-logging-level NOTSET INFO root 2017-02-14 11:53:28,598 : calling usage: estimate-metagene-profile-bayes-factors [-h] [--periodic-models PERIODIC_MODELS [PERIODIC_MODELS ...]] [--nonperiodic-models NONPERIODIC_MODELS [NONPERIODIC_MODELS ...]] [--periodic-offset-start PERIODIC_OFFSET_START] [--periodic-offset-end PERIODIC_OFFSET_END] [--metagene-profile-length METAGENE_PROFILE_LENGTH] [-s SEED] [-c CHAINS] [-i ITERATIONS] [-p NUM_CPUS] [--type-field TYPE_FIELD] [--count-field COUNT_FIELD] [--position-field POSITION_FIELD] [--log-file LOG_FILE] [--log-stdout] [--no-log-stderr] [--logging-level {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--file-logging-level {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--stdout-logging-level {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--stderr-logging-level {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}] metagene_profiles out estimate-metagene-profile-bayes-factors: error: argument --periodic-models: expected at least one argument

I suspect this has something to do with the "models folder" Musa

bmmalone commented 7 years ago

Hi Musa,

I'm glad to hear that fixed the symlink problem. I believe the simplest fix will be for the symlink to just point to the STAR file (without any sort of path prepended). I will try that in the next day or two, but in the meantime, just sticking with full paths will be safest.

The new issue is definitely a problem related to the models folder. This also came up before (#36); it should be fixed in the dev branch, but this seems to be more fragile than I'd hoped. Can you please check if there is anything in the folder ~/Library/Application\ Support/rpbp/rpbp_models/? There should be folders like periodic, etc., and then pkl files within those folders.

If those are empty/not present, the easiest is to just reinstall (on the dev branch) using:

pip3 install --verbose --log log.rpbp-install.txt 

In particular, towards the end of that, there should be some lines like:

INFO : pickle-stan rpbp_models/nonperiodic/no-periodicity.stan '~/Library/Application\ Support/rpbp/rpbp_models/nonperiodic/no-periodicity.pkl'

Could you please let me know if those folders are there, and if not, how the installation goes? In particular, the pickle-stan lines show exactly where the models are going.

Have a good day, Brandon

bmmalone commented 7 years ago

Hi Musa,

I just wanted to let you know I've updated the master branch to include the various changes for handling paths, etc. They symlink issue is still open, though.

Once you get a chance, please let me know if the models appear in the correct place. Thanks.

Have a good day, Brandon

mhassan commented 7 years ago

Hi Brandon, All the issues with symlink and models are now resolved with this new update and everything runs fine with the example data. I still get a syntax error though when I get to "estimate_orf_bayes-factors.py". I've looked at the offending line (353) in the script, but before I hack it, I'd like to know your intentions here. Below is the error msg:

File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/rpbp/translation_prediction/estimate_orf_bayes_factors.py", line 353 (group, *args) for name, group in groups ^ SyntaxError: can use starred expression only as assignment target Thanks, Musa

bmmalone commented 7 years ago

Hi Musa,

I can reproduce this error; it is due to a difference between python3.4 and 3.5. I am now looking into how to fix this.

Have a good day, Brandon

P.S. The notation essentially means to create a tuple that has "group" and everything in "args" in it, so the size of the tuple would be (1+len(args)).

bmmalone commented 7 years ago

Hi Musa,

I believe I have fixed the bit of code related to the problem; it was actually dead code, and a more robust implementation was available in our misc library. I just ran through the example using python 3.4.3 in a virtual environment, and everything seems to work. Could you please give it a try? I bumped the version, so you should just need to re-install with "--upgrade", but the changes are just in the dev branch:

git checkout dev
git pull
pip3 install --verbose --upgrade -r requirements.txt

should pull the necessary changes. If everything is updated, then this command:

 python3 -c "import rpbp; print(rpbp.__version__)"

should print 1.1.2.

Please let me know if that addresses things. If so, I'll merge the changes into master. Thanks for your help.

Have a good day, Brandon

mhassan commented 7 years ago

Hi Brandon, That worked perfectly without a single hitch. Thanks for solving this.

I kept seeing this msg printed out, though I have no idea if it is anything important:

"Informational Message: The current Metropolis proposal is about to be rejected because of the following issue: Exception thrown at line 65: cauchy_log: Scale parameter is 0, but must be > 0! If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine, but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."

On a different note: Apart from the usual e.g paths, files etc, is there a file/script that is sample specific and should be modified for different samples i.e. not c.elegans? I need to run my own samples now that everything is running fine. Regards, Musa

bmmalone commented 7 years ago

Hi Musa,

Great! I'm glad everything is going smoothly for the example now.


The "Informational Message" bit is directly output by pystan (Issue #10). The related issue from the Stan repo suggests this is fixed on their end, but despite some amount of trying, I have not been able to suppress it in rpbp. I also tried setting up pipes to /dev/null and several other things. If you have some free time, pull requests are definitely welcome :)

It is nothing to worry about, though; while the message is printed many, many times, it is generally only printed once or twice for each ORF (or metagene profile, depending on which part of the pipeline is running).


For running your own samples, only the two config files need to be updated (replaced, etc.). There is a bootstrap-ribo-analysis script which is installed which can automatically generate the sample-related parts of the "experiment" config file (c-elegans-test.yaml in the example) from a relatively normal sample sheet (csv or excel). It is not documented, yet, but the script does include a (hopefully) informative description with bootstrap-ribo-analysis --help. It also creates some symlinks and things.

Depending on what other data and pipelines you have available, you may also be interested in augmenting standard (ensembl, etc.) annotations with a de novo assembly. I just updated the docs to describe how this works in more detail.

There are also a large number of analysis scripts available in the rp-bp/rpbp/analysis folder. I plan to document (or at least list) those in the next few days, as well. For the most part, --help gives at least a semi-informative message about what the scripts do.


I will merge this change into master in the morning and close this issue. If you run into any other issues, please let us know. In particular, parsing annotations is always fragile, so please just post an issue if anything seems amiss. Thanks.

Have a good day, Brandon