Acribbs / tRNAnalysis

tRNA analysis workflow
MIT License

pipeline.yml: what is "cribbslab" variable? #11

Closed slobentanzer closed 5 years ago

slobentanzer commented 5 years ago

Hi Adam, I tried running your tutorial. At the actual run I got ValueError: pipeline failed with 0 errors. I think it might be because of this exception:

python: can't open file '/ifs/projects/adam/cribbslab//python/tRNAscan2bed12.py': [Errno 2] No such file or directory

There might be other things going on, so I am attaching the console output. However, my question is: what folder should the "cribbslab" variable point to? It is not self-explanatory...

Kind regards, Sebastian

Acribbs commented 5 years ago

Hi sebastian,

Please can you re-download the tutorial? I modified it this morning because I realised I had left hard-coded paths in the pipeline.yml.

In relation to your specific problem, that looks to be a bug that I hadn't spotted and can be patched in the following way:

git clone https://github.com/Acribbs/tRNAnalysis.git
# Change the cribbs lab variable in the pipeline.yml file from
cribbslab: '/ifs/projects/adam/cribbslab/'
# to
cribbslab: '<location to tRNAnalysis repo that was cloned>/tRNAnalysis/trnanalysis/'
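One way to confirm the new value is right is to check that the script named in the error message exists underneath it. This is only a sketch: the helper name script_exists is made up for illustration, and the python/tRNAscan2bed12.py sub-path is taken from the error message above, not from the pipeline source.

```python
import os


def script_exists(cribbslab, script="tRNAscan2bed12.py"):
    """Return True if <cribbslab>/python/<script> is an existing file.

    The "python" sub-directory is an assumption read off the failing path
    in the error message ('.../cribbslab//python/tRNAscan2bed12.py').
    """
    path = os.path.join(cribbslab, "python", script)
    return os.path.isfile(path)
```

Calling script_exists() with the value you put in pipeline.yml should return True once the variable points at the trnanalysis directory of the cloned repo.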

I will fix this in the next release. I have trnanalysis 0.1.3 ready to be merged into bioconda-recipes (https://github.com/bioconda/bioconda-recipes/pull/15762), but will make this fix first.

Thanks, Adam

slobentanzer commented 5 years ago

hi adam, i downloaded the tutorial just now, it seems to be the old one. this one: https://www.cgat.org/downloads/public/adam/trnanalysis/test_trna.tar.gz

is there another one?

kind regards, sebastian

Acribbs commented 5 years ago

Yes, that's the correct one. I overwrote it at 10 this morning, just in case you downloaded it earlier. I have just fixed the bug and made a new release on PyPI, and testing is underway in conda. Thanks for spotting this. I suspected I would see issues once others started testing.

slobentanzer commented 5 years ago

naturally! thanks for the quick replies. I implemented your fix (I used the conda installation though, just added the git clone), and now the error is gone. however, it still does not work as intended. console out

the bowtie logs in mapping.dir complain about Could not locate a Bowtie index corresponding to basename "/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/hg38-chr.fa". I don't know if this is related or not, but the comment in the pipeline.yml states not to use "-" in the base name.

# The name of the genome (single word with no "_" or "-" and including a number, without the .fa. e.g. hg19, genome1)
genome: hg38-chr
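The rule stated in that pipeline.yml comment can be checked before a run. The function below is a minimal sketch of one literal reading of the comment (letters and digits only, at least one digit). It is not the regular expression the pipeline itself uses, and the name is_valid_genome_name is hypothetical.

```python
import re


def is_valid_genome_name(name):
    """Check a genome name against the rule in the pipeline.yml comment.

    Interpretation (an assumption): a single word made of ASCII letters and
    digits only, so no "_", "-" or ".fa", and containing at least one digit.
    """
    if not re.fullmatch(r"[A-Za-z0-9]+", name):
        return False
    return any(ch.isdigit() for ch in name)
```

Under this reading hg19 and genome1 pass, while hg38-chr and hg38.fa are rejected.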

Acribbs commented 5 years ago

I actually fixed this in the latest version of the code, which uses a better regular expression (https://github.com/Acribbs/tRNAnalysis/commit/915ef686a86e23a3661ee9cc147dc9eb21486b75).

So this problem should be fixed.

However, I would rename the hg38-chr .fa file and Bowtie indexes to just hg38 and see how you get on. Are you using the latest version of the code? Did you pip install or conda install?

slobentanzer commented 5 years ago

I installed via conda, but it was not working from the start, so I tried different things. I was not using the cloned repository up until now. which trnanalysis points to the miniconda3 directory.

So far the error persists, there seems to be a problem with bowtie. The log in mapping.dir still says Could not locate a Bowtie index corresponding to basename "/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/hg38.fa". (The file is there.) Console out

Acribbs commented 5 years ago

Yes, this looks like you have an older version of the code.

The path between the ** markers in the following command:

gzip -dc downsample.dir/NORMAL_10KP_2373_miRNA1.fastq.gz >
/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/ctmpgcq42drf && bowtie -k 10 -v 2 --best --strata --sam
**/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/hg38.fa**
/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/ctmpgcq42drf 2> mapping.dir/NORMAL_10KP_2373_miRNA1.bam_bowtie.log | samtools view -bS | samtools sort -T
/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/ctmpnkob6_9t -o mapping.dir/NORMAL_10KP_2373_miRNA1.bam && samtools index mapping.dir/NORMAL_10KP_2373_miRNA1.bam

should be

gzip -dc downsample.dir/NORMAL_10KP_2373_miRNA1.fastq.gz >
/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/ctmpgcq42drf && bowtie -k 10 -v 2 --best --strata --sam
**/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/hg38**
/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/ctmpgcq42drf 2> mapping.dir/NORMAL_10KP_2373_miRNA1.bam_bowtie.log | samtools view -bS | samtools sort -T
/media/sebastian/1tb_sdd/Genomics/190304_stroke/alignment/test_trna/ctmpnkob6_9t -o mapping.dir/NORMAL_10KP_2373_miRNA1.bam && samtools index mapping.dir/NORMAL_10KP_2373_miRNA1.bam

This should have been fixed by the regex in the updated code.
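Bowtie takes an index basename, not the FASTA file, which is why the path ending in .fa fails even though the file exists. As a rough sketch of a pre-flight check, the helpers below strip the extension and list which of the usual small-index files (.1.ebwt to .4.ebwt, .rev.1.ebwt, .rev.2.ebwt) are missing. The function names are hypothetical, and large indexes, which use .ebwtl, are not covered.

```python
import os

# File suffixes of a small Bowtie 1 index; large indexes end in ".ebwtl".
EBWT_SUFFIXES = (
    ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
    ".rev.1.ebwt", ".rev.2.ebwt",
)


def index_basename(genome_path):
    """Strip a trailing .fa or .fasta so the path can serve as a basename."""
    for ext in (".fa", ".fasta"):
        if genome_path.endswith(ext):
            return genome_path[: -len(ext)]
    return genome_path


def missing_index_files(basename):
    """Return the expected index files that do not exist for this basename."""
    return [basename + s for s in EBWT_SUFFIXES if not os.path.isfile(basename + s)]
```

An empty list from missing_index_files(index_basename(path)) means the six files sit next to the genome under the expected name.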

I am waiting for the next version to be merged in bioconda. After that, I think the best way forward would be to do the following:

conda create -n trnanalysis
conda activate trnanalysis
conda install -c bioconda trnanalysis=0.1.4

I will let you know when it is merged

slobentanzer commented 5 years ago

ok great, thanks so far! curious how it turns out!

Acribbs commented 5 years ago

The new version has just been merged in bioconda. It may take a while before it becomes available in the package repository, but you could try conda search -c bioconda trnanalysis, and if 0.1.4 is listed then you are good to go.
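The same check can be scripted against the output of conda search with the --json flag. This is a sketch under an assumption: that the JSON maps each package name to a list of records carrying a "version" field. The function name version_available is made up for illustration.

```python
import json


def version_available(search_json, package, version):
    """Return True if `version` of `package` appears in conda search JSON.

    Assumes the JSON is shaped like {"<package>": [{"version": "..."}, ...]}.
    A missing package key (for example in an error reply) counts as False.
    """
    records = json.loads(search_json).get(package, [])
    return any(rec.get("version") == version for rec in records)
```

For example, save the output of conda search -c bioconda trnanalysis --json to a file, read it in, and call version_available(text, "trnanalysis", "0.1.4").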

slobentanzer commented 5 years ago

great, thanks! I'll update if this solves it.

slobentanzer commented 5 years ago

hi adam, tried the new distro, it runs further than before, but it still throws an error undefined symbol: bam_read1. console output

Acribbs commented 5 years ago

Ah, this is actually an issue with cgat-apps, not trnanalysis. I think the latest build in conda is broken (see https://github.com/cgat-developers/cgat-apps/issues/43). I will ask for it to be marked as broken.

This should be fixable by installing an earlier version of cgat-apps:

conda install cgat-apps=0.5.3

So that I can keep track of this, I will make it a separate issue.

slobentanzer commented 5 years ago

hi adam (@Acribbs), as far as i can tell, bam_read1 is fixed by reverting to 0.5.3, but now it tries to find the tRNAScan2bed in the default conda directory, even though i changed the cribbslab variable to point to the cloned repository. console out

Acribbs commented 5 years ago

That's strange that it isn't picking up the file. Did you install trnanalysis manually or using conda? If through conda, I wonder if your environment is confused.

Acribbs commented 5 years ago

The last time you installed, did you install in a clean environment using conda create -n <environment_name>?

Acribbs commented 5 years ago

I updated the conda package for trnanalysis, and the correct code version should be 0.1.4. It should not have the cribbslab variable in the pipeline.yml.

slobentanzer commented 5 years ago

oh i see; yes i created a new environment, but the example already contains a pipeline.yml, with the cribbslab variable in it.

i checked by deleting the included pipeline.yml and running trnanalysis trna config (in the example), which gave me this error: ValueError: default config file['pipeline.yml']not found in ['/home/sebastian/Programs/miniconda3/envs/trnanalysis/lib/python3.6/site-packages/trnanalysis/pipeline_trna', '/home/sebastian/Programs/miniconda3/envs/trnanalysis/lib/python3.6/site-packages/trnanalysis/configuration']. i think it is the same problem as before, just that now i cannot tell the package where to look for the files. the conda installation does not seem to include all the files..?
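To see whether the installed package really lacks the file, the two directories listed in that error can be checked directly. A minimal sketch, assuming the sub-directory names pipeline_trna and configuration from the error message; find_default_config is a hypothetical helper, not part of trnanalysis.

```python
import os


def find_default_config(package_dir, filename="pipeline.yml"):
    """Return existing copies of `filename` in the two searched sub-directories.

    The sub-directory names are copied from the ValueError above.
    """
    candidates = [
        os.path.join(package_dir, sub, filename)
        for sub in ("pipeline_trna", "configuration")
    ]
    return [path for path in candidates if os.path.isfile(path)]
```

With the environment active, package_dir can be obtained as os.path.dirname(importlib.util.find_spec("trnanalysis").origin), which also shows which installation Python is actually importing. An empty list would support the suspicion that the conda package does not ship pipeline.yml.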

are running from conda and running the cloned repository mutually exclusive?

Acribbs commented 5 years ago

I have had significant issues in the past with removing a package from my conda environment and then installing the latest code from a cloned repo, only for it to still point to the previous version. I think conda does a lot of caching, and I haven't managed to work out the specifics. I ended up starting completely from scratch.

Therefore, I would start completely from scratch in a new conda environment. I have tested the new conda version and it works on my OSX and Linux systems.

slobentanzer commented 5 years ago

i am not familiar with conda, i just installed it for this package. which is good because i do not mind starting from scratch. could you advise on how to make sure it is gone before i install it anew?

Acribbs commented 5 years ago

Conda is an awesome project that has helped me solve so many dependency issues in the past. However, the project has grown considerably, and now it can take a while for the solver to resolve dependencies. The conda developers are aware of this and are working on strategies to improve it; you can see the issues here:

I have had to pin the conda package to cgat-apps==0.5.3, and now it takes ages for the conda SAT solver to find a solution for the installation. Until the issues with cgat-apps are resolved, I think the best way to install is from a conda environment file. I have now created a conda environment for Linux that should work in the meantime. I tested this on Red Hat Linux and it seems to work, although the installation took a bit longer than I was expecting (20 mins or so), but this could be because I had just run a conda clean.

To install the conda environment, just follow these steps:

wget https://raw.githubusercontent.com/Acribbs/tRNAnalysis/master/conda/environments/trnanalysis-linux.yml
conda env create -f trnanalysis-linux.yml 
conda activate trnanalysis-env

slobentanzer commented 5 years ago

hi adam (@Acribbs ), thanks for the effort! I removed conda completely to make sure no old dependencies were linked (with anaconda-clean) and then installed miniconda3 and your environment, and this time make full ran without errors.

at the make build_report stage, I now get another exception (undefined columns selected) from R. console out

is this step strictly necessary? or can I get the count data for the different RNA species without it?

slobentanzer commented 5 years ago

I think the problem with building the report is that the downloadable tutorial only includes three control samples. if I correct the metadata to contain only the samples in the tar, DESeq throws an error.

Acribbs commented 5 years ago

I think the problem was my fault: not all of the fastq files were included in the test data. I have just included them and run the test data, so build_report should work now.

Acribbs commented 5 years ago

Thanks for all your help debugging this. Most of these issues have been dependency problems in conda, but I'm glad I now have Linux and OSX environments that work.

slobentanzer commented 5 years ago

don't mention it! I'll probably have further questions concerning the output once I have managed to run my samples.

for example, how can I access counts for the individual fragments (something along the lines of the output from MINTmap)? is that possible?

Acribbs commented 5 years ago

Yes, you can retrieve the counts for each tRNA feature. I think the best way would be, once the report has been generated, to navigate to the Report.dir directory and open an RStudio session.

Then open the QC_alignment.Rmd file and you will be able to see the code that was used to generate the box plots for the fragments. You can run each individual cell and rather than plotting it, you can export it as a table.

My thinking in using R Markdown was that you can build your own specific analysis on top of the standard analysis that the pipeline does, instead of just using build_report to build a static HTML website.

I think I am going to add a section to the Read the Docs documentation covering the report output of the pipeline. I am also planning to generate a lot more plots and figures that show coverage of the top 50 fragments and the major modifications across the whole tRNA fragment (I actually think Anna may already have written some of the code for this and it is still waiting to be pushed).

This is a repo that is going to be constantly developed, as I have some exciting scientific data and projects on tRNA. Sorry that you're an early adopter of this, but your help has been really useful. Let me know if there is anything else that I can help with.

slobentanzer commented 5 years ago

don't worry, I like participating in development.

I am also involved in several projects concerning tRNA, fragments, and miRs, so if you would like further input, I'd be happy to contribute.

slobentanzer commented 5 years ago

hi adam (@Acribbs ),

EDIT: for the sake of clarity, I am going to make a new issue for this as well.

i finally had time to look at the results, found another bug (will make separate issue), but got it to work in the end. i am in the QC_alignment.Rmd, and i was wondering what the BED file columns were (they are custom, right?).

what i am basically looking for is the easiest way to get the sequence of each fragment using the coordinates from each BED. so is chromStart = V2, chromEnd = V3? count = V7? which reference should i use, and where do i find it in the folder?
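As a rough illustration of the lookup being asked about, the sketch below pulls a fragment sequence out of a reference held in memory. It assumes the standard BED convention (column 1 = name, column 2 = 0-based start, column 3 = exclusive end, column 6 = strand when present). Whether the pipeline's custom files follow it, and whether V7 is the count, is not confirmed in this thread, so V7 is not used. Both function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N; case is preserved)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def fragment_sequence(bed_line, reference):
    """Return the sequence covered by one tab-separated BED line.

    `reference` maps sequence names to sequence strings. Column meanings are
    assumed to follow standard BED, which may not hold for custom files.
    """
    fields = bed_line.rstrip("\n").split("\t")
    name, start, end = fields[0], int(fields[1]), int(fields[2])
    seq = reference[name][start:end]  # 0-based start, end not included
    strand = fields[5] if len(fields) > 5 else "+"
    return reverse_complement(seq) if strand == "-" else seq
```

The reference dictionary would have to be built from whichever FASTA the reads were mapped to, which is the open question here.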

thanks!

sebastian