GarrettJenkinson / informME

An information-theoretic pipeline for methylation analysis of WGBS data
GNU General Public License v3.0
25 stars 8 forks source link

Output listing only chr1 #15

Open Lux91 opened 6 years ago

Lux91 commented 6 years ago

After running both "singleMethAnalysisToBed" and "diffMethAnalysisToBed", it gives me output files listing only results for chr1. I tried specifying as minchr 1 and maxchr 22 (even if it's the default option) but it still reports only results for chr1. I also tried forcing it to analyze the chr5 (just a random number different from chr1) and it gives me all empty files. Looking at the log provided by the tool while running there is no error and it says "Command succesful" at the end. The exact commands I used are: 1) for i in A C; do singleMethAnalysisToBed.sh --MC --ESI --MSI $i; done where A and C are my phenotypes (affected and control)

2) diffMethAnalysisToBed.sh --MC --ESI --MSI A C

I'm running everything on an Ubuntu Server 14.04.5 LTS (32 CPUs, 64GB RAM) -Samtools 0.1.19 -Matlab 2018a -informME 0.3.2 -g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4

thank you in advance for your help

jordiabante commented 6 years ago

Hi Lucas,

I will try to debug this with you. It looks like the previous step only generated files for the first chromosome. Can you tell me what command did you use with informME_run.sh?

Jordi

Lux91 commented 6 years ago

Hi, Since I wanted to run informME for every chromosome of 2 samples I did it in a very unelegant way (lol) for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22; do informME_run.sh -q 8 Sample_1 A $i; informME_run.sh -q 8 Sample_2 C $i; done I can confirm you every chr folder contains non empty mat files generated by informME_run.sh Lucas

jordiabante commented 6 years ago

Okay, so I understand that you submitted a single job for the for loop you shared with me? We recommend submitting an individual job per each chromosome for each sample. Otherwise, one might run into issues because of lack of time, memory, etc. Here is an example:

#!/bin/bash

# Input
mat_files="$1"
phenotype="$2"

# Vars
total_chr=22

# Process BAM file to generate methylation matrices
for ((chr=1; chr<=total_chr; chr++))
do

sbatch << EOJ
#!/bin/sh
#SBATCH --job-name=iME-hg19-RUN-${phenotype}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=100G
#SBATCH --time=30:00:00
#SBATCH -o /scratch/shared/logfiles/%u/IME-hg19-RUN_%A_${phenotype}_${chr}.out
#SBATCH -e /scratch/shared/logfiles/%u/IME-hg19-RUN_%A_${phenotype}_${chr}.err

# Dependencies
module load matlab/R2017a
module load samtools/0.1.19

# Run
informME_run.sh -q 6 --time_limit 30  ${mat_files} ${phenotype} ${chr}

EOJ

sleep 1

done

This submission script can be executed as follows:

./informME_submission.sh toy_sample_normal toy_normal

In any case, what are the sizes of the _fit.mat and a _analysis.mat files for each phenotype? Are these for chr1 much larger than for the rest of the chromosomes?

Lux91 commented 6 years ago

I'm not running it on a cluster, just on a regular server this is why I thought I could run it within a loop regular bash loop. Furthermore all the informME_run.sh instances ended correctly without errors, this is why I don't understand what happened. To answer your questions, the _fit.mat and _analysis.mat are actually small in all chromosomes except chr1.

jordiabante commented 6 years ago

I see, so that confirms that something went wrong in a previous step for the rest of the chromosomes. That is why both singleMethAnalysisToBed.sh and diffMethAnalysisToBed.sh are only generating output for chr1. Here are two things that I will ask from you:

  1. Can you do ls -lthr in the folder where the output from fastaToCpg.sh was stored? If you didn't use any optional commands, it should be in cd $REFGENEDIR once you have sourced your config file. You should see the following sizes (assuming you're working with hg19):
-rw-rw-r-- 1 user@jhu.edu group  11M Oct 31  2017 CpGlocationChr1.mat
-rw-rw-r-- 1 user@jhu.edu group  11M Oct 31  2017 CpGlocationChr2.mat
-rw-rw-r-- 1 user@jhu.edu group 7.8M Oct 31  2017 CpGlocationChr3.mat
-rw-rw-r-- 1 user@jhu.edu group 7.1M Oct 31  2017 CpGlocationChr4.mat
-rw-rw-r-- 1 user@jhu.edu group 7.2M Oct 31  2017 CpGlocationChr5.mat
-rw-rw-r-- 1 user@jhu.edu group 7.0M Oct 31  2017 CpGlocationChr6.mat
-rw-rw-r-- 1 user@jhu.edu group 7.4M Oct 31  2017 CpGlocationChr7.mat
-rw-rw-r-- 1 user@jhu.edu group 6.2M Oct 31  2017 CpGlocationChr8.mat
-rw-rw-r-- 1 user@jhu.edu group 5.8M Oct 31  2017 CpGlocationChr9.mat
-rw-rw-r-- 1 user@jhu.edu group 6.4M Oct 31  2017 CpGlocationChr10.mat
-rw-rw-r-- 1 user@jhu.edu group 6.1M Oct 31  2017 CpGlocationChr11.mat
-rw-rw-r-- 1 user@jhu.edu group 6.0M Oct 31  2017 CpGlocationChr12.mat
-rw-rw-r-- 1 user@jhu.edu group 3.9M Oct 31  2017 CpGlocationChr13.mat
-rw-rw-r-- 1 user@jhu.edu group 4.1M Oct 31  2017 CpGlocationChr14.mat
-rw-rw-r-- 1 user@jhu.edu group 4.1M Oct 31  2017 CpGlocationChr15.mat
-rw-rw-r-- 1 user@jhu.edu group 5.0M Oct 31  2017 CpGlocationChr16.mat
-rw-rw-r-- 1 user@jhu.edu group 5.3M Oct 31  2017 CpGlocationChr17.mat
-rw-rw-r-- 1 user@jhu.edu group 3.2M Oct 31  2017 CpGlocationChr18.mat
-rw-rw-r-- 1 user@jhu.edu group 4.8M Oct 31  2017 CpGlocationChr19.mat
-rw-rw-r-- 1 user@jhu.edu group 3.4M Oct 31  2017 CpGlocationChr20.mat
-rw-rw-r-- 1 user@jhu.edu group 1.8M Oct 31  2017 CpGlocationChr21.mat
-rw-rw-r-- 1 user@jhu.edu group 2.7M Oct 31  2017 CpGlocationChr22.mat
  1. Can you do ls -lthrR in the folder where the output from get_matrices.sh was stored? If you didn't use any optional commands, it should be in cd $INTERDIR once you have sourced your config file. For each chromosome, you should see a _matrices.mat for each sample. Let me know what sizes these have and if there are any that are smaller than 50M.
Lux91 commented 6 years ago

1) ls -lthr in $REFGENDIR -rw-rw-r-- 1 lucas lucas 11M giu 20 14:01 CpGlocationChr1.mat -rw-rw-r-- 1 lucas lucas 6,4M giu 20 14:09 CpGlocationChr2.mat -rw-rw-r-- 1 lucas lucas 6,2M giu 20 14:18 CpGlocationChr3.mat -rw-rw-r-- 1 lucas lucas 6,1M giu 20 14:26 CpGlocationChr4.mat -rw-rw-r-- 1 lucas lucas 4,0M giu 20 14:31 CpGlocationChr5.mat -rw-rw-r-- 1 lucas lucas 4,2M giu 20 14:37 CpGlocationChr6.mat -rw-rw-r-- 1 lucas lucas 4,2M giu 20 14:42 CpGlocationChr7.mat -rw-rw-r-- 1 lucas lucas 5,1M giu 20 14:49 CpGlocationChr8.mat -rw-rw-r-- 1 lucas lucas 5,5M giu 20 14:57 CpGlocationChr9.mat -rw-rw-r-- 1 lucas lucas 3,3M giu 20 15:01 CpGlocationChr10.mat -rw-rw-r-- 1 lucas lucas 4,9M giu 20 15:08 CpGlocationChr11.mat -rw-rw-r-- 1 lucas lucas 11M giu 20 15:22 CpGlocationChr12.mat -rw-rw-r-- 1 lucas lucas 3,5M giu 20 15:27 CpGlocationChr13.mat -rw-rw-r-- 1 lucas lucas 2,1M giu 20 15:29 CpGlocationChr14.mat -rw-rw-r-- 1 lucas lucas 2,9M giu 20 15:33 CpGlocationChr15.mat -rw-rw-r-- 1 lucas lucas 7,9M giu 20 15:44 CpGlocationChr16.mat -rw-rw-r-- 1 lucas lucas 7,1M giu 20 15:53 CpGlocationChr17.mat -rw-rw-r-- 1 lucas lucas 7,4M giu 20 16:03 CpGlocationChr18.mat -rw-rw-r-- 1 lucas lucas 7,1M giu 20 16:13 CpGlocationChr19.mat -rw-rw-r-- 1 lucas lucas 7,4M giu 20 16:23 CpGlocationChr20.mat -rw-rw-r-- 1 lucas lucas 6,2M giu 20 16:31 CpGlocationChr21.mat -rw-rw-r-- 1 lucas lucas 5,8M giu 20 16:39 CpGlocationChr22.mat

I used as fasta reference the GRCh38.76_primary_assembly

2) ls -lthrR in $INTERDIR -rw-rw-r-- 1 lucas lucas 385M giu 21 13:32 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 397M giu 21 16:16 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 178M giu 21 17:44 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 179M giu 21 19:07 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 179M giu 21 20:30 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 179M giu 21 21:49 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 175M giu 21 23:06 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 176M giu 22 00:20 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 121M giu 22 01:10 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 121M giu 22 02:00 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 121M giu 22 03:08 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 121M giu 22 04:10 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 111M giu 22 05:06 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 111M giu 22 05:59 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 119M giu 22 06:56 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 120M giu 22 07:50 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 84M giu 22 08:23 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 85M giu 22 08:56 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 101M giu 22 09:49 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 101M giu 22 10:40 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 89M giu 22 11:29 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 89M giu 22 12:13 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 179M giu 22 13:54 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 179M giu 22 15:26 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 65M giu 22 15:59 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 65M giu 22 16:32 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 39M giu 22 16:54 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 39M giu 22 17:16 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 45M giu 22 17:44 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 45M giu 22 18:10 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 104M giu 22 19:26 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 104M giu 22 20:33 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 101M giu 22 21:55 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 101M giu 22 23:05 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 99M giu 22 23:52 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 100M giu 23 00:38 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 80M giu 23 01:49 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 80M giu 23 02:50 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 84M giu 23 03:40 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 84M giu 23 04:27 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 46M giu 23 04:57 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 47M giu 23 05:25 Sample_B_matrices.mat -rw-rw-r-- 1 lucas lucas 39M giu 23 05:56 Sample_A_matrices.mat -rw-rw-r-- 1 lucas lucas 39M giu 23 06:25 Sample_B_matrices.mat (they are ordered from chr1 to chr22)

As you can see, some of them are smaller than 50M but the majority is not. I've already talked to Garrett about an issue while get_matrices.sh was running regarding samtools giving an error while parsing a few regions but he assured me that is not that important.

jordiabante commented 6 years ago

I see, alright. So the size of the .mat files for the reference look just fine. However, the size of the _matrices.mat files seems to be a bit smaller than what we usually see for WGBS data. Incidentally, are you using WGBS or RRBS data?

Going over your emails with Garrett, it seems that you tried running informME_run.sh for a single sample and just for a single chromosome. If you haven't, can you try so with say Sample_A and chromosome 11?

Also, can you let me know the sizes of the BAM file you used as input?

Lux91 commented 6 years ago

This is the size of my bam files: -rw-rw-r-- 1 lucas lucas 5,9G apr 1 22:02 Sample_A.bam -rw-rw-r-- 1 lucas lucas 5,5G apr 2 00:22 Sample_B.bam I'll try to run informME_run.sh for sampleA chr11 and I'll let you know what happens. As far as I know my samples are WGBS.

jordiabante commented 6 years ago

Sounds good.

So the sizes we usually see for human data are between 15-30G's. That usually translates into average coverages of x20. In our experience, WGBS data that is ~125bp paired-end reads at 15X to 20X coverage results in high quality models across most regions of interest in the genome.

Can you find out what kind of data you have? WGBS, RRBS? Paired-end, single-end? Read length?

Also, do you have more replicates for each phenotype?

Lux91 commented 6 years ago

Hi, so I asked to the lab that prepared the libraries and they told me that they followed the protocol provided by roches at this link: https://sequencing.roche.com/en/products-solutions/by-category/target-enrichment/hybridization/seqcap-epi-cpgiant-enrichment-kit.html I guess it's RRBS from what i read but they don't actually say it. (a lot of things now make sense, I wish they told me everything from the start) There's a target region of approximately 80.5Mln bases, capturing 5.5Mln CpGs (the target region covers all the genome "evenly") They've been sequenced in pair-end, every read has max length of 150, min length of 70 since I've pre-processed them before the alignment. In this data set I have 5 subjects, 2 affected and 3 not.

jordiabante commented 6 years ago

I see yes, it looks like some sort of targeted BS. We are not familiarized with this type of data, but as long as the reads are distributed uniformly within the target regions, and the regions are much larger than 3 kbp in size, then informME might still produce meaningful results (@GarrettJenkinson chime in if you think differently please). In fact, in the same link you shared with me they provide a way to download a BED file containing the targeted regions (130912_HG19_CpGiant_4M_EPI.bed). Can you download this file and look at the distribution in size of the regions?

If indeed, our assumptions of uniformly distributed reads in the modeled 3 kbp windows holds, then you can try using the 3 unaffected samples to build a single pooled model for the unaffected phenotype, and using the 2 affected samples to build a pooled model for the affected phenotype. Hopefully, in doing so, the coverage will triplicate and duplicate respectively, and informME might be able to model more regions. However, let's try to see how the targeted regions look like first.

GarrettJenkinson commented 6 years ago

Hi Lucas, so Jordi has asked all the right questions while I've been dealing with movers, etc. to get at the heart of your issue. I will chime in with my thoughts:

The fact that your data is RRBS/capture rather than WGBS is somewhat problematic for our pipeline, since it has been designed to work with WGBS coverage patterns. See our FAQs for a discussion of this issue. If you do decide to continue to use informME on this data, we would be curious as to how well it works for you, since there is a lot of RRBS data out there, but for the most part you are blazing new trail here.

The RRBS issue does not fully explain the problem you are having here, though, because it looks like your matrices files are large enough to produce some models on the other chromosomes if they produced models for chr1. So please let us know the results of your informME_run.sh on a single chromosome (if you are yet to do so, maybe chr2 is a good choice since the matrices files are bigger there). Remember to delete the "old" small ".mat" files that were produced by your previous informME_run submission, since informME will check if these files exist and not do anything if they find them there (which is helpful on clusters, but not so much in your case). One thing that concerns me is that you have listed a gcc compiler that is not compatible with your matlab version, see here:

https://www.mathworks.com/support/compilers.html

while this may not pose a problem (you did get results for chr1, which I would not expect if the binaries did not compile correctly), it very well could lead to strange/unpredictable behaviors. If the toy model does not work properly, then I can help you install a local gcc version compatible with matlab, since I did this on my ubuntu virtual machine that I use for testing informME.

Returning to the issue of coverage, it is possible in theory that the informME_run stage chooses not to build any models in a given chromosome for RRBS depending on how the data is distributed along the genome, but it seems unlikely to me that you would randomly have models built on chr1 and none of the others, unless the design of your RRBS has denser capture on chr1. Also I don't recall if you ever ran the "toy model" that tests your installation, but that could be helpful in turning up any issues with the install, etc.

We also understand, however, if you choose not to proceed, since the pipeline is not tailored to your data, just let us know what you decide so we can close this issue, etc.

Lux91 commented 6 years ago

Thank you guys, I would like to try using it since is the best tool (mathematically speaking) I've seen so far, so I'm going to try the toy data set and I'm also going to try building a model for my data for a single chromosome (just curious to see what happens). I'll reply as soon as I can with the details of the target region (segments length etc.). Regarding the compiler, I had my doubts when I started the installation but everything went fine and both the installations of matlab and informME didn't raised any warning or error. While I was thinking about the RRBS issue, I was wondering if instead of using the complete genome in the "fastaToCpG" step I could use a version that includes only the sequences covered by my target region. Do you think it could work? It would be interesting to see if informME mathematical approach works and its prediction are reliable in this case (probably Garrett already knows since he is super skilled in mathematics :) ).

GarrettJenkinson commented 6 years ago

Thanks for the compliments :) I hope informME will work for your data even though it is not tailored to it.

Regarding truncating the reference genome to be only the targeted regions: I think this would only work if you treated each contiguous targeted region as its own "chromosome", because otherwise, if you just removed gaps between regions, then you would have incorrect distances between CpG sites that come from two different regions, which would lead to inaccurate models. Treating each region as a "chromosome" would work just fine, but I imagine it would be too hackish and clunky to be worth it. Really, there are a lot of tweaks required to "do targeted/RRBS modeling right", which is why we don't recommend informME right now for it. What you will probably end up with is a few good models in locations where the targeting lines up nicely with informME's 3kb estimation windows, but you will probably miss a lot of CpG sites that do have targeted data. Please do update us with the target regions and any results you get from your testing, as we are interested in exploring this RRBS issue with you and getting you up and running.