So my guess is that this error occurs in MATLAB when it tries to parse the output from samtools. This can happen when you don't have the software set up correctly (samtools or the MATLAB dependencies) or when the call to samtools fails (e.g., bam file not found or not indexed).
So some debugging steps:
(1) Verify that your copy of matlab has the symbolic math toolbox. Try opening a matlab prompt on the system where you are using informME and run the command:
vpa('1000')
and verify it doesn't throw an error.
(2) Verify that you have samtools installed and on your system path. At the bash shell on the machine using informME you should be able to run:
samtools view -q 30 -F 3328 -f 3 /path/to/yourBamFile.bam chr1:1-1000
and get the reads listed from chr1:1-1000.
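As a quick generic check (not informME-specific) that samtools is actually on your PATH before digging deeper, a one-line POSIX-shell helper like the following works; `have_cmd` is just a hypothetical name for illustration:

```shell
# have_cmd: report whether a program is available on the PATH.
# Hypothetical helper for step (2) above; works in any POSIX shell.
have_cmd() {
  command -v "$1" >/dev/null 2>&1 && echo yes || echo no
}

have_cmd samtools   # "yes" only if samtools is installed and on PATH
```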
(3) If all the above steps work, then verify your input directory (--bamdir argument) and file are correct. For example, the bam file should be sorted and indexed (i.e., the .bam and .bam.bai files should both be located in the input directory). Also be sure you have the --c_string argument set correctly to go with your reference genome's convention.
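To sketch step (3) as a script: the following POSIX-shell helper (hypothetical name `check_bam`; not part of informME) verifies that both the .bam file and its .bam.bai index sit in the input directory:

```shell
# check_bam: confirm a bam file and its .bam.bai index both exist,
# as the getMatrices step requires. Hypothetical helper, not informME code.
check_bam() {
  bam="$1"
  if [ ! -f "$bam" ]; then
    echo "missing bam"
  elif [ ! -f "${bam}.bai" ]; then
    echo "missing index (run: samtools index $bam)"
  else
    echo "ok"
  fi
}

check_bam /path/to/yourBamFile.bam
```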
I've opened a pull request to have the code catch the most common sources of errors here and throw a more informative error when those problems arise. It may help your debugging if you take that updated file:
and use it to replace the file:
/path/to/informME/src/bash_src/parseBamFile/getMatrices/getMatrices.sh
in your installation directory on the machine using informME.
Thanks Garrett! I figured out that the problem was that my bam file hadn't been sorted (the log file is attached here). However, the .mat file I got is empty. Do you know what the issue is? log.txt
>> S = load('test_sorted_matrices.mat')
S =
  struct with fields:
    mapObjData: [0×1 containers.Map]
Hey glad to help. Now for the current problem. I think I will need more information from you.
(1) You are getting warnings regarding your ~/.informME/informME.config file not being found and/or existing. Did you already go through the install.sh script? If so did you choose not to make this file with default settings? Generally, the usage is a little simpler with that file, although it is indeed optional so long as you are OK with default directories and/or manually passing all directory related arguments.
(2) Related to this, did you successfully analyze the reference genome using fastaToCpG.sh? This only needs to be done one time for each reference genome, but it needs to be completed correctly before you can move onto analyzing bam files.
(3) I am not seeing any true error messages in your log. This leads me to believe that informME (mistakenly) believes there are no reads in your bam file. The most likely source of such a problem would be misspecification of the --c_string flag, which deals with your reference genome's convention for naming chromosomes. If they are labeled chr1,chr2,..., then the option should be set to 1; if they are labeled 1,2,..., then you should set it to 0. Can you grep your reference genome and tell me what you get:
grep "^>" /path/to/yourReferenceGenome.fasta
Generally, informME requires that the chromosome labels follow one of the two conventions above and are ordered accordingly. Additionally, we do not support chromosomes other than the autosomes chr1,chr2,...,chrN at this time, so do not attempt to model partial contigs, mitochondrial chromosomes, or sex chromosomes.
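If it helps, the right --c_string setting can be read straight off the first fasta header. A small sketch (hypothetical helper name, not part of informME; assumes headers look like ">chr1 ..." or ">1 ..."):

```shell
# infer_c_string: print 1 if the first fasta header starts with "chr",
# else 0, matching the --c_string convention described above.
# Hypothetical helper; only inspects the first header line.
infer_c_string() {
  first=$(grep -m 1 '^>' "$1" | cut -c2- | awk '{print $1}')
  case "$first" in
    chr*) echo 1 ;;
    *)    echo 0 ;;
  esac
}
```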
(4) It would generally be helpful for debugging if you could provide the commands and/or cluster submission scripts you are using, since right now I have to mostly guess at your usage from the log files.
(5) InformME is designed not to "repeat" effort, so if it sees that the output ".mat" file already exists, it will exit without rerunning any code. So moving forward, if you get a problematic ".mat" file and plan to tweak something and re-run informME, you should first delete the problematic ".mat" file. For users on a cluster, this can be a convenient feature, since jobs sometimes randomly fail (e.g., when a node goes down), and if you are running many things in parallel, this allows you to simply resubmit the same job and let informME figure out what work has already been done and what work remains. The downside is that debugging requires deleting these files each time until you find the source of the problem.
Just to help you a little bit, here is an example ~/.informME/informME.config file. You can create it yourself in vim/emacs/whatever, and again it can simplify your usage of informME and perhaps make debugging easier due to the ability to check one place for typos/problems versus every command you enter:
MPFRDIR="/scratch/groups/GarrettsGroup/shared/dependencies/"
EIGDIR="/scratch/groups/GarrettsGroup/shared/dependencies/eigen"
REFGENEDIR="/scratch/groups/GarrettsGroup/shared/data/hg19/reference"
BAMDIR="/scratch/groups/GarrettsGroup/shared/data/hg19/bam"
SCRATCHDIR="/scratch/groups/GarrettsGroup/informME_scratch/hg19"
INTERDIR="/scratch/groups/GarrettsGroup/shared/data/hg19/intermediate"
FINALDIR="/scratch/groups/GarrettsGroup/shared/data/hg19/output"
MATLICENSE="/cm/shared/apps/MATLAB/R2017a/licenses"
I am very impressed by your prompt reply, and I very much appreciate your help.
(1) Yes, I have the config file. For this test run, I wanted to make sure the program uses the input arguments, so I just renamed the config file.
MPFRDIR=/home/qinting/softwares/informME/informMEdeps
EIGDIR=/home/qinting/softwares/informME/informMEdeps/eigen
SCRATCHDIR=/tmp
REFGENEDIR=/doubleshot/tingting/informME_refGenome/hg19
BAMDIR=/espresso/rcavalca/mint/projects/hnscc/bisulfite/bismark
INTERDIR=/espresso/tingting/Projects/HNSCC/informME/matFiles
FINALDIR=/espresso/tingting/Projects/HNSCC/informME/results
MATLICENSE=/opt/apps/rhel6/matlab/2018a/licenses/license_ones2_820543_R2018a.lic
(2) I think I successfully got all reference .mat files. For example:
>> s = load('CpGlocationChr16.mat')
s =
  struct with fields:
    CpGlocation: [873464×1 int64]
    chrLength: 102531392
    Dist: [873464×1 int64]
    density: [873464×1 double]
    finalCpGloc: 102521093
(3) The reference genome chromosomes are labeled chr1, chr2, ..., and I set -c 1. But it does contain mitochondrial and sex chromosomes... So do we need to remove those chromosomes from my bam files, or just delete those reference mat files?
chrM chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
(4) I ran the command on our server (Red Hat Enterprise Linux Server release 6.9 (Santiago)):
nohup getMatrices.sh -r /doubleshot/tingting/informME_refGenome/hg19/ -b /home/qinting/espresso/tingting/Projects/HNSCC/informME/ -d /home/qinting/espresso/tingting/Projects/HNSCC/informME/matFiles/ -q 5 -c 1 -l /opt/apps/rhel6/matlab/2018a/licenses/license_ones2_820543_R2018a.lic --tmpdir ./tmp -- test_sorted.bam 16 &
(5) I did notice that and deleted all related output files before running the command.
Thanks, Tingting
(1) OK no problem. FYI, the software behaves as follows. If no config file and no argument is specified, default directories (relative paths from the caller's current directory) are used. If a config file specifies any of the folders, that will be used rather than the default. If a command-line argument is specified by the user, this is used rather than the config file or default selections. So you should be able to leave the config file in place, and overwrite any of those options by passing arguments from the script. If you do notice behavior that deviates from this, please let me know and I will correct the bug.
(2&3) Excellent! This explains things and it is my fault for not documenting things well enough here. My sincere apologies. And I have pushed updates to the main readme to clarify this issue, and am working on improving the error messages to be more informative. So let me explain what the problem is, and then how you can fix it. The reference fasta file is assumed to be ordered such that all autosomes come first and in their natural order: chr1,chr2,...,chr22 (in humans). It does not care what comes after the autosomes, nor their ordering, but since you should not try to model them in downstream analyses, we generally advise to trim the fasta file to only the autosomes so the fastaToCpG.sh does not do more work than is required (although trimming is not mandatory, since things will work just fine as long as the ordering is OK). Here since chrM came first, the software got confused and labeled it chr1, and so chrN is being represented by chr(N+1) in the mat files. You can verify from your output there that CpGlocationChr16.mat is giving you the length of chr15.
So you have three options going forward. Option 1, sort the fasta file and rerun the fastaToCpG.sh. Option 2, trim the fasta file and rerun the fastaToCpG.sh. Option 3, since it looks like everything is good, you can delete CpGlocationChr1.mat, and relabel CpGlocationChrX.mat to have X=X-1; after that you can feel free to leave or delete CpGlocationChrY.mat for Y>22. Option 3 will save you compute time and waiting for that step to finish again, so I suggest that route, unless you feel more comfortable "correctly" using the software from start to finish in which case either option 1 or 2 would be fine.
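Option 3 amounts to a delete-then-shift rename of the reference mat files. Here is a sketch under the assumptions above (chrM came first in the fasta, so CpGlocationChrN.mat actually holds chr(N-1)); `relabel_refs` is a hypothetical helper name, and you should back up the directory before running anything like this:

```shell
# relabel_refs: option 3 from above. Deletes CpGlocationChr1.mat (which
# held chrM) and shifts CpGlocationChrN.mat down to Chr(N-1) for N=2..23,
# so the autosome files end up labeled chr1..chr22. Files beyond Chr23
# (the X/Y slots) are left alone; delete them or not as you prefer.
relabel_refs() {
  dir="$1"
  rm -f "$dir/CpGlocationChr1.mat"
  n=2
  while [ "$n" -le 23 ]; do
    if [ -f "$dir/CpGlocationChr${n}.mat" ]; then
      mv "$dir/CpGlocationChr${n}.mat" "$dir/CpGlocationChr$((n-1)).mat"
    fi
    n=$((n + 1))
  done
}
```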
Also regarding your bam files: nothing is required there. It does not matter if they are aligned to a non-trimmed fasta file, just so long as the reference genome uses the same autosome naming scheme. InformME makes calls to your bam file through samtools view, and so it does not matter if there are reads that are never viewed.
For now, I am labeling this issue as an enhancement, since I would like to eventually handle reference fasta files more generally so as to avoid the sorting problem. Also, I will leave it open in case other people run into similar troubles.
But please reply back to let me know if you have further troubles or questions; or if everything starts working you can let me know that as well because I will be thrilled to have a new user of the code!
I renamed the reference mat files by following your option 3, but still my output mat file is empty.
getMatrices.sh -r /doubleshot/tingting/informME_refGenome/hg19/ -b /home/qinting/espresso/tingting/Projects/HNSCC/informME/ -d /home/qinting/espresso/tingting/Projects/HNSCC/informME/matFiles/ -q 20 -c 1 -l /opt/apps/rhel6/matlab/2018a/licenses/license_ones2_820543_R2018a.lic --tmpdir ./tmp -- test_sorted.bam 2
I double checked the length of chromosome 2 in the mat file and the fa file, and they are consistent.
>> s = load('CpGlocationChr2.mat', 'chrLength')
s =
struct with fields:
chrLength: 243199373
(the chr2 size on hg19 genome is 243199373)
I didn’t get any error messages when running getMatrices.sh. Maybe you can test on my bam file attached here? test_chr2_1to100000_sorted.bam.zip
Sorry to hear that you are still having troubles.
Your command looks well-formed to me, but can you try using an absolute path for your tmpdir rather than a relative path? The code should handle a relative path, but absolute paths are the best way to ensure everything is going exactly where we expect, especially if you are in a cluster environment.
Also, can you give me some more information about test_sorted.bam and the difference between it and the attached test_chr2_1to100000_sorted.bam.zip? What is the average depth of these WGBS reads? The attachment does not have a .bai file included so I cannot run informME without some samtools preprocessing (which I can do, but there are better debugging options first, see below). Remember that all bam files must be sorted and have their indexed bai file in the same folder when informME accesses them. Additionally, most hg19 bam files would be too large for email anyhow, and if it is small enough for email, then that might explain why you are getting empty files (since informME only stores information over CpG sites in regions with sufficient CpG density to be modeled). From the file name, it appears these reads are only over a segment of the chr2, and perhaps there is not sufficient data or CpG sites in this region/file for informME to produce any output. Maybe you can try on a full scale file?
If you don't want to do a large run and are trying to run small scale test examples, please see the "toy" models that we include with the software. Namely, you can test your informME install on a small scale example using the toy model and the "main.sh" testing scripts inside each bash_src directory. Specifically, you can test getMatrices.sh as follows (this advice comes from my co-author Jordi, who is the guru that came up with this nice testing example in our package):
cd informME/src/bash_src/parseBamFile/getMatrices/main
mv out/ out_real
./main.sh
then the contents of out_real are what you would expect from a working installation, and the newly created contents of out are what your installation produced. Can you run that and report back?
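One quick way to eyeball the comparison between a reference directory and your freshly generated one is by file size (the .mat binaries will not be byte-identical across MATLAB versions, so sizes are only a rough sanity check); `compare_sizes` is a hypothetical helper, applied per chromosome directory:

```shell
# compare_sizes: for each file in the reference dir, print its size next
# to the size of the same-named file in the test dir. Rough sanity check
# only; .mat files are compressed binaries and won't match exactly.
compare_sizes() {
  ref="$1"; new="$2"
  for f in "$ref"/*; do
    g="$new/$(basename "$f")"
    if [ -f "$g" ]; then
      printf '%s: %s vs %s bytes\n' \
        "$(basename "$f")" "$(wc -c < "$f")" "$(wc -c < "$g")"
    fi
  done
}
```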
The test_chr2_1to100000_sorted.bam.zip only contains reads covering chr2: 1-100000, while test_sorted.bam contains reads across the genome (I attached the first one here due to the size limit).
I actually tested on the complete bam file (test_sorted.bam) and got the empty result.
Ah, I was looking for the example data and could not find it. Thanks for pointing it out. I will test on it and let you know if the problem is due to our data itself (hopefully not...)
I got this strange error:
ERROR: input bam file:
/home/qinting/softwares/informME/src/bash_src/parseBamFile/getMatrices/main/input/toy_cancer_pe
is not readable
Should the full file name including .bam be specified in the main.sh file?
Hmm... it appears that this error could be a result of my PR, which attempts to catch and throw more informative errors. I think that helped us initially in this investigation, but now it is adding a confounding variable. Sorry to keep having you switch, but can you replace this file with the one off of the github master branch:
/path/to/informME/src/bash_src/parseBamFile/getMatrices/getMatrices.sh
and try again? The difference between these files clearly did not affect things in the "real" file, since you would have seen a similar error there, and instead had no error. But I have been making changes on that PR branch, and I don't want that adding additional unknowns to our debugging.
It seems to work fine now. The output file sizes are comparable to the ones in the out_real folder.
-rw-r--r--. 1 qinting users 72K Apr 19 20:24 out/chr1/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 69K Apr 19 20:24 out/chr2/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 70K Apr 19 20:25 out/chr3/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 67K Apr 19 20:25 out/chr4/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 69K Apr 19 20:25 out/chr5/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 72K Apr 19 20:25 out/chr1/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 70K Apr 19 20:26 out/chr2/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 70K Apr 19 20:26 out/chr3/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 67K Apr 19 20:27 out/chr4/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 69K Apr 19 20:27 out/chr5/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 70K Apr 17 10:30 out_real/chr2/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 72K Apr 17 10:30 out_real/chr1/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 72K Apr 17 10:30 out_real/chr1/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 70K Apr 17 10:30 out_real/chr3/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 70K Apr 17 10:30 out_real/chr3/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 69K Apr 17 10:30 out_real/chr2/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 69K Apr 17 10:30 out_real/chr5/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 69K Apr 17 10:30 out_real/chr5/toy_cancer_pe_matrices.mat
-rw-r--r--. 1 qinting users 66K Apr 17 10:30 out_real/chr4/toy_normal_pe_matrices.mat
-rw-r--r--. 1 qinting users 67K Apr 17 10:30 out_real/chr4/toy_cancer_pe_matrices.mat
So is something wrong with my data? The average depth on chromosome 1 of my test bam file is ~10X. What is the minimum average depth required by informME? I checked that your toy input data is ~15X. Is that the minimum?
OK, so I agree that it seems that your installation is working fine, because if you had a problem with samtools/informME/etc then you would not have been able to replicate those toy results. Which brings us back to (a) usage in your real data or (b) your real data itself as the culprit.
To have a fully clean testing environment, I have setup a virtual machine where I will install a fresh copy of informME and try to run your sample through. This will be better than testing in my "production" pipeline, where things have been tweaked for that specific cluster, etc.
10X data should be more than sufficient to give non-empty results. With 10X data you get dropout due to random variations in coverage (and informME deciding not to model regions with insufficient data, rather than spitting out "garbage" models in very low coverage regions), but you should definitely be getting a lot of "matrices" built in this step from 10X data. We generally see 15-20X resulting in nearly complete modeling of the genome, and so for best results I suggest >15X, but as long as you are OK with dropout of some regions, 10X will still provide a lot of regions with fully reliable results.
So I am going to do some testing with the bam file you provided, and I am also going to confer with my colleagues to see if they have any thoughts on why you are experiencing issues here.
Can you tell me the version number and the commands you used in bismark to produce your bam file? That might help me in my debugging as well.
OK, I found the problem. Your data is not paired-end sequencing. The core matlab code is written to explicitly handle either single-end or paired-end reads, but since all of our data is paired-end, we overlooked having a user-settable flag in the shell script to select single-end reads. Let me touch base with my co-author to work on adding that flag back into the user interface. We will fix it quickly and make a v0.3.1 release that should work just fine for your data.
But as a scientific note, informME will work best with (a) longer sequencing reads and (b) paired-end sequencing reads. Since informME is based on trying to measure correlations along the sequencing reads (both ends of them when paired), it will inherently struggle in sparse regions where two CpG sites are rarely observed on the same sequencing read. So when I said 15-20X generally provides fully modeled genomes, this was referring to ~125bp paired end sequencing reads.
Thanks for catching that. Unfortunately, our data consists of only 50-bp single-end sequencing reads. Do you think it is still worth a try?
Jordi and I are actively working (see the PR #7 for details) on making sure you can give it a try ASAP (thank you for your patience and feedback, we welcome any comments/critiques on documentation or usage that you think could be improved). I think it is definitely worth a try, but let me expand upon the limitations so you will know how to interpret the results if you go to write a paper:
The power of informME versus standard marginal processing (e.g., where each CpG is treated as an independent Bernoulli variable) is that we actively consider the single-cell information that a WGBS read provides when it observes two or more CpG sites simultaneously on a single read. In your data, we will never simultaneously observe two CpG sites that are more than 50bp apart, so that advantage is reduced. But in this case, we would do no worse than marginal modeling; we just lose that extra edge. Where your data will shine is in CpG islands, where the spacing between CpG sites is small. However, your read depth might limit the number of islands informME chooses to model, because more CpG sites means that gaps in the data make estimation more difficult, and informME is cautious about what it chooses to model. We would rather you have gaps in modeling along the genome, but have full confidence in any regions that do have models.
One thing we suggest when data-depth is lower is that you try building "pooled" models. So for example, if you have two normals and two cancers, rather than building 4 models, you can build a pooled normal and compare it against your pooled cancer. The pooled models then have effectively twice the depth. A caveat is that the models now do not represent individual samples, but a population of samples. So high entropy in the population, for example, could be due to heterogeneity of the samples rather than high entropy in both samples. But as long as you are aware of this distinction when you draw your scientific conclusions there is no problem and often people are indeed interested in the population difference between a normal and cancerous population. In informME, building pooled models is easily handled at the informME_run.sh stage, where you can specify multiple mat files when building a model.
Alright @qinting, we have merged our PR onto the master branch. You can now specify single-ended reads through the "-p 0" flag in the getMatrices.sh step. So can you please do a git pull to update your code to the latest version? I believe you will not need to reinstall anything, but the pull will bring in the new code.
We would very much appreciate you trying it out to see that it works for you. I am also going to wait for feedback from you before tagging the master branch as v0.3.1, since I want to include any improvements/fixes/documentation clarifications before formally tagging the release.
You may want to back-up all compiled mex code from your installation:
cp informME/src/matlab_src/*mex* ~/path/to/backup/folder/
Putting these files back in the matlab_src directory is all that is required to complete the installation on a fresh github clone, assuming you do not delete/change the informME_deps folder with your Eigen/MPFR/etc files that were created during the installation process.
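The backup-then-restore around a fresh clone can be sketched as a pair of tiny helpers (hypothetical names; the source and backup directories are placeholders you would adjust to your installation):

```shell
# Hypothetical helpers sketching the backup/restore of compiled mex
# binaries around a fresh git clone. $1 = source dir, $2 = destination.
backup_mex()  { mkdir -p "$2" && cp "$1"/*mex* "$2"/; }
restore_mex() { cp "$1"/*mex* "$2"/; }

# e.g. backup_mex  informME/src/matlab_src  ~/mex_backup
#      restore_mex ~/mex_backup             informME/src/matlab_src
```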
Ok, the program is running, and the temporary mat output files are not empty. I will keep you updated. Thanks for all the help and explanations!
Wonderful, thank you for your patience and persistence. We welcome any further feedback/criticism that may help us improve the user experience.
I wonder why the reads need to be trimmed during the getMatrices step, as shown in your example command.
Hi Qinting,
This is Jordi, the other contributor to this repository. In the toy example we want to show how to use optional arguments (the ones preceding the '--' symbol), and that is why we trim the reads. In addition, it is customary to do so since read quality decays toward the ends of the read. Granted, this is no problem when the reads are long enough; however, if the reads are shorter, then one might not want to do that.
In any case, it is not mandatory to do so, and in fact, if no -t argument is passed, the default behavior is not to trim. To learn more about the mandatory vs. optional arguments, as well as the default behavior, you can type the name of the function. Just entering
getMatrices.sh
will give you all the information related to this function. Let us know if you have any other doubts or if we can help in any other way.
Jordi
Jordi covered the topic exceptionally well. Just to chime in on the scientific aspects a bit: you usually want to construct an M-bias plot using Bismark (or other similar tools), which will tell you whether your read quality is low in the first few base pairs, e.g.,
https://www.biostars.org/p/259801/
It is very common, for example, for the second read in paired-end sequencing to be lower quality over the first 5-10 bp. In such a case, you would want to tell informME, through the -t argument, to trim off those problematic base pairs.
I agree with Jordi that for your data, with single-end, relatively short reads, you probably want to stick to the default of no trimming, unless the M-bias plot displays very serious read quality problems in the first few bp of your data.
Thank you for the feedback @qinting, and I hope we clarified the trimming issue. I also updated the help file for getMatrices.sh to demonstrate the default usage, which aims to further clarify to users that trimming is indeed optional.
Has the remainder of the pipeline worked smoothly for you? Do you have any further suggestions for clarification? In the near future I will probably tag a v0.3.1 release, but want to be sure we consider any of your suggestions before doing so. Thanks again for helping us to improve things here.
Thanks Garrett! It totally makes sense. Yes, the pipeline worked smoothly on my test sample. Currently I am running getMatrices.sh on all samples. I will let you know if I have any questions.
OK thanks @qinting, I have tagged the v0.3.1 release. I will also close this issue as resolved by #7 and the v0.3.1 release. If you experience any further problems please open a new issue.
Hi, I failed to run getMatrices.sh and got the error message: "Error using symengine Unable to convert expression into double array." I didn't see symengine installed in the informMEdeps folder. Could you please help me fix this issue?