brian-cleary / LatentStrainAnalysis

Partitioning and analysis methods for large, complex sequence datasets
MIT License

Error in Kmer Corpus step #6

Closed nmb85 closed 8 years ago

nmb85 commented 8 years ago

Hi! Firstly, this is a rad tool and I can't wait to see the results for my metagenomic time series. Thanks very much for developing it! I'm stuck at the Kmer Corpus step and my error log is giving me this traceback, which says that the matrices are not aligned:

Traceback (most recent call last):
  File "LSA/kmer_corpus.py", line 33, in <module>
    hashobject.kmer_corpus_to_disk(Kmer_Hash_Count_Files[fr],mask=M)
  File "/LatentStrainAnalysis/LSA/streaming_eigenhashes.py", line 42, in kmer_corpus_to_disk
    norm = np.linalg.norm(H)/len(H)**.5
  File "/local/cluster/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 2060, in norm
    sqnorm = dot(x, x)
ValueError: matrices are not aligned

This is the only error message, and the job dies within a fraction of a second of starting. I looked the message up online, and it's a generic complaint about mismatched matrix/vector dimensions in a dot product. Any clues as to where that cropped up in my workflow? Thanks for your time!
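For readers who land here with the same traceback: the failing line, np.linalg.norm(H)/len(H)**.5, amounts to the root-mean-square of H. Below is a minimal pure-Python sketch of that quantity with explicit input checks, so that an empty or oddly shaped input fails with a readable message. The helper name rms_norm is hypothetical and not part of LSA, the sketch does not reproduce NumPy's internals, and the log does not say what shape H actually had in this run.

```python
import math


def rms_norm(values):
    """Root-mean-square of a flat sequence: sqrt(sum(v*v)) / sqrt(n).

    Hypothetical helper for illustration only; LSA itself calls
    np.linalg.norm(H) / len(H)**.5 on a NumPy array.
    """
    flat = list(values)
    if not flat:
        # An empty vector usually means an upstream file was empty or missing.
        raise ValueError("empty vector: check that upstream count files were written")
    for v in flat:
        if not isinstance(v, (int, float)):
            # Nested lists or strings suggest the data has an unexpected shape.
            raise ValueError("expected a flat sequence of numbers, got %s" % type(v).__name__)
    return math.sqrt(sum(v * v for v in flat)) / math.sqrt(len(flat))


print(rms_norm([3.0, 4.0]))
```

A check like this only turns a generic alignment error into a more specific one; the underlying cause still has to be found in the earlier pipeline steps.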

brian-cleary commented 8 years ago

Hi,

Thanks for your feedback!

First - have you successfully run through the test data? If not, that's a good starting place just to make sure everything is installed and so on.

It's a bit hard to diagnose that problem without more info. Do you mind sending me the output of "ls -l" on original_reads/, hashed_reads/, and cluster_vectors/? That would make it easier for me to tell what's going on.


nmb85 commented 8 years ago

Hi, sorry for the delay and the incomplete explanation. I attached the output from the ls -l commands you asked for. I also attached the error files for each step; each step was run multiple times and the output was dumped into the same file. Sorry for the messy files; I included them for reference, just in case. Your code works well on the test data. I've tried LSA on two datasets that I have from freshwater lake metagenomes. One dataset has four metagenome samples, the other six; both are time series. One sample in each time series has a full lane of Illumina HiSeq 3000 data, which is more than 100 GB, and the other samples are 10-20 GB each. I am able to get both datasets to the KmerCorpus step, where it complains about the matrices not being aligned (as shown above). Does any of this information help diagnose the problem?

I should also mention that I'm converting your LSF scripts to SGE scripts, so I've been tinkering with the create_jobs.py script and customizing the qsub options. That customized file (which is incomplete, since I still haven't run through a complete run on my data) is also attached (as setupDirs.txt and create_jobs.txt). Very nice programming, by the way; it's modular, easy to read, and quick to edit. Some fun tricks, too. I'm learning a lot from reading your code.

ls_-lh_cluster_vectors.txt ls_-lh_hashed_reads.txt ls_-lh_original_reads.txt

create_jobs.txt setupDirs.txt

CombineFractions-Err.txt CombineFractions-Out.txt CreateHash-Err.txt CreateHash-Out.txt GlobalWeights-Err.txt GlobalWeights-Out.txt HashReads-Err.txt HashReads-Out.txt KmerCorpus-Err.txt KmerCorpus-Out.txt MergeHash-Err.txt MergeHash-Out.txt SplitInput-Err.txt SplitInput-Out.txt

brian-cleary commented 8 years ago

It's still a bit hard to tell where the problem is.

It looks like you had some errors in the very first step (SplitInput). Are you sure this step ran correctly? This step splits all of your input data into chunks that will be worked on in the individual distributed tasks downstream, and it expects a certain formatting of the input file names (see relevant section of docs below).

If you think that that ran correctly (despite the errors), then the next thing I would ask is to see the output of "ls -l" for hashed_reads/ and cluster_vectors/
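When a plain "ls -l" listing is long, a quick way to spot trouble is to look for zero-byte files, since an empty intermediate file is one common way for a later step to receive unexpected input. The following is a rough sketch, assuming the directory names hashed_reads and cluster_vectors sit under the current working directory; the function names are made up for this example and are not part of LSA.

```python
import os


def list_sizes(directory):
    """Return (name, size_in_bytes) pairs for regular files, sorted by name."""
    entries = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries


def empty_files(directory):
    """Names of zero-byte files in the directory (hypothetical helper)."""
    return [name for name, size in list_sizes(directory) if size == 0]


if __name__ == "__main__":
    # Directory names taken from the thread; adjust to the project layout.
    for d in ("hashed_reads", "cluster_vectors"):
        if not os.path.isdir(d):
            print("%s: directory not found" % d)
            continue
        sizes = list_sizes(d)
        print("%s: %d files, %d empty" % (d, len(sizes), len(empty_files(d))))
```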

Splitting the input files

Begin by splitting the original reads (from many samples) into many small files:

$ bsub < LSFScripts/SplitInput_ArrayJob.q

The purpose of this is to create many small files that can each be operated on by a single task in a distributed environment. The size of many job arrays downstream from this point is set by the number of chunks created in this step. Note that this code assumes the files are named sampleid..fastq.1 and sampleid..fastq.2 for paired reads. If you used some other naming convention, this needs to be reflected in line 26 in array_merge.py:

WARNING

If your input files are significantly different from paired fastq files separated into 2 parts (.fastq.1 and .fastq.2) plus a singleton file (.single.fastq.1), then you will either need to modify these python files, or just take it upon yourself to split your files into chunks containing ~1 million reads each, and named like: sample_id.fastq.xxx, where ".xxx" is the chunk number (e.g. '.021')
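Since a naming mismatch can surface much later as an unrelated-looking error, it may help to check file names before launching any jobs. Here is a small sketch, assuming the chunk convention quoted above (sample_id.fastq.xxx with a numeric chunk suffix). The regular expression, the function name, and the example file names are illustrative assumptions; this is not the check that array_merge.py performs.

```python
import re

# Assumed pattern: "<sample_id>.fastq.<chunk>", where <chunk> is all digits
# (e.g. "021"). This mirrors the convention described in the docs excerpt,
# not the actual code in array_merge.py.
CHUNK_NAME = re.compile(r"^(?P<sample>.+)\.fastq\.(?P<chunk>\d+)$")


def split_by_convention(filenames):
    """Group conforming names by sample id and collect the rest separately."""
    ok, bad = {}, []
    for name in filenames:
        match = CHUNK_NAME.match(name)
        if match:
            ok.setdefault(match.group("sample"), []).append(match.group("chunk"))
        else:
            bad.append(name)
    return ok, bad


# Made-up file names for illustration.
names = ["lakeA.fastq.001", "lakeA.fastq.002", "lakeB_R1.fq.gz"]
conforming, nonconforming = split_by_convention(names)
print("conforming:", conforming)
print("needs renaming:", nonconforming)
```

Any name that lands in the second list would need renaming (or a matching change in the splitting code) before the downstream steps can be expected to find it.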


nmb85 commented 8 years ago

Hi, Brian,

Thanks for the feedback. Many of those complaints from the bash shell occur on all our jobs (our admin hasn't bothered to fix them because they're innocuous). I'm pretty sure the split reads step worked properly and I followed your - very well written - instructions carefully. I will try again from scratch then send you the ls -l on the output directories. Thanks very much for your time; it's probably nearly impossible to troubleshoot from across the continent, but I appreciate your help.

nmb85 commented 8 years ago

Brian, I was able to resolve this by carefully heeding your warning: "If you used some other naming convention, this needs to be reflected in line 26 in array_merge.py" Thanks very much for pointing out the error; not sure why I missed that error message before. I appreciate your time!

brian-cleary commented 8 years ago

That's great! Sorry again for any of my slow responses.

Let me know if you run into any more issues.


nmb85 commented 8 years ago

No worries, Brian. I did have the same issue as #7. If you check over there, I posted all the relevant data I could find about the problem. Thanks again, man!