brian-cleary / LatentStrainAnalysis

Partitioning and analysis methods for large, complex sequence datasets
MIT License

Error in streaming SVD of abundance matrix #8

Open condomitti opened 8 years ago

condomitti commented 8 years ago

Hello,

I've been trying to run the LSA scripts on my own dataset, but I get a 'float division by zero' error no matter what I do with the input data. I was able to run the entire pipeline with the test data, but not with my own set (Illumina MiSeq paired-end reads, organized into a single interleaved file as generated by LSFScripts/merge_and_split_pair_files.py).

This is the error LSA is printing out:

Starting streaming SVD of conditioned k-mer abundance matrix
printing end of last log file...
    self.add_documents(corpus)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 387, in add_documents
    update = Projection(self.num_terms, self.num_topics, job, extra_dims=self.extra_samples, power_iters=self.power_iters)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 127, in __init__
    extra_dims=self.extra_dims)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 742, in stochastic_svd
    keep = clip_spectrum(s**2, rank, discard=eps)
  File "/usr/lib/python2.7/dist-packages/gensim/models/lsimodel.py", line 86, in clip_spectrum
    small = 1 + len(numpy.where(rel_spectrum > min(discard, 1.0 / k))[0])
ZeroDivisionError: float division by zero
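For reference, the failing expression in the traceback divides by the requested rank `k`. A minimal sketch of that logic (simplified from gensim's lsimodel.py, not the exact library code) reproduces the error whenever the rank works out to zero:

```python
import numpy as np

def clip_spectrum(s2, k, discard=1e-9):
    # Simplified sketch of gensim's clip_spectrum: s2 holds the squared
    # singular values and k is the requested rank (number of topics).
    rel_spectrum = np.abs(1.0 - np.cumsum(s2 / np.sum(s2)))
    # 1.0 / k is the failing expression: it raises ZeroDivisionError
    # whenever the requested rank k is 0.
    small = 1 + len(np.where(rel_spectrum > min(discard, 1.0 / k))[0])
    return min(k, small)

print(clip_spectrum(np.array([4.0, 1.0]), k=2))  # works with a positive rank

try:
    clip_spectrum(np.array([4.0, 1.0]), k=0)
except ZeroDivisionError as err:
    print("reproduced:", err)
```

This suggests the input to the SVD stage was degenerate enough that the effective rank came out as zero, rather than a problem in gensim itself.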

Is this an issue or am I doing something wrong? Apparently this error occurs after hash counting finishes.

Thank you in advance.

Best, Condomitti.

brian-cleary commented 8 years ago

Hi,

Sorry for my slow response.

Are you running the distributed version, or the single instance version?

Do you mind sending me the output of "ls -l" for hashed_reads/ and cluster_vectors/? I think that will help me to diagnose the issue.


baravalle commented 8 years ago

Hi, I'm trying this on a different dataset but I get stuck on exactly the same error. Did you manage to go past this?

Any suggestions?

I have included below the contents of my hashed_reads and cluster_vectors folders.

Andres

ls -l hashed_reads/
total 18967940
-rw-r--r--. 1 root root           2 Feb 16 18:54 hashParts.txt
-rw-r--r--. 1 root root     8388608 Feb 18 05:41 MET0432.count.hash
-rw-r--r--. 1 root root    16777216 Feb 18 05:41 MET0432.count.hash.conditioned
-rw-r--r--. 1 root root    30228672 Feb 18 05:41 MET0432.nonzero.npy
-rw-r--r--. 1 root root 19367708753 Feb 18 02:48 MET0432.prinseqoutput.hashq.gz
-rw-r--r--. 1 root root       50001 Feb 16 18:54 Wheels.txt

ls -l cluster_vectors/
total 16388
-rw-r--r--. 1 root root    16777296 Feb 18 05:41 global_weights.npy
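One quick sanity check on a listing like the one above: each input sample should leave its own `.count.hash` file in hashed_reads/, so counting those files shows how many samples actually entered the pipeline (path taken from the listing; adjust as needed):

```shell
# Count per-sample hash files; a count of 1 means only a single
# sample went into the pipeline.
ls hashed_reads/*.count.hash | wc -l
```

In the listing above only MET0432.count.hash is present, i.e. one sample.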

brian-cleary commented 8 years ago

Hi Andres,

Is it the case that you have only a single sample there? The premise of LSA is to use covariance information across multiple samples, and the SVD step in particular will need multiple samples to work. I haven't tested the pipeline with a single sample to see if it generates this error, but it certainly could be the case.
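To illustrate that point (a toy sketch, not the LSA code): a k-mer abundance matrix built from a single sample has only one column, so its spectrum collapses to at most one singular value and there is no cross-sample covariance for the SVD to exploit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_kmers = 1000

# Six samples: the abundance matrix can carry cross-sample covariance.
multi = rng.poisson(5.0, size=(n_kmers, 6)).astype(float)
# One sample: a single-column matrix has at most one singular value.
single = rng.poisson(5.0, size=(n_kmers, 1)).astype(float)

print(np.linalg.matrix_rank(multi))   # typically 6
print(np.linalg.matrix_rank(single))  # at most 1
```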


condomitti commented 8 years ago

Hi Andres and Brian,

Sorry for my late response. I managed to get to the final results by executing the LSA steps separately rather than calling the single script as shown on the sample page. Other than that, nothing special was necessary.

Take care, Condomitti.

baravalle commented 8 years ago

Hi Brian, Condomitti, thanks for your answers.

Brian, I'm coming to this from a computing background (not that familiar with LSA right now) as part of a multi-disciplinary team. It looks like you are right: the data we used as a test may have come from a single sample.

Will do a new test tomorrow, hopefully with the right data, and will ping back.

Thanks again for the help,

  Andres