RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

Expected more than 1 value per channel when training #5

Closed nick-youngblut closed 5 years ago

nick-youngblut commented 5 years ago

I'm getting the following error with a test run of vamb (commit: e70179c8d39496f49d0ff4d70e40d43876dc84f9):

Traceback (most recent call last):
  File "../vamb/run.py", line 311, in <module>
    logfile=logfile)
  File "../vamb/run.py", line 140, in main
    dropout, cuda, batchsize, nepochs, lrate, batchsteps, logfile)
  File "../vamb/run.py", line 91, in trainvae
    logfile=logfile, modelfile=modelpath)
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 433, in trainmodel
    dataloader = self.trainepoch(dataloader, epoch, optimizer, batchsteps, logfile)
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 262, in trainepoch
    depths_out, tnf_out, mu, logsigma = self(depths_in, tnf_in)
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/vamb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 219, in forward
    mu, logsigma = self._encode(tensor)
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 180, in _encode
    tensor = encodernorm(self.dropoutlayer(self.relu(encoderlayer(tensor))))
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/vamb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/vamb/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 49, in forward
    self.training or not self.track_running_stats, self.momentum, self.eps)
  File "/ebio/abt3_projects/software/dev/miniconda3_dev/envs/vamb/lib/python3.6/site-packages/torch/nn/functional.py", line 1191, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size [1, 325]

The input is 10 BAM files and a contig file: python ../vamb/run.py vamb_out coassemble_contigs_rn_cut.fna *.bam

I don't really understand the ValueError message. Any ideas on what's going on?

My conda env:

bcftools                  1.9                  h4da6232_0    bioconda
blas                      1.0                         mkl
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
certifi                   2018.11.29            py36_1000    conda-forge
cffi                      1.11.5           py36h5e8e0c9_1    conda-forge
cudatoolkit               9.0                  h13b8566_0
cudnn                     7.1.2                 cuda9.0_0
curl                      7.63.0               h74213dd_0    conda-forge
cython                    0.29.2           py36hfc679d8_0    conda-forge
htslib                    1.9                  hc238db4_4    bioconda
intel-openmp              2019.1                      144
krb5                      1.16.2               hbb41f41_0    conda-forge
libcurl                   7.63.0               hbdb9355_0    conda-forge
libdeflate                1.0                  h470a237_0    bioconda
libedit                   3.1.20170329         haf1bffa_1    conda-forge
libffi                    3.2.1                hfc679d8_5    conda-forge
libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
libgfortran               3.0.0                         1    conda-forge
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
libssh2                   1.8.0                h5b517e9_3    conda-forge
libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
mkl                       2018.0.3                      1
mkl_fft                   1.0.10                   py36_0    conda-forge
mkl_random                1.0.2                    py36_0    conda-forge
nccl                      1.3.5                 cuda9.0_0
ncurses                   6.1                  hfc679d8_2    conda-forge
ninja                     1.8.2                h2d50403_1    conda-forge
numpy                     1.15.0           py36h1b885b7_0
numpy-base                1.15.0           py36h3dfced4_0
openblas                  0.3.3                ha44fe06_1    conda-forge
openssl                   1.0.2p               h470a237_1    conda-forge
pigz                      2.3.4                         0    conda-forge
pip                       18.1                  py36_1000    conda-forge
pycparser                 2.19                       py_0    conda-forge
pysam                     0.15.1           py36h0380709_0    bioconda
python                    3.6.7                h5001a0f_1    conda-forge
pytorch                   0.4.0            py36hdf912b8_0
readline                  7.0                  haf1bffa_1    conda-forge
samtools                  1.9                  h8ee4bcc_1    bioconda
setuptools                40.6.3                   py36_0    conda-forge
sqlite                    3.26.0               hb1c47c0_0    conda-forge
tk                        8.6.9                ha92aebf_0    conda-forge
wheel                     0.32.3                   py36_0    conda-forge
xz                        5.2.4                h470a237_1    conda-forge
zlib                      1.2.11               h470a237_3    conda-forge
jakobnissen commented 5 years ago

Hey @nick-youngblut

I've found the source of your problem: it seems that when n_sequences % batch_size == 1, the last minibatch contains a lone observation, and a single-observation minibatch cannot be batch normalized. I have just pushed a small update (adding drop_last=True to the DataLoader in vamb.encode._make_dataloader) which should solve the issue. Can you verify that it works properly with the latest commit?
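
For reference, a minimal standalone sketch (not Vamb's actual code) of why a lone trailing observation breaks batch normalization, and how drop_last=True avoids it:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.randn(65, 103)               # 65 % 64 == 1, so the last batch holds a single sample
bn = torch.nn.BatchNorm1d(103)            # in training mode, BatchNorm needs more than 1 sample per batch

try:
    for (batch,) in DataLoader(TensorDataset(data), batch_size=64):
        bn(batch)
except ValueError as err:
    print(err)                            # "Expected more than 1 value per channel when training ..."

# With drop_last=True the size-1 remainder batch is discarded and every batch normalizes fine.
for (batch,) in DataLoader(TensorDataset(data), batch_size=64, drop_last=True):
    bn(batch)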

Thanks for debugging this!

nick-youngblut commented 5 years ago

Thanks @jakobnissen for the quick fix! Your commit seems to have fixed that error, but now I'm getting a new one:

Traceback (most recent call last):
  File "../vamb/run.py", line 311, in <module>
    logfile=logfile)
  File "../vamb/run.py", line 135, in main
    rpkms = calc_rpkm(outdir, bampaths, mincontiglength, minalignscore, subprocesses, len(contignames), logfile)
  File "../vamb/run.py", line 56, in calc_rpkm
    logfile=logfile)
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/parsebam.py", line 259, in read_bamfiles
    if not all(segment.has_tag("AS") for segment in segments):
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/parsebam.py", line 259, in <genexpr>
    if not all(segment.has_tag("AS") for segment in segments):
AttributeError: 'int' object has no attribute 'has_tag'

Maybe my bam files are formatted incorrectly?

jakobnissen commented 5 years ago

@nick-youngblut Oh dear! I made the rookie mistake of developing directly on the master branch, and that mistake crept in. Fixed.

nick-youngblut commented 5 years ago

No worries. I pulled the updates to the master branch, and now I'm getting a new error:

Traceback (most recent call last):
  File "../vamb/run.py", line 311, in <module>
    logfile=logfile)
  File "../vamb/run.py", line 140, in main
    dropout, cuda, batchsize, nepochs, lrate, batchsteps, logfile)
  File "../vamb/run.py", line 91, in trainvae
    logfile=logfile, modelfile=modelpath)
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 435, in trainmodel
    dataloader = self.trainepoch(dataloader, epoch, optimizer, batchsteps, logfile)
  File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 279, in trainepoch
    epoch_loss / len(data_loader),
ZeroDivisionError: division by zero

Sorry to keep throwing errors at you!

jakobnissen commented 5 years ago

This is definitely a less expected error. It seems that the number of sequences is less than the batch size. Can that be true?

If not, could you please make a dataloader with your dataset manually? There should be a tnf.npz and an rpkm.npz file in your output directory. Import vamb in a Python session, load the files using vamb.vambtools.read_npz, then make the dataloader using vamb.encode.make_dataloader. Once it is created, can you iterate over it?

Edit: I really appreciate this feedback, by the way. These bugs should definitely be straightened out ASAP. On the dev branch, I've just added a check that complains if you try to create a DataLoader with a batch size larger than the dataset.

nick-youngblut commented 5 years ago

For this test run, I'm using simulated metagenomes comprising just 10 samples with 2e6 paired-end 150 bp reads each.

I've loaded the files, and here's what I'm getting:

x = vamb.vambtools.read_npz('/ebio/abt3_projects/software/dev/llmga/tmp/vamb_test/vamb_out/rpkm.npz')
y = vamb.vambtools.read_npz('/ebio/abt3_projects/software/dev/llmga/tmp/vamb_test/vamb_out/tnf.npz')
z = vamb.encode.make_dataloader(x, y)
print(z)
# (<torch.utils.data.dataloader.DataLoader object at 0x7f5551bec9b0>, array([ True, False, False, ..., False, False, False]))
len(z[1])
# 27551
jakobnissen commented 5 years ago

I'm at a loss here. It clearly looks like the length of the dataloader is zero. The length of the dataloader is len(dataset) // batch_size. The batch size doubles at certain epochs (the batchsteps, the -q flag in run.py), of which there are only 4 by default. So I conclude the error occurs if len(dataloader.dataset) < dataloader.batch_size * 2 ** len(batchsteps), but by default this should evaluate to 27551 < 64 * 2 ** 4 => 27551 < 1024 => False.
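
Spelling that arithmetic out as a quick sanity check (numbers taken from this thread; this is not Vamb code):

n_contigs = 27551
batch_size = 64
batchsteps = [25, 75, 150, 300]                       # default -q: the batch size doubles at each step

final_batch_size = batch_size * 2 ** len(batchsteps)  # 64 * 16 = 1024
# len(dataloader) == len(dataset) // batch_size, so a ZeroDivisionError would
# require the dataset to be smaller than the (final) batch size:
print(n_contigs < final_batch_size)                   # False, so the defaults should be fine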

I'll look into it more deeply tomorrow morning European time.

jakobnissen commented 5 years ago

@nick-youngblut Update: I've just written tests for the encoding module of Vamb, and I've encountered nothing that could cause the error you're describing. Does it fail immediately, or only after a certain number of epochs?

nick-youngblut commented 5 years ago

It seems to fail immediately.

jakobnissen commented 5 years ago

Can you report what it says under "Creating and training VAE" and "Training VAE" in the logfile?

nick-youngblut commented 5 years ago

I should have really looked at the log file. I think that I found the problem:

Starting Vamb version 0.4
    Date and time is 2018-12-19 14:57:38.762502

Calculating TNF
    Processed 134978155 bases in 27551 contigs.
    Calculated TNF in 4.59 seconds.

Calculating RPKM
    Parsing 10 BAM files with 10 subprocesses.
    Processed  MG6.bam
    Processed  MG8.bam
    Processed  MG9.bam
    Processed  MG5.bam
    Processed  MG7.bam
    Processed  MG10.bam
    Processed  MG3.bam
    Processed  MG2.bam
    Processed  MG4.bam
    Processed  MG1.bam
    Calculated RPKM in 6.75 seconds.

Creating and training VAE
    Created VAE
    Created dataloader and mask
    Number of contigs : 27550
    Number of contigs remaining: 1

Training VAE
    Network properties:
    CUDA: False
    Alpha: 0.05
    Beta: 200.0
    Dropout: 0.2
    N hidden: 325, 325
    N latent: 40

    Training properties:
    N epochs: 500
    Batch size: 64
    Batchsteps: 25, 75, 150, 300
    Learning rate: 0.001
    N contigs: 1
    N samples: 10

Only 1 contig was used, and I don't see why. The BAM files include hits to other contigs, and the contig naming in the FASTA file is very plain (e.g., "coassemble_0_0", "coassemble_0_1", or "coassemble_1_1"). Any idea why the other contigs were deemed "unsuitable for encoding"?

jakobnissen commented 5 years ago

Aha. Sequences are deemed unsuitable for encoding if they have a depth of zero across all samples or if they contain no tetramers. You can see that in your RPKM and TNF matrices, obtained by loading the .npz files as you did manually before. The RPKM matrix should have rows summing to one, and the TNF matrix should contain no all-zero rows. My bet is that something went wrong when parsing the BAM files, such that the reported RPKM is zero for all contigs but one. It also seems it calculated the RPKM in just 6.75 seconds, which is suspiciously fast. Are the BAM files empty?

Edit: If the BAM files are not empty, then something went terribly wrong in the BAM parsing. Can you send me the BAM files, perhaps by uploading them somewhere, assuming they're not too big and not sensitive data?

Edit2: I've discovered a bug where, if multiple reference sequences (i.e. contigs) in the BAM file share the same name, all reads are counted towards only the last of them. I've fixed this on my local dev branch. However, for that bug to cause your problem, with all but 1 contig unsuitable for encoding, all your references would have to share the same name in the BAM file.
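
A minimal sketch of the matrix checks described above, using the arrays loaded earlier in this thread (treating "unsuitable" as an all-zero row is my assumption, not Vamb's exact criterion):

import vamb

rpkm = vamb.vambtools.read_npz('/ebio/abt3_projects/software/dev/llmga/tmp/vamb_test/vamb_out/rpkm.npz')
tnf = vamb.vambtools.read_npz('/ebio/abt3_projects/software/dev/llmga/tmp/vamb_test/vamb_out/tnf.npz')

zero_depth = ~rpkm.any(axis=1)     # contigs with zero depth across all samples
zero_tnf = ~tnf.any(axis=1)        # contigs with no tetramers counted

print(zero_depth.sum(), "contigs have zero depth in every sample")
print(zero_tnf.sum(), "contigs have an all-zero TNF row")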

jakobnissen commented 5 years ago

@nick-youngblut I think I've determined the problem. The alignment scores in the BAM file you sent me are very low (the highest being zero). You will notice that Vamb filters away any alignments below a certain score (set by the -s flag in run.py). Try running with -s 0.
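
For example, adapting the command from earlier in this thread (the exact flag placement is my assumption): python ../vamb/run.py vamb_out coassemble_contigs_rn_cut.fna *.bam -s 0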

nick-youngblut commented 5 years ago

Yep, that seems to have worked! I'm surprised that the alignment scores are so low, given that the reads were simulated from 100 genomes, so they should map well to the contigs from that metagenome assembly. Something to look into...

I'm using -p 10, but vamb is using all cores on our server (n=80). Do I have to change another parameter in order to drop the CPU usage?

jakobnissen commented 5 years ago

The -p argument controls the number of subprocesses to spawn when reading the BAM files. However, the VAE training and encoding and the clustering are also implemented in parallel, and I have not thought to limit the number of processes there. I'll look into it :)

nick-youngblut commented 5 years ago

Thanks for the clarification! While setting cores=max_available is nice for getting software to run fast, it can make other users of the same server very angry when they see that someone is hogging all of the CPUs.

jakobnissen commented 5 years ago

Closing this issue because the original problem has been solved. Feel free to open another issue if you find another bug or problem. I'll work on limiting the number of CPUs for Numpy and PyTorch, but it seems it might not be doable for PyTorch (torch.set_num_threads simply doesn't work on my PyTorch 0.4.1).
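
A possible workaround sketch (my assumption, not something verified against Vamb itself): capping the OpenMP/MKL thread pools via environment variables before numpy and torch are first imported.

import os
os.environ["OMP_NUM_THREADS"] = "10"   # limits OpenMP-backed code, including PyTorch's CPU ops
os.environ["MKL_NUM_THREADS"] = "10"   # limits MKL, which backs this conda build of Numpy

import numpy as np
import torch

torch.set_num_threads(10)              # noted above as unreliable on PyTorch 0.4.1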

nick-youngblut commented 5 years ago

Just to follow up on the -s setting: the MAPQ scores can vary depending on the read mapper, so the default of -s 50 used by vamb will exclude all reads if bowtie2 is used (its maximum score is 42).