Hey @nick-youngblut
I've found the source of your problem: it seems that when n_sequences % batch_size == 1, the last minibatch consists of a single lone observation, and a single-observation minibatch cannot be batch normalized. I have just pushed a small update (adding drop_last=True to the DataLoader in vamb.encode._make_dataloader) which should solve the issue. Can you verify that it works properly with the latest commit?
Thanks for debugging this!
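For reference, here's a minimal sketch of what that change does, assuming a plain PyTorch TensorDataset/DataLoader along the lines of vamb.encode._make_dataloader (the tensor shapes and batch size here are just placeholders):
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical example: 129 observations with a batch size of 64 leave a
# final minibatch of exactly one observation (129 % 64 == 1), which breaks
# batch normalization during training.
depths = torch.rand(129, 10)
tnf = torch.rand(129, 136)
dataset = TensorDataset(depths, tnf)

# drop_last=True discards that lone trailing observation, so every minibatch
# the network sees contains more than one sample.
loader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True)

for depths_batch, tnf_batch in loader:
    print(depths_batch.shape)  # torch.Size([64, 10]) for both remaining batches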
Thanks @jakobnissen for the quick fix! Your commit seems to have fixed that error, but now I'm getting a new one:
Traceback (most recent call last):
File "../vamb/run.py", line 311, in <module>
logfile=logfile)
File "../vamb/run.py", line 135, in main
rpkms = calc_rpkm(outdir, bampaths, mincontiglength, minalignscore, subprocesses, len(contignames), logfile)
File "../vamb/run.py", line 56, in calc_rpkm
logfile=logfile)
File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/parsebam.py", line 259, in read_bamfiles
if not all(segment.has_tag("AS") for segment in segments):
File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/parsebam.py", line 259, in <genexpr>
if not all(segment.has_tag("AS") for segment in segments):
AttributeError: 'int' object has no attribute 'has_tag'
Maybe my bam files are formatted incorrectly?
@nick-youngblut Oh dear! I made the rookie mistake of developing on master branch directly, and that mistake crept in. Fixed.
No worries. I pulled the updates to the master branch, and now I'm getting a new error:
Traceback (most recent call last):
File "../vamb/run.py", line 311, in <module>
logfile=logfile)
File "../vamb/run.py", line 140, in main
dropout, cuda, batchsize, nepochs, lrate, batchsteps, logfile)
File "../vamb/run.py", line 91, in trainvae
logfile=logfile, modelfile=modelpath)
File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 435, in trainmodel
dataloader = self.trainepoch(dataloader, epoch, optimizer, batchsteps, logfile)
File "/ebio/abt3_projects/software/dev/llmga/tmp/vamb/encode.py", line 279, in trainepoch
epoch_loss / len(data_loader),
ZeroDivisionError: division by zero
Sorry to keep throwing errors at you!
This is definitely a less expected error. It seems that the number of sequences is less than the batch size. Can that be true?
If not, can I please ask you to make a dataloader with your dataset manually? There should be a tnf.npz and an rpkm.npz file in your output directory. Import vamb in a Python session, load the files using vamb.vambtools.read_npz, then make the dataloader using vamb.encode.make_dataloader. When created, can you iterate over it?
Edit: I'm really appreciating this feedback, by the way. These bugs should definitely be straightened out ASAP. On the dev branch, I've just added a check that complains if you try to create a DataLoader with a batch size larger than the dataset.
For this test run, I'm using simulated metagenomes comprising just 10 samples with 2e6 paired-end 150 bp reads each.
I've loaded the files, and here's what I'm getting:
x = vamb.vambtools.read_npz('/ebio/abt3_projects/software/dev/llmga/tmp/vamb_test/vamb_out/rpkm.npz')
y = vamb.vambtools.read_npz('/ebio/abt3_projects/software/dev/llmga/tmp/vamb_test/vamb_out/tnf.npz')
z = vamb.encode.make_dataloader(x, y)
print(z)
# (<torch.utils.data.dataloader.DataLoader object at 0x7f5551bec9b0>, array([ True, False, False, ..., False, False, False]))
len(z[1])
# 27551
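Iterating over it can be checked with something like this (a quick sketch, assuming make_dataloader returns a (DataLoader, mask) tuple as above):
dataloader, mask = z
n_batches = 0
for batch in dataloader:
    n_batches += 1
print(n_batches, len(dataloader))
# if this prints 0 0, the dataloader is effectively empty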
I'm at a loss here. It clearly looks like the length of the dataloader is zero. The length of the dataloader is len(dataset) // batch_size. The batch size doubles at certain epochs (the batchsteps, the -q flag in run.py), of which there are only 4 by default. So I conclude the error occurs if len(dataloader.dataset) < dataloader.batch_size * 2 ** len(batchsteps). But by default this should evaluate to 27551 < 64 * 2 ** 4 => 27551 < 1024 => False.
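Spelling out that arithmetic with the defaults from run.py (batch size 64, four batchsteps):
n_contigs = 27551
batchsize = 64
batchsteps = [25, 75, 150, 300]

# The batch size doubles at every batchstep, so the largest batch size
# reached during training is batchsize * 2 ** len(batchsteps).
final_batchsize = batchsize * 2 ** len(batchsteps)
print(final_batchsize)               # 1024
print(n_contigs < final_batchsize)   # False, so the check shouldn't trigger here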
I'll look into it more deeply tomorrow morning European time.
@nick-youngblut Update: I've just written tests for the encoding module of Vamb. I've encountered nothing that could cause this error you're describing. Does it fail immediately, or only after a certain number of epochs?
It seems to fail immediately
Can you report what it says under "Creating and training VAE" and "Training VAE" in the logfile?
I should have really looked at the log file. I think that I found the problem:
Starting Vamb version 0.4
Date and time is 2018-12-19 14:57:38.762502
Calculating TNF
Processed 134978155 bases in 27551 contigs.
Calculated TNF in 4.59 seconds.
Calculating RPKM
Parsing 10 BAM files with 10 subprocesses.
Processed MG6.bam
Processed MG8.bam
Processed MG9.bam
Processed MG5.bam
Processed MG7.bam
Processed MG10.bam
Processed MG3.bam
Processed MG2.bam
Processed MG4.bam
Processed MG1.bam
Calculated RPKM in 6.75 seconds.
Creating and training VAE
Created VAE
Created dataloader and mask
Number of contigs : 27550
Number of contigs remaining: 1
Training VAE
Network properties:
CUDA: False
Alpha: 0.05
Beta: 200.0
Dropout: 0.2
N hidden: 325, 325
N latent: 40
Training properties:
N epochs: 500
Batch size: 64
Batchsteps: 25, 75, 150, 300
Learning rate: 0.001
N contigs: 1
N samples: 10
Only 1 contig was used, and I don't see why. The bam files include hits to other contigs. The contig naming in the fasta file is very plain (e.g., "coassemble_0_0", "coassemble_0_1", or "coassemble_1_1"). Any idea why the other contigs were "unsuitable for encoding"?
Aha. Sequences are deemed unsuitable for encoding if they have a depth of zero across all samples or if they contain no tetramers. You can see that in your RPKM and TNF matrices by loading the .npz files as you did manually before. The RPKM matrix should have rows summing to one. The TNF matrix should contain no all-zero rows.
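A quick way to check both conditions on the arrays loaded earlier (a sketch, assuming x and y are the rpkm.npz and tnf.npz arrays from vamb.vambtools.read_npz, with one row per contig):
# Contigs with zero depth across all samples: the whole RPKM row is zero.
zero_depth = (x == 0).all(axis=1)
# Contigs with no tetramers counted: the whole TNF row is zero.
zero_tnf = (y == 0).all(axis=1)

print("contigs with all-zero RPKM rows:", zero_depth.sum())
print("contigs with all-zero TNF rows: ", zero_tnf.sum())
print("contigs kept for encoding:      ", (~zero_depth & ~zero_tnf).sum())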
My bet is that something went wrong when parsing the BAM files, such that the reported RPKM is zero for all contigs but one. It also seems it calculated the RPKM in only 6.75 seconds, which is suspiciously fast. Are the BAM files empty?
Edit: If the BAM files are not empty, then something went terribly wrong in the BAM parsing. Can you send me the BAM files, perhaps by uploading them somewhere, assuming they're not too big and not sensitive data?
Edit 2: I've discovered a bug where, if multiple reference sequences (i.e. contigs) in the BAM file share the same name, all reads are counted towards only the last of them. I've fixed this on my local dev branch. However, for that bug to cause your problem, given that all but one contig is unsuitable for encoding, essentially all your references would have to share the same name in the BAM file.
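That scenario can be checked directly from a BAM header (a sketch with pysam; MG1.bam is just one of the files named in the log above):
import pysam
from collections import Counter

with pysam.AlignmentFile("MG1.bam", "rb") as bam:
    name_counts = Counter(bam.references)

duplicates = {name: n for name, n in name_counts.items() if n > 1}
print("duplicated reference names:", duplicates or "none")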
@nick-youngblut I think I've determined the problem. The alignment scores in the BAM files you sent me are very low (the highest being zero). You will notice that Vamb filters away any alignments below a certain score (set by the -s flag in run.py). Try running with -s 0.
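Conceptually, the filter works like this (a hedged sketch using pysam, not the exact vamb code; minscore stands in for the value passed with -s):
import pysam

def alignments_passing(path, minscore):
    """Yield alignments whose 'AS' alignment-score tag is at least minscore."""
    with pysam.AlignmentFile(path, "rb") as bamfile:
        for segment in bamfile:
            # Unmapped reads carry no alignment score; skip them.
            if not segment.has_tag("AS"):
                continue
            if segment.get_tag("AS") >= minscore:
                yield segment

# With -s 0, alignments whose best reported score is zero are still kept.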
Yep, that seems to have worked! I'm surprised that the alignment scores are so low, given that the reads were simulated from 100 genomes, so they should map well to the contigs from that metagenome assembly. Something to look into...
I'm using -p 10, but vamb is using all cores on our server (n=80). Do I have to change another parameter in order to drop the CPU usage?
The -p argument controls the number of subprocesses to spawn when reading the BAM files. However, the VAE training, the encoding and the clustering are also implemented in parallel, and I have not thought to limit the number of processes there. I'll look into it :)
Thanks for the clarification! While setting cores=max_available is nice for getting software to run fast, it can make other users of the same server very angry when they see that someone is hogging all of the CPUs.
Closing this issue because the original issue has been solved. Feel free to open another issue if you find another bug/problem.
I'll work on limiting the number of CPUs for NumPy and PyTorch, but it seems it might not be doable for PyTorch (torch.set_num_threads simply doesn't work on my PyTorch 0.4.1).
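In the meantime, a user-side workaround is usually to cap the thread counts of the numerical libraries before they are imported (a sketch; whether torch.set_num_threads has any effect depends on the PyTorch version, as noted above):
import os

# These must be set before numpy/torch are first imported to take effect.
os.environ["OMP_NUM_THREADS"] = "10"
os.environ["MKL_NUM_THREADS"] = "10"

import torch
torch.set_num_threads(10)  # may be a no-op on older PyTorch such as 0.4.1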
Just to follow up on the -s setting: the mapq scores can vary depending on the read mapper, so the default of -s 50 used by vamb will exclude all reads if bowtie2 was used (its maximum score is 42).
I'm getting the following error with a test run of vamb (commit: e70179c8d39496f49d0ff4d70e40d43876dc84f9). The input is 10 bam files and a contig file:
python ../vamb/run.py vamb_out coassemble_contigs_rn_cut.fna *.bam
I don't really understand the ValueError message. Any ideas on what's going on?
My conda env: