RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

vamb ValueError: TNF row at index 160128 is all zeros. #276

Closed ChaoXianSen closed 11 months ago

ChaoXianSen commented 11 months ago

Dear @sgalkina,

My pipeline:

$ vamb --outdir ${outdir} \
    --fasta ${fname}.contigs_headNoSpace.fa \
    --bamfiles ${fname}_sort_changed_header.bam

The error is as follows:

Traceback (most recent call last):
  File "/public/home/bioinfo_wang/00_software/miniconda3/envs/avamb/bin/vamb", line 33, in <module>
    sys.exit(load_entry_point('vamb', 'console_scripts', 'vamb')())
  File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 1387, in main
    run(
  File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 768, in run
    data_loader = vamb.encode.make_dataloader(
  File "/public/home/bioinfo_wang/00_software/vamb/vamb/encode.py", line 113, in make_dataloader
    raise ValueError(
ValueError: TNF row at index 160128 is all zeros. This implies that the sequence contained no 4-mers of A, C, G, T or U, making this sequence uninformative. This is probably a mistake. Verify that the sequence contains usable information (e.g. is not all N's)

All of the other samples I have run worked; only two samples fail with this error. What went wrong?

Looking forward to your reply!

jakobnissen commented 11 months ago

Have you checked the DNA/RNA sequence number 160128 in your FASTA file, as indicated by the error message? Does its content look normal?

ChaoXianSen commented 11 months ago

Record number 160128 in the BAM file: [screenshot]

Its content looks normal.

jakobnissen commented 11 months ago

Which version of Vamb are you running?

ChaoXianSen commented 11 months ago

[screenshot]

jakobnissen commented 11 months ago

Also, the BAM record you posted contains the 160128th read, whereas the error pertains to the 160128th contig. Can you check that in the FASTA file?

ChaoXianSen commented 11 months ago

test.txt [screenshot]

jakobnissen commented 11 months ago

Ah, that file contains contigs shorter than 2 kbp, which are filtered away. Is there an output file called "contignames"? In that, find the 160128th contig name, and use the name of that contig to find the sequence. Sorry for the hassle!
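For reference, the lookup described above could be sketched roughly as follows. This is only an illustration: it assumes that "contignames" holds one contig name per line and that the index in the error message is 0-based, neither of which is confirmed in this thread. The helper name contig_name_at is made up for this sketch.

```python
from itertools import islice


def contig_name_at(lines, index):
    """Return the entry at 0-based position `index` of an iterable of lines.

    Assumption (not confirmed in this thread): the file has one contig
    name per line and the reported index is 0-based.
    """
    if index < 0:
        raise IndexError("index must be non-negative")
    # islice skips the first `index` lines without loading the whole file.
    for line in islice(lines, index, index + 1):
        return line.strip()
    raise IndexError(f"fewer than {index + 1} entries available")


if __name__ == "__main__":
    # Small in-memory demo; for a real run, pass an open file handle:
    #     with open("contignames") as handle:
    #         print(contig_name_at(handle, 160128))
    demo = ["contig_a\n", "contig_b\n", "contig_c\n"]
    print(contig_name_at(demo, 1))
```

If the index turns out to be 1-based, subtract one before calling the helper.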

ChaoXianSen commented 11 months ago

My pipeline:

vamb --outdir ${fname} \
    --fasta ${fname}.contigs_headNoSpace.fa \
    --bamfiles ${fname}_sort_changed_header.bam

'${fname}.contigs_headNoSpace.fa' is the raw contig file (assembled with MEGAHIT; it contains sequences shorter than 2 kbp).

I just want to use vamb to get the file 'cluster.tsv' for further analysis (vamb -> PHAMB).

For sample F183, the program fails with an error and does not produce the file 'contignames'. [screenshot]

But for another sample, F1, the file 'contignames' looks like this. Contigs shorter than 2 kbp do not seem to have been removed: [screenshot]

jakobnissen commented 11 months ago

Okay. I still need to see that contig 160128 looks good. Can you run the following code in the directory which failed, such that it has access to the file composition.npz?

import vamb
comp = vamb.parsecontigs.Composition.load("composition.npz")
N = 160128
print(comp.metadata.lengths[N])
print(comp.metadata.identifiers[N])
print(comp.matrix[N])

That should print the contig name (and some other info I'm interested in). Then, given the contig name XXX, you can do grep "^>XXX" -A 2 my_contigs.fasta
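If grep is not convenient, a rough Python equivalent might look like the sketch below. It pulls one record out of a FASTA file and counts how many 4-letter windows consist only of A, C, G, T or U, which is the property the error message complains about. The function names are invented for this example, and the counting here is not Vamb's own implementation.

```python
def read_fasta_record(lines, name):
    """Return the sequence whose header's first word equals `name`, else None.

    Unlike grep "^>XXX", this matches the whole first word of the header,
    not just a prefix.
    """
    parts = []
    found = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if found:
                break  # reached the next record, so stop reading
            header = line[1:].split()
            found = bool(header) and header[0] == name
        elif found:
            parts.append(line)
    return "".join(parts) if found else None


def count_usable_4mers(sequence, k=4):
    """Count length-k windows made up only of A, C, G, T or U (any case)."""
    valid = set("ACGTUacgtu")
    run = 0    # length of the current stretch of valid bases
    count = 0
    for base in sequence:
        run = run + 1 if base in valid else 0
        if run >= k:
            count += 1
    return count


if __name__ == "__main__":
    demo = [">XXX some description", "ACGTN", "ACGTACGT", ">other", "NNNN"]
    seq = read_fasta_record(demo, "XXX")
    print(len(seq), count_usable_4mers(seq))
```

A count of zero for a contig would mean the sequence really is uninformative; a large count means the problem lies elsewhere.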

ChaoXianSen commented 11 months ago

[screenshot] [screenshot] [screenshot]

Its content looks normal.

jakobnissen commented 11 months ago

Okay, I found the bug! This is indeed a bug in Vamb and has nothing to do with your particular sequence. It just so happens that the vector you print has a sum that is exactly zero, and this causes a bug in Vamb. I'll push a fix ASAP.
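To illustrate the kind of problem described here (this is not Vamb's actual code, only a sketch with made-up numbers): a vector containing both positive and negative entries can sum to exactly zero without being all zeros, so a check based on the row sum can misfire where an element-wise check would not.

```python
def sums_to_zero(row):
    """Check based on the sum: True whenever the entries cancel out exactly."""
    return sum(row) == 0.0


def is_all_zero(row):
    """Element-wise check: True only if every single entry is zero."""
    return all(value == 0.0 for value in row)


if __name__ == "__main__":
    # Hypothetical values chosen so that they cancel exactly.
    row = [0.25, -0.5, 0.25]
    print(sums_to_zero(row))  # the sum-based check flags this row
    print(is_all_zero(row))   # the element-wise check does not
```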

ChaoXianSen commented 11 months ago

[screenshot]

According to the log file, the run does not seem to reach the "Creating and training VAE" step.

ChaoXianSen commented 11 months ago

> Okay, I found the bug! This is indeed a bug in Vamb and has nothing to do with your particular sequence. It just so happens that the vector you print has a sum that is exactly zero, and this causes a bug in Vamb. I'll push a fix ASAP.

So that is the cause. Thank you very much for your reply and your help!

ChaoXianSen commented 11 months ago

I have another question: avamb seems to run slowly. What can I do to speed it up?

My pipeline:

vamb --outdir ${fname} \
    --fasta ${fname}.contigs_headNoSpace.fa \
    --bamfiles ${fname}_sort_changed_header.bam

jakobnissen commented 11 months ago

Your best bet would be to use a GPU, and set --cuda when running. This will speed up training and clustering quite a bit. What step in particular is slow? You can check the log file.
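As a rough sketch of how this could be scripted, the following builds the same command line as in the question and appends --cuda only when requested. The helper names are invented for this example; the availability check assumes PyTorch is installed in the same environment as Vamb and simply reports False otherwise.

```python
import importlib.util


def cuda_is_available():
    """True only if PyTorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


def build_vamb_command(outdir, fasta, bamfile, use_cuda):
    """Return the vamb command as an argument list, optionally with --cuda."""
    command = ["vamb", "--outdir", outdir, "--fasta", fasta, "--bamfiles", bamfile]
    if use_cuda:
        command.append("--cuda")
    return command


if __name__ == "__main__":
    # Placeholder file names; substitute the real paths.
    cmd = build_vamb_command("out", "contigs.fa", "sample.bam", cuda_is_available())
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once the paths are filled in.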

ChaoXianSen commented 11 months ago

> Okay, I found the bug! This is indeed a bug in Vamb and has nothing to do with your particular sequence. It just so happens that the vector you print has a sum that is exactly zero, and this causes a bug in Vamb. I'll push a fix ASAP.

[screenshot]

Is the vector you mean comp.matrix[N]? sum(comp.matrix[N]) does not seem to be equal to zero?

ChaoXianSen commented 11 months ago

> Your best bet would be to use a GPU, and set --cuda when running. This will speed up training and clustering quite a bit. What step in particular is slow? You can check the log file.

[screenshot]

[screenshot]

The "Creating and training VAE" step is the slow one. How can I speed it up? Looking forward to your reply again!

jakobnissen commented 11 months ago

> sum(comp.matrix[N]) does not seem to be equal to zero?

Hmm... it's possible that this is because it's computed slightly differently in Vamb, so there might be some rounding error where it may return either 3.79e-9 or 0.0, depending on the exact order of the floating point operations.
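A small, self-contained demonstration of that effect, using made-up numbers rather than the actual TNF values: adding the same three floats in a different order gives either exactly zero or a nonzero result.

```python
import math

# Hypothetical values: two large entries that cancel, plus one small entry.
values = [1e17, 1.0, -1e17]

# Left to right: 1.0 is lost when added to 1e17, so the total is exactly 0.0.
left_to_right = (values[0] + values[1]) + values[2]

# Cancelling the large entries first preserves the small one.
reordered = (values[0] + values[2]) + values[1]

# math.fsum tracks the lost low-order bits and returns the exact sum.
exact = math.fsum(values)

print(left_to_right, reordered, exact)
```

The same mechanism, at a much smaller scale, would explain one code path returning 0.0 and another returning a value around 3.79e-9.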

ChaoXianSen commented 11 months ago

> > sum(comp.matrix[N]) does not seem to be equal to zero?
>
> Hmm... it's possible that this is because it's computed slightly differently in Vamb, so there might be some rounding error where it may return either 3.79e-9 or 0.0, depending on the exact order of the floating point operations.

OK, I get it. Thanks a lot!