Closed: jolespin closed this issue 3 years ago
For running with different hyperparameters you can basically try what you want and see what works for you. We suggest trying a slightly smaller and a slightly larger network as a good starting point. Feel free to try other combinations - we have seen that, depending on the dataset, some hyperparameters can work a bit better than others.
For the `-n` parameter, use it to set the sizes of the hidden layers, i.e. `-n 512 512` will give you two hidden layers of 512 neurons each. You can try any combination of neurons per layer and number of layers.
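As an illustration, a small grid around a two-layer setup could be enumerated with a short script. This is just a sketch: the catalogue and depths paths and the output directory names are hypothetical, and only the `-n` values vary here.

```python
# Sketch: enumerate a small grid of hidden-layer sizes for vamb's -n flag.
# The input paths (catalogue.fna.gz, depths.tsv) and output directories
# are hypothetical placeholders.
hidden_layer_grid = [
    [256, 256],    # slightly smaller network
    [512, 512],    # two layers of 512, as in the example above
    [1024, 1024],  # slightly larger network
]

commands = []
for layers in hidden_layer_grid:
    tag = "x".join(str(n) for n in layers)
    n_args = " ".join(str(n) for n in layers)
    commands.append(
        f"vamb --outdir runs/n_{tag} --fasta catalogue.fna.gz "
        f"--jgi depths.tsv -n {n_args}"
    )

for cmd in commands:
    print(cmd)
```

Each printed line is one run; keeping one output directory per setting makes it easy to compare the resulting clusters afterwards.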
For `jgi_summarize_bam_contig_depths` you should be able to use its default output as input with `--jgi`. There were some issues with this in the past, so let us know if it does not work. If you look at the Snakemake file in the workflow folder you can see that we convert it to the RPKM format, where the first column is the contig name and the remaining columns are the abundances from each sample.
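For reference, reducing the jgi depth file to that layout (contig name plus one mean-depth column per sample) amounts to dropping the `contigLen`, `totalAvgDepth`, and per-sample `-var` columns. The sketch below shows that column filtering on a tiny fabricated example; it is not vamb's actual parser, just an illustration of the format.

```python
def jgi_to_abundance_rows(lines):
    """Drop contigLen, totalAvgDepth and the per-sample '-var' columns
    from jgi_summarize_bam_contig_depths output, keeping contigName
    plus one mean-depth column per sample."""
    header = lines[0].rstrip("\n").split("\t")
    keep = [
        i for i, name in enumerate(header)
        if name == "contigName"
        or (name not in ("contigLen", "totalAvgDepth")
            and not name.endswith("-var"))
    ]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield [fields[i] for i in keep]

# Tiny fabricated example with two samples:
demo = [
    "contigName\tcontigLen\ttotalAvgDepth\ts1.bam\ts1.bam-var\ts2.bam\ts2.bam-var\n",
    "contig_1\t1500\t3.2\t2.1\t0.4\t4.3\t0.9\n",
]
for row in jgi_to_abundance_rows(demo):
    print("\t".join(row))
```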
Thanks, that makes a lot of sense for the `-n` parameter. I took a look at the code for `--jgi` and it appears to do all the work necessary to remove the unneeded columns. Unfortunately I couldn't get the GPU installation working on our server, so multi-binning (which I was very excited to use) was going to take too long with the hyperparameter tuning. Instead I'm running it per sample using a pretty extensive grid, then running DASTool afterwards to get the best combinations. I'll post my results here in case they're useful for future hyperparameter tuning or for someone browsing around.
We highly recommend the multi-binning strategy - so I would recommend that you at least do one run of that - you can combine the bins afterwards with DASTool etc.
Thanks for the heads up. What I'm doing is running it for each sample with the default parameters, and also with the multi-binning strategy. It should be done by the end of the weekend, and then I can run DASTool on the end results.
Do you happen to have a Linux or OSX conda recipe or YAML file for the GPU-enabled environment? I'm having some issues getting the packages installed together with their dependencies.
Unfortunately there is an issue with GPU-enabled PyTorch from conda, so we do not have a recipe that works. I can basically only get it working with GPU by using pip. We haven't run GPU-enabled on OSX. If you can be more specific, I can try to guide you if you still have issues.
`pip` packages work in conda environment exports, so I think that can still work with `conda env export -n vamb-gpu_env -f path/to/vamb-gpu_env.yml` (assuming the name of the environment is `vamb-gpu_env`).
Also, thanks for the tip on using the multi-binning method. I ended up using a `__` separator and it made everything very easy to split out later for `DASTool`.
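The split itself is simple once the separator is fixed. Below is a minimal sketch, assuming vamb's `clusters.tsv` gives (cluster, contig) pairs and that each contig name carries its sample prefix before the `__` separator; the sample and contig names in the example are fabricated.

```python
from collections import defaultdict

def split_clusters_by_sample(cluster_rows, sep="__"):
    """Group (cluster, contig) pairs by the sample prefix encoded
    in each contig name via the separator."""
    per_sample = defaultdict(list)
    for cluster, contig in cluster_rows:
        sample, _, _ = contig.partition(sep)
        per_sample[sample].append((cluster, contig))
    return dict(per_sample)

# Fabricated example rows: (cluster name, contig name)
rows = [
    ("cluster_1", "sampleA__contig_7"),
    ("cluster_1", "sampleB__contig_2"),
    ("cluster_2", "sampleA__contig_9"),
]
by_sample = split_clusters_by_sample(rows)
for sample in sorted(by_sample):
    print(sample, len(by_sample[sample]))
```

Writing each per-sample group to its own contig-to-bin table then gives DASTool one input per sample.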
For my 88 oral metagenomes, I ran `metabat2`, `maxbin2`, and `vamb` for each sample individually, and then `vamb` multi-binning. The `vamb` multi-binning performed very well. The resulting numbers of bins after running `DASTool` on everything:
- VAMB with multi-binning: 181 bins
- VAMB per sample: 67 bins
- MaxBin2: 273 bins
- MetaBat2: 311 bins
What I found from other datasets is that VAMB is able to pick up "harder to bin" genomes that can't be detected with MetaBat2 and MaxBin2. It's honestly a great tool to have in the repertoire. I remember seeing the bioRxiv preprint a few years back and thinking "awesome, somebody is finally implementing a VAE for metagenomic binning", as I'd seen VAE applications on MNIST/Fashion-MNIST while following deep learning research, but had neither the expertise (nor funding or time) to implement it myself. It's great to actually get it going in my pipelines after all these years of seeing the approach in action in other fields.
Regarding conda: `vamb` works fine when installed from `pip`; the problem is getting `pytorch` compiled with GPU support - it is not available through conda. If you know how to fix this, please let us know :)
If you do not have GPU support and still want to do multi-binning, I think it is safe to reduce the number of epochs to ~300 (I think @jakobnissen has changed the default to that now).
Thanks for the kind words, and it's great to see that VAMB can contribute to your results by picking up bins that are more difficult for MaxBin and MetaBat!
I noticed the bit in the documentation about suggesting multiple hyperparameters: https://github.com/RasmussenLab/vamb#parameter-optimisation-optional
Is there a grid you recommend we use?
I was using the following grid:
Is this the suggested way to use the `-n` parameter (i.e. have it twice)? I'm assuming this is the shape of the layers. Do you recommend a range of `-l` values and `-n` values we should try? Also, a side-note question: can we use the `jgi_summarize_bam_contig_depths` output as-is with defaults, or do we need to trim off specific columns or use specific parameters?