RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

Recommended parameters for running VAMB for tuning? #72

Closed: jolespin closed this issue 3 years ago

jolespin commented 3 years ago

I noticed the bit in the documentation about suggesting multiple hyperparameters: https://github.com/RasmussenLab/vamb#parameter-optimisation-optional

Is there a grid you recommend we use?

I was using the following grid:

for N in 128 512 1024; do
    echo $N
    for L in 24 32 48; do
        echo $L
        OUT=883/vamb_N${N}-L${L}_output
        NAME=job_883_vamb_N${N}-L${L}
        qsub -cwd -P 0594 -N ${NAME} -j y -o ${NAME}.o "source activate binning_env && vamb --outdir $OUT --fasta $FASTA --jgi $COV --minfasta 200000 -n $N $N -l $L"
    done
done

Is this the suggested way to use the -n parameter (i.e. pass it twice)? I'm assuming this is the shape of the hidden layers.

Do you recommend a range of -l values and -n values we should try?

Also, a side-note question: can we use the jgi_summarize_bam_contig_depths output as-is with defaults, or do we need to trim off specific columns or use specific parameters?

simonrasmu commented 3 years ago

For running with different hyperparameters you can basically try what you want and see what works for you. We suggest using a slightly smaller and a slightly larger network, and that is a good starting point. Feel free to try other combinations - we have seen that, depending on the dataset, some hyperparameters may work a bit better than others.

For the -n parameter you should use it as the sizes of the hidden layers, i.e. -n 512 512 will give you two hidden layers of 512 neurons each. You can try any combination of neurons per layer and number of layers.
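For example (a sketch only; $FASTA and $COV are placeholders for the contig catalogue and the depth file, and I'm assuming the defaults of -n 512 512 and -l 32 here):

    # slightly smaller network than the default
    vamb --outdir vamb_small --fasta $FASTA --jgi $COV --minfasta 200000 -n 256 256 -l 24

    # roughly the default-sized network
    vamb --outdir vamb_default --fasta $FASTA --jgi $COV --minfasta 200000 -n 512 512 -l 32

    # slightly larger network
    vamb --outdir vamb_large --fasta $FASTA --jgi $COV --minfasta 200000 -n 1024 1024 -l 48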

For jgi_summarize_bam_contig_depths you should be able to use the default output as input with --jgi. There were some issues with this in the past, so let us know if it does not work. If you look at the Snakemake file in the workflow folder you can see that we convert it to the RPKM format, where the first column is the contig name and the remaining columns are the abundances from each sample.
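This is not the exact command from the workflow, but a conversion along these lines illustrates the shape of that table, assuming the usual jgi column layout (contigName, contigLen, totalAvgDepth, then alternating per-sample depth and variance columns); the file names are just placeholders:

    # keep the contig name plus every per-sample depth column,
    # dropping contigLen, totalAvgDepth and the -var columns
    awk 'BEGIN{FS=OFS="\t"} {
        out = $1
        for (i = 4; i <= NF; i += 2)
            out = out OFS $i
        print out
    }' jgi_depths.tsv > abundances.tsv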

jolespin commented 3 years ago

Thanks, that makes a lot of sense with the -n parameter. I took a look at the code for --jgi and it appears to do all the work of removing the unneeded columns. Unfortunately I couldn't get the GPU installation working on our server, so the multi-binning (which I was very excited to use) was going to take too long with the hyperparameter tuning. Instead I'm running it per sample using a pretty extensive grid, and will run DASTool afterwards to get the best combinations. I'll post my results here in case they're useful for future hyperparameter tuning or for someone browsing around.

simonrasmu commented 3 years ago

We highly recommend the multi-binning strategy, so I would suggest that you at least do one run of that - you can combine the bins afterwards with DASTool etc.

jolespin commented 3 years ago

Thanks for the heads up. What I'm doing is running it for each sample using the default parameters, and also with the multi-binning strategy. It should be done by the end of the weekend and then I can run DASTool on the end results.

jolespin commented 3 years ago

Do you happen to have a Linux or OSX conda recipe or YAML file for the GPU-enabled environment? I'm having some issues getting the packages installed together with the dependencies.

simonrasmu commented 3 years ago

Unfortunately there is an issue with GPU-enabled PyTorch from conda, so we do not have one that works. I can basically only get it to work with GPU by using pip. We haven't run GPU-enabled on OSX. If you can be more specific about the errors I can try to guide you if you still have issues.
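A minimal sketch of the pip-based route inside a conda environment (the environment name, Python version and CUDA wheel choice are illustrative, not prescriptive):

    # create a bare conda environment and install everything else via pip
    conda create -n vamb-gpu_env python=3.8 -y
    conda activate vamb-gpu_env
    pip install torch    # pick the wheel matching your CUDA version from pytorch.org
    pip install vamb

    # then run with GPU acceleration enabled
    vamb --outdir vamb_out --fasta $FASTA --jgi $COV --minfasta 200000 --cuda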

jolespin commented 3 years ago

pip packages are included in conda environment exports, so I think that can still work with conda env export -n vamb-gpu_env -f path/to/vamb-gpu_env.yml (assuming the name of the environment is vamb-gpu_env).

Also, thanks for the tip on using the multi-binning method. I ended up using a __ separator and it made everything very easy to split out later for DASTool.
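Roughly what I did to build the catalogue, simplified (paths are placeholders, and -o sets the binsplit separator as far as I understand the CLI):

    # prefix each contig header with its sample name, joined by "__"
    for S in sample1 sample2 sample3; do
        sed "s/^>/>${S}__/" ${S}/assembly.fasta
    done > catalogue.fna

    # run the multi-binning and let vamb split the clusters back per sample
    vamb --outdir vamb_multi --fasta catalogue.fna --jgi $COV --minfasta 200000 -o __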

For my 88 oral metagenomes, I ran MetaBAT2, MaxBin2, and VAMB on each sample individually, and then VAMB with multi-binning. The VAMB multi-binning performed very well. The resulting numbers of bins after running DASTool on everything:

VAMB with multi-binning: 181 bins
VAMB per sample: 67 bins
MaxBin2: 273 bins
MetaBAT2: 311 bins

What I found from other datasets is that VAMB is able to pick up "harder to bin" genomes that can't be detected with MetaBAT2 and MaxBin2. It's honestly a great tool to have in the repertoire. I remember seeing the bioRxiv preprint a few years back and thinking "awesome, somebody is finally implementing a VAE for metagenomic binning" - I'd seen applications on MNIST/Fashion-MNIST while following deep learning research, but didn't have the expertise (nor funding or time) to implement it myself. It's great to actually get it going in my pipelines after all these years of seeing the approach in action in other fields.

simonrasmu commented 3 years ago

Regarding conda, VAMB works fine when installed from pip; the problem is getting PyTorch compiled with GPU support - it is not available through conda. If you know how to fix this please let us know :)

If you do not have GPU support and still want to do multi-binning, I think it is safe to reduce the number of epochs to ~300 (I think @jakobnissen made that the default now).
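For a CPU-only multi-binning run that could look roughly like this (a sketch; -e sets the number of training epochs, if I recall the CLI flag correctly):

    vamb --outdir vamb_multi --fasta catalogue.fna --jgi $COV --minfasta 200000 -e 300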

Thanks for the kind words, and great to see that VAMB can contribute to your results and pick up bins that are more difficult for MaxBin and MetaBAT!