question about the parameters -mbs -bc -bs --ae-dims --ae-epochs

wn835166087 commented 2 years ago

Out of curiosity, what does it mean if a cluster has fewer than 5000 reads? If I set -mbs to 1000, does that mean this bin is highly likely to be incomplete?

anuradhawick commented 2 years ago

This is a parameter in the clustering algorithm. You can set 1000. It does not affect the completeness of bins.

Its purpose is to filter out bins that are too small. For example, if a confident cluster is found to have fewer than 5000 reads, we remove it. You can lower this value to retain smaller clusters.
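As a rough illustration of the filtering described above, here is a minimal sketch. It is not LRBinner's actual code: the function name, the `clusters` dictionary and the `min_bin_size` argument are hypothetical stand-ins for whatever the tool uses internally.

```python
def drop_small_clusters(clusters, min_bin_size=5000):
    """Keep only clusters with at least `min_bin_size` reads.

    `clusters` is a hypothetical mapping of cluster id -> list of read ids.
    Clusters with fewer reads than the threshold are removed.
    """
    return {
        cluster_id: reads
        for cluster_id, reads in clusters.items()
        if len(reads) >= min_bin_size
    }


# Made-up cluster sizes, just to show the effect of the threshold.
clusters = {
    "c1": list(range(12000)),
    "c2": list(range(3200)),
    "c3": list(range(1500)),
}
kept_default = drop_small_clusters(clusters, min_bin_size=5000)
kept_lower = drop_small_clusters(clusters, min_bin_size=1000)
```

With the lower threshold, the small clusters survive the filter, which is why a smaller -mbs retains more bins without changing how any individual bin is built.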

Let me know if this helps.

If you need any help tuning parameters of LRBinner I’m happy to help too.

wn835166087 commented 2 years ago

Thank you, this is helpful. I tried -mbs 1000 and -mbs 3000 on one metagenomic sample. Not surprisingly, -mbs 1000 retains more bins. After assembly, the bins yielded by -mbs 1000 showed slightly lower contamination and slightly higher completeness. However, this is only one sample.

In addition, I'm still not sure about the other parameters. What do -bc and -bs mean? Does -bc mean that only bins with coverage higher than 10 will be retained? And what does -bs mean? I also have little idea about --ae-dims and --ae-epochs: what impact do they have on the binning process? Thank you in advance.

anuradhawick commented 2 years ago

Hi, thanks for these important questions. I am happy to help. Let me know if the following information is sufficient.

-bc and -bs are parameters for the coverage vectors. These are discussed in detail in our MetaBCC-LR paper.

Here is what happens: for each read, we take its k-mers and look up their counts in the entire dataset, then build a histogram of those counts. These histograms are the coverage vectors. They are built by grouping counts into fixed-width bins/buckets: -bc sets the number of bins/buckets and -bs sets the width of each one. The histogram can therefore host coverages from roughly 0X to (bs*bc)X. It is approximate because I put counts that exceed the histogram range into the last bucket, so we don't miss out on very high coverage reads. The histograms are normalized so they can be used in ML techniques, etc.

You can use the defaults or set custom values. For example, if you run a tool like DSK or Jellyfish, look at the k-mer distribution and see that the frequencies range between 0 and 500, you can roughly set -bc to 25 and -bs to 20. Use a -bs value greater than 10, because k-mer coverage isn't precise and needs some noise tolerance. These parameters only modify the input features. They are not parameters of the clustering algorithm or of bin size, although they understandably affect the latter.
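The histogram construction described above can be sketched in a few lines. This is a simplified illustration, not the seq2covvec or LRBinner implementation: the function names are hypothetical, k-mers are counted naively in memory, and details such as canonical k-mers are ignored.

```python
import math
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across the whole dataset (naive, in-memory)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_vector(read, counts, k, bin_size, bin_count):
    """Normalized histogram of dataset-wide k-mer counts for one read.

    bin_size plays the role of -bs and bin_count the role of -bc.
    Counts beyond bin_size * bin_count are placed in the last bucket.
    """
    hist = [0] * bin_count
    for i in range(len(read) - k + 1):
        count = counts[read[i:i + k]]
        bucket = min(count // bin_size, bin_count - 1)
        hist[bucket] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def suggest_bin_count(max_count, bin_size):
    """Number of buckets needed to cover counts from 0 to max_count."""
    return math.ceil(max_count / bin_size)


reads = ["ACGTACGTAC", "ACGTTTTTAC", "GGGGACGTAC"]  # toy reads, made up
counts = count_kmers(reads, k=4)
vector = coverage_vector(reads[0], counts, k=4, bin_size=2, bin_count=4)
```

For the example in the reply, `suggest_bin_count(500, 20)` gives 25, matching the suggested -bc 25 with -bs 20.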

This part of the pipeline is very important and I have published this as a separate tool you may find at https://github.com/anuradhawick/seq2covvec.

--ae-dims and --ae-epochs are neural network parameters for the auto-encoder.

--ae-dims determines the size of the latent dimensions in the auto-encoder. This is similar to reducing the dimensions of the data from the input size (coverage vector size + composition vector size) to this value, --ae-dims. I used --ae-dims 8 in the paper, but I have since observed that values of 4, 8 and even 16 can produce better results depending on the dataset size and the actual bins in the data.

--ae-epochs is the number of iterations for the neural network training (auto-encoder training). This is typically in the range 250-500. It is not a critical parameter unless you are testing the tool. Keep it at 250, or increase it to 500 if you have GPU facilities for faster ML.

I am happy to help. Let me know if the above information is sufficient.
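To make the roles of the two parameters concrete, here is a toy sketch of an auto-encoder written with NumPy only. It is a deliberately simplified linear model trained by plain gradient descent and is not the network LRBinner uses. The function name, learning rate and input data are all assumptions made for illustration; only the meaning of `ae_dims` (latent size) and `ae_epochs` (training iterations) follows the description above.

```python
import numpy as np


def train_linear_autoencoder(features, ae_dims=8, ae_epochs=250, lr=0.1, seed=0):
    """Train a toy linear auto-encoder and return (encoder, decoder, losses).

    features: array of shape (n_reads, input_size), where input_size stands
    for coverage vector size + composition vector size.
    """
    rng = np.random.default_rng(seed)
    n_inputs = features.shape[1]
    encoder = rng.normal(scale=0.1, size=(n_inputs, ae_dims))
    decoder = rng.normal(scale=0.1, size=(ae_dims, n_inputs))
    losses = []
    for _ in range(ae_epochs):
        latent = features @ encoder        # compress to ae_dims values
        reconstructed = latent @ decoder   # expand back to input size
        error = reconstructed - features
        losses.append(float(np.mean(error ** 2)))
        # Gradients of the mean squared reconstruction error.
        scale = 2.0 / error.size
        grad_decoder = scale * (latent.T @ error)
        grad_encoder = scale * (features.T @ (error @ decoder.T))
        encoder -= lr * grad_encoder
        decoder -= lr * grad_decoder
    return encoder, decoder, losses


# Made-up input: 200 reads, 32 coverage values + 32 composition values.
rng = np.random.default_rng(1)
features = rng.random((200, 64))
encoder, decoder, losses = train_linear_autoencoder(features, ae_dims=8, ae_epochs=250)
latent = features @ encoder  # shape (200, 8): one 8-value vector per read
```

Changing `ae_dims` changes how many values each read is compressed to before clustering, while `ae_epochs` only controls how long the training loop runs.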

Cheers

anuradhawick / LRBinner

question about the parameters -mbs -bc -bs --ae-dims --ae-epochs #8