Hi, I had some problems running the last step, binny. Is it because the dataset is too big? I have 30 million sequences.
The error message is shown below:
Thanks in advance!
Hi,
are those 30 million contigs actually all used by binny, i.e. is max_n_contigs in your config set to something like 30e6?
I don't remember having seen that error before; it comes from openTSNE. Could you also post the binny log at path/to/your/output/logs/binning_binny.log
so we can check what went on before the crash?
This is my binny log:
(base) [sc56656@ln25%bscc-a binny_out]$ cat binny.log
22/02/2022 09:27:29 AM - Starting Binny run for sample binny.
22/02/2022 09:43:37 AM - Looking for single contig bins.
22/02/2022 09:47:53 AM - Finished searching for single contig bins in 210s.
22/02/2022 09:47:56 AM - Found 0 single contig bins.
22/02/2022 09:48:56 AM - 26755842 contigs match length threshold of 1500bp or contain marker genes and have a size of at least 0bp
22/02/2022 09:54:00 AM - Created load balanced list in 3s.
22/02/2022 09:54:00 AM - k-mer sizes to count: 2, 3, 4.
22/02/2022 10:00:20 AM - Finished counting k-mer frequencies for size 2.
22/02/2022 10:09:49 AM - Finished counting k-mer frequencies for size 3.
22/02/2022 10:34:47 AM - Finished counting k-mer frequencies for size 4.
22/02/2022 10:35:00 AM - K-mer frequency matrix created in 2512s.
22/02/2022 11:05:41 AM - Running with 26755438 contigs. Filtered 0 contigs using a min contig size of 0 to stay below 30000000.0 contigs
22/02/2022 11:11:13 AM - Running manifold learning and dimension reduction.
22/02/2022 11:38:24 AM - PCA stats: Dimensions: 40; Amount of variation explained: 75%.
22/02/2022 11:38:44 AM - Running t-SNE dimensionality-reduction with perplexities 10 and 100.
(base) [sc56656@ln25%bscc-a binny_out]$
And this is my configuration:
mem:
  # If your HPC resource has a high memory capacity node you can set this to
  # TRUE and specify the amount of memory per core (e.g. if a node has 260 gb of
  # RAM and 10 cores it would be 26).
  big_mem_avail: TRUE
  big_mem_per_core_gb: 31
  # Memory per core of your computing resource.
  normal_mem_per_core_gb: 8
# Path to a temporary directory to write to.
tmp_dir: tmp
raws:
  # Path to an assembly fasta.
  assembly: "data/red.contigs.sorted.rnm.t1000.fasta"
  # Path to a bam file to calculate depth from (single sample mode only atm).
  # Leave empty if you have an average depth per contig file to supply to binny.
  metagenomics_alignment: ""
  # Path to an average depth per contig tsv file. Leave empty if you supply a
  # bam file for binny to calculate average contig depth from.
  contig_depth: "data/binny_depth.txt"
# Sample name
sample: "binny"
# Path to desired output dir binny should create and store results in.
outputdir: "binny_out"
# Path to binny dbs, leave at default if you performed the default install.
db_path: "database"
binning:
  binny:
    # Input a list, e.g. '2,3,4'.
    kmers: '2,3,4'
    # Minimum contig length.
    cutoff: 1000
    # Minimum length of contigs containing CheckM markers.
    cutoff_marker: 0
    # Maximum number of contigs binny uses. If the number of available
    # contigs after minimum size filtering exceeds this, binny will
    # increase the minimum size threshold until the maximum is reached.
    # Prevents use of excessive amounts of memory on large assemblies.
    # The default should ensure adequate performance; adjust e.g. according
    # to available memory.
    max_n_contigs: 3e7
    # Distance metric for openTSNE and HDBSCAN.
    distance_metric: 'manhattan'
    embedding:
      # Maximum number of binny iterations.
      max_iterations: 50
      # Number of iterations with perplexity: (n contigs) / 100.
      tsne_early_exag_iterations: 250
      # Number of iterations with perplexity 1.
      tsne_main_iterations: 750
    clustering:
      # Increasing the HDBSCAN cluster selection epsilon beyond 0.5
      # is not advised as it might massively increase run time, but it might
      # help recover fragmented genomes that would be missed with lower settings.
      hdbscan_epsilon: 0.25
      # Adapted from the HDBSCAN manual: 'Measure of how conservative the
      # clustering should be. With larger values, more points will be declared
      # as noise, and clusters will be restricted to progressively more dense areas.'
      hdbscan_min_samples: 2
      # Use depth as an additional dimension during the initial clustering.
      include_depth_initial: 'False'
      # Use depth as an additional dimension during the main clusterings.
      include_depth_main: 'True'
    bin_quality:
      # Minimum value binny will lower completeness to while running. It will
      # start at 90.
      completeness: 80
      # Minimum purity for bins to be selected.
      purity: 85
It might be that this kind of input is simply too large for openTSNE; I'll look into it.
In the meantime, unless you need all contigs to be included, I would recommend setting cutoff_marker
to e.g. 300-500 to hopefully eliminate a large number of small contigs. Or is the assembly already prefiltered to a certain contig length?
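If you want to check that quickly, here is a minimal plain-Python sketch (the FASTA path is taken from your config; the thresholds are just examples, not binny's defaults) that prints how many contigs would survive different minimum lengths:

# Count contigs per minimum-length threshold in an assembly FASTA.
# Illustrative only; adjust the path and thresholds to your setup.
def contig_lengths(fasta_path):
    """Yield the length of each contig in a (possibly multi-line) FASTA file."""
    length = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length

lengths = list(contig_lengths("data/red.contigs.sorted.rnm.t1000.fasta"))
for cutoff in (0, 300, 500, 1000, 1500, 2500):
    kept = sum(1 for length in lengths if length >= cutoff)
    print(f">= {cutoff:>5} bp: {kept:>10} contigs")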
Btw, your config says cutoff: 1000, but the log reports "...contigs match length threshold of 1500bp...". Are you certain you are running with this config file (although the rest seems to match)?
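For context, the max_n_contigs mechanism described in the config comment boils down to something like the following (a simplified sketch, not binny's actual code): the minimum contig size is raised stepwise until no more than max_n_contigs contigs remain. With max_n_contigs: 3e7 and ~26.8M contigs, nothing had to be filtered in your run ("Filtered 0 contigs using a min contig size of 0"), so every contig went into the embedding.

# Simplified, illustrative sketch of the max_n_contigs idea (not binny's code):
# raise the minimum contig size stepwise until at most max_n_contigs remain.
def min_size_for_max_n(lengths, max_n_contigs, start_min_size=0, step=50):
    """Return the smallest size threshold that keeps <= max_n_contigs contigs."""
    min_size = start_min_size
    kept = sum(1 for length in lengths if length >= min_size)
    while kept > max_n_contigs:
        min_size += step
        kept = sum(1 for length in lengths if length >= min_size)
    return min_size, kept

# Toy example with made-up lengths:
print(min_size_for_max_n([120, 800, 1500, 2500, 40000], max_n_contigs=3))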
Oh, I ran it twice, changing only the cutoff (1000 and 1500 bp), and both runs got this error. Could the tsne_early_exag_iterations parameter be the cause of the error? I will try again following your suggestion.
Ok. tsne_early_exag_iterations just controls the number of iterations for the first part of the embedding, but in your case the embedding seems to fail to run at all.
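In case it helps to see what that step looks like: with openTSNE, a multiscale embedding like the one in your log is typically built roughly as below. This is only a sketch, not binny's exact code; the perplexities 10/100 and the manhattan metric come from your log and config, the 250/750 iteration counts from your config, and the placeholder data, n_jobs and exaggeration value are illustrative.

import numpy as np
from openTSNE import TSNEEmbedding, affinity, initialization

# Placeholder for the 40-dimensional PCA matrix from the log (one row per contig).
x = np.random.rand(10_000, 40)

# Multiscale affinities with perplexities 10 and 100: the k-nearest-neighbour
# search over all contigs here is the part that gets very expensive for tens of
# millions of rows.
affinities = affinity.Multiscale(x, perplexities=[10, 100], metric="manhattan", n_jobs=8)
init = initialization.pca(x)
embedding = TSNEEmbedding(init, affinities, negative_gradient_method="fft", n_jobs=8)

# tsne_early_exag_iterations corresponds to the first, early-exaggeration phase...
embedding = embedding.optimize(n_iter=250, exaggeration=12, momentum=0.5)
# ...and tsne_main_iterations to the main optimization phase.
embedding = embedding.optimize(n_iter=750, momentum=0.8)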
Let us know if it worked!