edgraham / BinSanity

Unsupervised Clustering of Environmental Microbial Assemblies Using Coverage and Affinity Propagation
GNU General Public License v3.0

Issue with low_completion.fna and 4mer #46

Open michoug opened 4 years ago

michoug commented 4 years ago

Hi, when running the "lc" part of your software, I got this error:

                ____________________________________________________
                        Calculating 4mer frequencies for 
                        redundant bin low_completion.fna
                ____________________________________________________
          kmer frequency calculated in 15164.698575735092 seconds

                ____________________________________________________
                        Creating Profile for 
                        redundant bin low_completion.fna
                    ____________________________________________________
           Combined profile created in 167.75693249702454 seconds

                ____________________________________________________
                    Reclustering redundant bin low_completion.fna
                ____________________________________________________
          Preference: -25
          Maximum Iterations: 4000
          Convergence Iterations: 400
          Contig Cut-Off: 1000
          Damping Factor: 0.95
          Coverage File: HC_HiSeq_BinSaniy_cov.cov.x100.lognorm
          Fasta File: low_completion.fna
          Kmer: 4
          (300263, 266)
BinSanity failed when refining you genomes :/. The Bin that it failed at was the following bin: low_completion.fna

Any ideas? Can I use the bins in the REFINED-BINS folder?

edgraham commented 4 years ago

Hello,

This error is ultimately related to memory. In past runs, by the time I hit around 300,000 contigs I was using roughly 600 GB of RAM. The first thing I usually advise for people running into this kind of memory issue is to reconsider the contig cut-off. While I have tested BinSanity down to 1000 bp contigs, I find that a cut-off of ~2000 bp typically speeds up the run significantly without reducing bin quality. Contigs below 2000 bp can certainly be useful, but they also tend to have more variable coverage profiles and composition metrics that may not line up with the actual source genome; when I include contigs that small, most end up unbinned or I end up doing quite a lot of manual genome refinement in anvi'o to confirm contig assignments. Increasing your cut-off to 2000 bp would be the quickest way to speed up the run and reduce complexity.
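If you want to see how much a 2000 bp cut-off would shrink the problem before committing to a re-run, a quick count is enough. This is just a minimal sketch using Biopython's SeqIO (swap in any FASTA parser you prefer); point it at your assembly or at 'low_completion.fna':

from Bio import SeqIO

MIN_LEN = 2000  # proposed contig cut-off in bp

kept, dropped = 0, 0
for record in SeqIO.parse("low_completion.fna", "fasta"):
    if len(record.seq) >= MIN_LEN:
        kept += 1
    else:
        dropped += 1

print(f"contigs >= {MIN_LEN} bp: {kept}")
print(f"contigs <  {MIN_LEN} bp: {dropped}")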

Hopefully this means it refined all of the genomes except that last 'low_completion.fna' file. The genomes in 'REFINED-BINS', 'high_completion', and 'strain_redundancy' are where we would ultimately be pulling our final set of genomes from. Having said that, losing those low_completion genomes would be a bummer, so if you don't want to raise your contig cut-off you can use the workaround below, with the caveat that when I have done this in the past I have sacrificed some bin quality, so plan on assessing those bins closely.
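If you go that route, the first step is just gathering the finished bins into one place. Here is a minimal sketch of that step, assuming the bins are written as .fna files and the three output directories sit in your working directory (adjust paths and extensions to match your run):

import glob
import os
import shutil

# Directories holding the genomes that are already finished processing
FINISHED_DIRS = ["REFINED-BINS", "high_completion", "strain_redundancy"]
DEST = "Final-Genomes"

os.makedirs(DEST, exist_ok=True)

for directory in FINISHED_DIRS:
    # Assuming the bins end in .fna; change the pattern if yours differ
    for bin_path in glob.glob(os.path.join(directory, "*.fna")):
        shutil.copy(bin_path, DEST)
        print(f"copied {bin_path} -> {DEST}")

I copy rather than move here so nothing is lost if you want to go back and re-check the originals.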

From what you have told me, it sounds like 'Binsanity-lc' finished refining all of the 'high_redundancy' genomes but failed when it hit that last group of contigs in the 'low_completion.fna' fasta file. So first take all of the current genomes from 'REFINED-BINS', 'high_completion', and 'strain_redundancy' and move them to a directory called 'Final-Genomes' (the snippet above is one way to script that), since those are finished processing. Then take the 'low_completion.fna' fasta file and run it through 'Binsanity-lc' on its own with the same parameters; hopefully it will complete with just that subset. If it doesn't, another thing you could try is relaxing the requirements for refinement in the source code. To do this, find this function in the code:

def checkm_analysis(file_, fasta, path, prefix):
    df = pd.read_csv(file_, sep="\t")
    # Bins with Completeness >= 40 (the first number to change) and Contamination <= 2
    # join the two stricter tiers in the high-completion set.
    highCompletion = list(set(list(df.loc[(df['Completeness'] >= 95) & (df['Contamination'] <= 10), 'Bin Id'].values) + list(df.loc[(df['Completeness'] >= 80) & (
        df['Contamination'] <= 5), 'Bin Id'].values)+list(df.loc[(df['Completeness'] >= 40) & (df['Contamination'] <= 2), 'Bin Id'].values)))
    # Bins with Completeness <= 40 (the second number to change) and Contamination <= 2
    # are classed as low-completion.
    lowCompletion = list(set(df.loc[(df['Completeness'] <= 40) & (
        df['Contamination'] <= 2), 'Bin Id'].values))
    strainRedundancy = list(set(df.loc[(df['Completeness'] >= 50) & (
        df['Contamination'] >= 10) & (df['Strain heterogeneity'] >= 90), 'Bin Id'].values))

Change the two 40s flagged in the comments above. We can play around with this more if necessary, but it is a good starting place.
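If you want a feel for what moving those numbers does before editing the source, you can read the same CheckM table that checkm_analysis parses and count bins on either side of the cut-off. A minimal sketch, assuming the table has the 'Completeness' and 'Contamination' columns used above (the filename is just a placeholder; point it at your actual CheckM output) and looking only at the low-contamination tier that the two 40s control:

import pandas as pd

# Placeholder name for the tab-delimited CheckM table that checkm_analysis reads
df = pd.read_csv("checkm_summary.txt", sep="\t")

def count_low_contamination_tier(completeness_cutoff):
    """Count bins in the Contamination <= 2 tier on either side of a completeness cut-off."""
    kept = df[(df["Completeness"] >= completeness_cutoff) & (df["Contamination"] <= 2)]
    sent_back = df[(df["Completeness"] <= completeness_cutoff) & (df["Contamination"] <= 2)]
    return len(kept), len(sent_back)

for cutoff in (40, 30, 20):
    kept_n, back_n = count_low_contamination_tier(cutoff)
    print(f"cut-off {cutoff}: {kept_n} bins would count as high-completion, "
          f"{back_n} would stay in low_completion")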

Let me know how it goes.

-Elaina