anuradhawick / LRBinner

LRBinner is a long-read binning tool published in WABI 2021 proceedings and AMB.
https://doi.org/10.4230/LIPIcs.WABI.2021.11
GNU General Public License v2.0

KeyError with cluster_utils.py #1

Closed GeoMicroSoares closed 2 years ago

GeoMicroSoares commented 3 years ago

Hi there,

In running LRBinner under default settings on an ONT metagenomic assembly I'm getting the following error - any idea of what's going on there?

I'd expect that even if no bins were found in the assembly, LRBinner would produce normal output rather than a KeyError. This assembly isn't the best, but from other analyses I've run it seems there are at least OK bins in there.

Thanks in advance.

Traceback (most recent call last):
  File "/home/adannehl/LRBinner/LRBinner", line 273, in <module>
    main()
  File "/home/adannehl/LRBinner/LRBinner", line 258, in main
    output, iterations, min_cluster_size, binreads, reads_path)
  File "/home/adannehl/LRBinner/mbcclr_utils/cluster_utils.py", line 340, in perform_binning
    binout.write(f"{read_bin[r]}\n")
KeyError: 0

anuradhawick commented 3 years ago

Hi,

Thanks for the issue. I will look into this and get back to you soon. LRBinner is designed to bin long reads before assembly. Can I ask if you're using assembly contigs or reads?

I'll fix this issue in the meantime.
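The traceback above fails on the lookup `read_bin[r]` with `KeyError: 0`, which means key `0` was not present in the mapping at the time of writing. The following is a minimal sketch of how such a lookup can be guarded. It is not LRBinner's actual code or fix: only the names `read_bin` and `r` come from the traceback, while `write_bins`, `read_ids`, and the `"unbinned"` label are hypothetical.

```python
def write_bins(read_ids, read_bin, unbinned_label="unbinned"):
    """Return one output line per read, tolerating reads with no bin.

    read_bin maps a read index to a bin id. Reads missing from the
    mapping get unbinned_label instead of raising KeyError.
    (Hypothetical helper, not part of LRBinner.)
    """
    lines = []
    for r in read_ids:
        # dict.get returns the fallback when r is absent, whereas
        # read_bin[r] would raise KeyError(r).
        lines.append(f"{read_bin.get(r, unbinned_label)}\n")
    return lines


# Hypothetical data: read 0 was never assigned to a bin.
read_bin = {1: 3, 2: 3}
print(write_bins([0, 1, 2], read_bin))
```

Whether a silent fallback or an explicit error message is the better behaviour depends on why reads end up without a bin, which the thread does not establish.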

GeoMicroSoares commented 3 years ago

Whoops. It's contigs - I should have read the README more carefully. Re-running on the reads and will get back to you on whether the error remains. Thanks!

anuradhawick commented 3 years ago

No worries. I still need to fix this error.

If you don't mind, please let me know the outcome. I'll be happy to have your input as a student actively working to improve binning and post-binning metagenomics assemblies.

Best regards, Anuradha

GeoMicroSoares commented 3 years ago

Hi again @anuradhawick, my second run ended successfully so this must have really been due to me using contigs instead of reads. I am however getting very big bins back (392MB - 1.8GB big), which I wouldn't expect in principle, since these should be bacteria-dominated communities. Do you have any suggestions for optimizing LRBinner? This was run under truly default conditions just to test the software.

anuradhawick commented 3 years ago

Hi,

There are a few things you could do, such as changing LRBinner's parameters.

To confirm again, we expect the users to assemble the binned reads to get final results.

The basic output from LRBinner is bins of reads, which should then be assembled (or analysed with known references).

Let me know if this helps. Moreover, if the data is public, I could take a look as well.

GeoMicroSoares commented 3 years ago

Hi again. Thanks for this! Assembling bins with canu for now - would you recommend another assembler for this? Will check out results and get back to you on whether further tweaking of parameters is needed.

anuradhawick commented 3 years ago

Any assembler should work. I usually try wtdbg2 first (as it's fast for initial analyses) and use metaFlye for final metagenomic assemblies. Canu is fine too.

anuradhawick commented 3 years ago

Hi @GeoMicroSoares

We updated LRBinner to handle long-read assemblies. In case you're interested, please give it a try and let us know how it goes :) As I remember, you were trying to bin long-read assemblies before and faced a few issues. This is still in development but has worked quite well on small assemblies (~10-20 species). We are continuously improving this.

GeoMicroSoares commented 3 years ago

Hi @anuradhawick ,

Thanks so much for this - tried and it ran successfully, just had some weird results as seen below:

$ cat binned_contigs_checkm/stats.out
Bin Id  Marker lineage  # genomes  # markers  # marker sets  0   1   2   3   4   5+  Completeness  Contamination  Strain heterogeneity
Bin-0   root (UID1)     5656       56         24             0   0   6   17  1   32  100.00        363.89         40.86
Bin-1   root (UID1)     5656       56         24             0   0   0   0   0   56  100.00        4357.03        4.14
Bin-2   root (UID1)     5656       56         24             0   0   0   0   0   56  100.00        5041.53        3.36

Parameters used were the defaults:

$ LRBinner contigs -r merged_SM_L_sup_min2k.fastq -c SM_L_sup_flye2_9_assembly.fasta -o SM_L_sup_min2k_lrbinner_nv_assembly/ --cuda -sep

Maybe I could change something there to improve this?
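The CheckM table above can also be screened with a short script. This is a minimal sketch, assuming the columns are separated by tabs or by runs of two or more spaces, as they are in the pasted output. The helper names and the 10% contamination cut-off are illustrative assumptions, not part of CheckM or LRBinner.

```python
import re


def parse_checkm_table(text):
    """Parse whitespace-aligned CheckM-style output into a list of dicts.

    Fields are split on tabs or runs of 2+ spaces so that values such
    as "root (UID1)" (single internal space) stay in one field.
    """
    rows = [re.split(r"\s{2,}|\t", ln.strip())
            for ln in text.splitlines() if ln.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in body]


def flag_contaminated(records, max_contamination=10.0):
    """Return the ids of bins whose contamination exceeds the cut-off.

    The default cut-off of 10.0 is an assumption chosen for illustration.
    """
    return [r["Bin Id"] for r in records
            if float(r["Contamination"]) > max_contamination]


# Values copied from the stats.out shown in this thread.
stats = """\
Bin Id  Marker lineage  # genomes  # markers  # marker sets  0  1  2  3  4  5+  Completeness  Contamination  Strain heterogeneity
Bin-0   root (UID1)  5656  56  24  0  0  6  17  1  32  100.00  363.89  40.86
Bin-1   root (UID1)  5656  56  24  0  0  0  0  0  56  100.00  4357.03  4.14
Bin-2   root (UID1)  5656  56  24  0  0  0  0  0  56  100.00  5041.53  3.36
"""

records = parse_checkm_table(stats)
print(flag_contaminated(records))
```

For Bin-1 and Bin-2, all 56 markers fall in the `5+` column, meaning every marker was found five or more times in a single bin.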

anuradhawick commented 3 years ago

Thanks for trying it out; your feedback is very valuable. It looks like we are producing too few bins, which is why you see so much contamination. Could you let me know:

  1. assembly size in MB
  2. number of long contigs (1000 bp or more)
  3. read sizes and read count
  4. suspected species count
  5. kind of assembly (bacterial, fungal, etc.)

This will help us get an idea of the issues and is much appreciated. May I also ask for the data source, if it is publicly available? Since this is our first attempt at binning contigs, it seems we need a lot more adjusting. I will keep you posted too.
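The first two figures requested above (assembly size and number of contigs of 1000 bp or more) can be computed directly from the assembly FASTA. This is a minimal sketch in plain Python, assuming a standard FASTA file with `>` header lines. The function names are hypothetical and not part of LRBinner.

```python
def fasta_lengths(lines):
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            # Sequences may be wrapped over several lines; add them up.
            length += len(line)
    if length is not None:
        yield length


def assembly_summary(lines, min_len=1000):
    """Summarise contig count, total size in Mb, and contigs >= min_len."""
    lengths = list(fasta_lengths(lines))
    return {
        "contigs": len(lengths),
        "total_mb": sum(lengths) / 1e6,
        "long_contigs": sum(1 for n in lengths if n >= min_len),
    }


# Tiny in-memory example: one 1200 bp contig and one 500 bp contig
# whose sequence is wrapped over two lines.
example = [">contig_1", "ACGT" * 300, ">contig_2", "ACGT" * 100, "ACGT" * 25]
print(assembly_summary(example))
```

For a real assembly, an open file handle can be passed instead of the list, for example `assembly_summary(open("assembly.fasta"))`, where the file name is a placeholder.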

GeoMicroSoares commented 3 years ago

Hi again,

No worries, thanks so much for getting this tool going!

  1. 446Mb
  2. 21348
  3. Read size: min = 2,000 bp, max = 163,628 bp (avg. 7,234.5 bp). Read count: 2,327,181 reads
  4. Species count should be around 50 - 200...
  5. Should be a mix of bacteria and archaea.

The data isn't available yet, but if it becomes so in the meantime I will let you know! Thanks again!
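Read statistics like those in item 3 (minimum, maximum, and mean read length, plus read count) can be derived from a FASTQ file with a few lines of Python. This is a minimal sketch, assuming an unwrapped FASTQ with exactly four lines per record. The function names are hypothetical.

```python
def fastq_read_lengths(lines):
    """Yield read lengths, assuming exactly four lines per FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            yield len(line.strip())


def read_summary(lines):
    """Return count, min, max and mean read length.

    Raises ValueError on empty input, because min() and max() need
    at least one read.
    """
    lengths = list(fastq_read_lengths(lines))
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Tiny in-memory example with two reads of 8 bp and 4 bp.
example = ["@r1", "ACGTACGT", "+", "IIIIIIII",
           "@r2", "ACGT", "+", "IIII"]
print(read_summary(example))
```

The list comprehension holds every read length in memory, which is fine for a sketch but could be replaced by running totals for very large read sets.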