claczny / VizBin

Repository of our application for human-augmented binning

"Minimal contig length" not working #40

Closed evensannesriiser closed 6 years ago

evensannesriiser commented 6 years ago

Dear Cedric,

I'm really excited to try VizBin, but I run into the following problem when performing the binning: The program fails to use the "Minimal contig length" option! This is very frustrating.

The "contigs.fa" file I'm using contains sequences of different lengths, from < 300 to > 200,000.

In the .log-file (see attachment), I get the following:

2017-11-22 16:51:31,001 WARN [AWT-EventQueue-0] (MainFrame.java:893) - Invalid minimal contig length value: 2 000. Using value: 1000

I can of course extract sequences above a certain length myself, but that would be a very slow process compared to specifying the length as a parameter, especially since I need to optimize the clustering by trying out different sequence lengths.

Do you have any suggestions for how to fix this?

Kind regards,

Even Sannes Riiser, PhD candidate, University of Oslo, Norway

log.txt

claczny commented 6 years ago

Dear Even,

thank you very much for your interest in VizBin 👍

From the log quote, it looks to me as if you have entered "2 000" as the minimal length, i.e., there seems to be an extra space between the "2" and the first "0". This is, however, not supported, so please enter your minimal contig length without it, e.g., "2000" or "5000" (but without the double quotes here :)

Please let me know if this fixes the issue.

Best,

Cedric

P.S. Independent of the current issue, if you are experimenting with different sequence lengths, my suggestion would be to create your input in a dedicated process, i.e., have a script that filters your contigs.fa for a set of sequence lengths. If you automate (and check) this step, the source of human error is reduced, at least to some extent ;)
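
A minimal sketch of such a filtering script, assuming a plain FASTA input. The function names and the output file naming (`contigs.min2000.fa` and so on) are made up for illustration and are not part of VizBin:

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(src: TextIO, dst: TextIO, min_length: int) -> int:
    """Write records with at least min_length bases to dst; return how many were kept.

    Each kept sequence is written on a single line (no line wrapping).
    """
    kept = 0
    for header, seq in read_fasta(src):
        if len(seq) >= min_length:
            dst.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


def write_filtered_sets(path: str, cutoffs: Iterable[int]) -> None:
    """Write one filtered copy of `path` per cutoff (hypothetical naming scheme)."""
    stem = path.rsplit(".", 1)[0]
    for cutoff in cutoffs:
        with open(path) as src, open(f"{stem}.min{cutoff}.fa", "w") as dst:
            kept = filter_by_length(src, dst, cutoff)
        print(f"cutoff {cutoff}: kept {kept} sequences")
```

Calling `write_filtered_sets("contigs.fa", [1000, 2000, 5000])` would then produce one input file per cutoff, each of which can be loaded with the default minimal contig length.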

P.P.S. If you are interested, you might want to give BusyBee Web a try (https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkx348) @ https://ccb-microbe.cs.uni-saarland.de/busybee/. It implements automated clustering, draft taxonomic assignment, annotation of antibiotic resistance genes, and bin quality check; and all the user has to do is upload the data :) However, if you are in a highly explorative phase, VizBin gives you much more flexibility. So, the choice is yours ;)

evensannesriiser commented 6 years ago

Hi Cedric,

"1000" is default, but when I type in "2000", it changes to "2 000" as soon as I move the cursor to another field! Not sure why, but it insists on using a thousands separator, which apparently messes up the "minimal contig length" parameter. Any other suggestions?
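
One plausible reading of the log line is that the field text "2 000" reaches an integer parser that rejects the grouping separator, so the default is used instead. The thread does not confirm this, and VizBin itself is written in Java, so the following Python snippet only illustrates the idea with a hypothetical helper that strips whitespace-like separators before parsing:

```python
def parse_min_contig_length(text: str, default: int = 1000) -> int:
    """Parse a length field, tolerating whitespace used as a thousands separator.

    Hypothetical helper, not VizBin code. Falls back to `default` on invalid
    input, mirroring the behaviour described in the log message.
    """
    # str.isspace() is true for the plain space as well as U+00A0 and U+202F,
    # which some locales use for digit grouping.
    cleaned = "".join(ch for ch in text if not ch.isspace())
    try:
        value = int(cleaned)
    except ValueError:
        return default
    return value if value > 0 else default


# A strict parser, by contrast, rejects the grouped form:
# int("2 000") raises ValueError.
```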

I agree that it will be a good idea to make a script automating the sequence-by-length extraction.

"BusyBee Web" was totally new to me! I am a little confused about the basic differences between BusyBee and VizBin. May I ask you to briefly enlighten me on this? :) In what way is VizBin more flexible?

Best,

Even

claczny commented 6 years ago

This kind of issue has not come up before. Would it be possible to share the contigs.fa with me so I could have a look? If you cannot share the entire file, maybe you could create a "minimal failing example"?

In VizBin, the user manually defines cluster boundaries with the mouse cursor, while BusyBee Web defines the boundaries automatically. Moreover, as a web service, BusyBee Web had to impose an upper file-size limit, while VizBin has no such explicit restriction. On the other hand, BusyBee Web automatically computes several annotations that the user has to provide explicitly to VizBin. Furthermore, the concept of "bootstrapped supervised binning" in BusyBee Web can be used to accelerate the binning process, i.e., clusters are defined on sequences with length greater than or equal to a threshold t_c, and a model is learned to bin all sequences.
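
To make the "bootstrapped" idea concrete, here is a toy sketch: cluster only the long sequences, learn a model from them, then assign every sequence with that model. It is not BusyBee Web's implementation. The nearest-centroid model, the feature vectors, and the `cluster_long` callback are all stand-ins chosen for illustration:

```python
from collections import defaultdict
from math import dist
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(features: Sequence[Vector], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the feature vectors of each cluster (the 'model' in this toy)."""
    grouped = defaultdict(list)
    for vec, label in zip(features, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids: Dict[str, List[float]], vec: Vector) -> str:
    """Assign a vector to the cluster with the nearest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


def bootstrap_bin(
    features: Sequence[Vector],
    lengths: Sequence[int],
    cluster_long: Callable[[List[Vector]], List[str]],
    t_c: int,
) -> List[str]:
    """Cluster sequences with length >= t_c, then label all sequences.

    `cluster_long` is any clustering routine that returns one label per vector.
    """
    long_vecs = [vec for vec, n in zip(features, lengths) if n >= t_c]
    centroids = fit_centroids(long_vecs, cluster_long(long_vecs))
    return [predict(centroids, vec) for vec in features]
```

The point of the design is that the expensive clustering step only sees the long sequences, while the cheap prediction step handles everything else.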

Best,

Cedric

evensannesriiser commented 6 years ago

Interesting! Thanks.. :) I have attached a subsampled version of my contigs file, which still gives plots of all sequences longer than 1000 bp, but not when I specify e.g. 2000 bp. PS: I had to change the suffix of the fasta file to .txt to be able to upload it.

How does that file work for you?

Even

contig_sub_5pct.txt

claczny commented 6 years ago

You're welcome :)

Your file works nicely for me, although I do not see much structure in it, which could simply be due to the subsampling.

[Screenshot, 2017-11-23 at 10:39:34]

[EDIT]

I ran the devel version from https://github.com/claczny/VizBin/blob/devel/VizBin-dist.jar?raw=true

[EDIT2] Running the master version worked fine for me too.

[Screenshot, 2017-11-23 at 10:44:56]

claczny commented 6 years ago

Are you using VizBin via the GUI or via CLI (command line)?

evensannesriiser commented 6 years ago

Hmm, that is really strange... I tried with the version you used, and still get the same issue. I'm running this on a MacBook Pro, macOS Sierra. Maybe it's a Java issue. My Java version is Version 8 Update 151 (build 1.8.0_151-b12).

I do see some more clustering using more contigs, so I think it's related to the subsampling.

Well, guess I'll have to do this the manual way, then. My plan is to color the contigs by bin ID from MaxBin2, as I presume this will provide me with some additional information. As I understand it, the best way is to isolate the sequences from every individual cluster using the polygon tool, and proceed with additional bin refinement after this (using e.g. CheckM). Does that sound sensible?

In addition to the isolated clusters, I also get a large amount of points/contigs spread "all over". These were defined as individual bins in MaxBin2, so I'm a bit uncertain about what to do with them. Maybe it's better to be conservative, and just discard them.

One more thing: Within the points spread "all over", I see a small, tight cluster of points assigned to one bin in MaxBin2, as they all have the same color (red, see attached image). Is there a way to isolate the sequences represented by these points, without also choosing the (green) points "above them"? (This would be easy if one could select/deselect points by clicking on them..)

Sorry for the many questions, but it would be good to make sure I do things right before proceeding.. :)

Thank you!

Even

[Image: bin]

evensannesriiser commented 6 years ago

I use the GUI version. Where do I find the different versions ("devel" and "master")?

claczny commented 6 years ago

Good point, this might be due to updates in Java, indeed. I have Version 8 Update 144 (build 1.8.0_144-b01), which is quite old ;)

> As I understand it, the best way is to isolate the sequences from every individual cluster using the polygon tool, and proceed with additional bin refinement after this (using e.g. CheckM). Does that sound sensible?

Not sure if I understand you right here. What I would probably do in your situation is to a) visualize it all together with the MaxBin2 info, like you already did; and b) visualize individual MaxBin2 bins to see if there are any MaxBin2 bins that remain mixtures, i.e., where you would see two or more bins in the VizBin plots. As such, point b) represents a kind of refinement approach (maybe followed by further refinement using CheckM or based on the coverage distribution within a bin, i.e., discarding/flagging contigs with an exceptionally high or low coverage).
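
As a rough illustration of the coverage-based flagging, the sketch below marks contigs whose coverage lies far from the bin's median, measured in units of the median absolute deviation (MAD). The MAD criterion, the threshold of 3, and the function name are assumptions for this example, not an established procedure from VizBin or CheckM:

```python
from statistics import median
from typing import Dict, List


def flag_coverage_outliers(coverage: Dict[str, float], max_deviations: float = 3.0) -> List[str]:
    """Return IDs of contigs whose coverage is unusually high or low for the bin.

    `coverage` maps contig ID to mean coverage for the contigs of ONE bin.
    """
    if not coverage:
        return []
    center = median(coverage.values())
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(value - center) for value in coverage.values())
    if mad == 0:
        # No measurable spread, so there is nothing to compare against.
        return []
    return [
        contig_id
        for contig_id, value in coverage.items()
        if abs(value - center) / mad > max_deviations
    ]
```

Flagged contigs are candidates for a closer look rather than automatic removal, since a multi-copy element can legitimately show higher coverage than the rest of its genome.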

> I also get a large amount of points/contigs spread "all over"

This is a non-trivial issue that I haven't had the time to dig deep into. My suspicion (not very scientific, I know) is that these are sequences that have somewhat similar genomic signatures (likely from somewhat closely related taxa) as well as "singleton" sequences (might be misassembled contigs, might be contigs from extremely rare taxa, etc.). Tools such as MaxBin2 or MetaBAT use an additional source of information, that is, sequence abundance. This is expected to help to further resolve these sequences. Hence, using complementary approaches seems like a good idea here ;)

> Is there a way to isolate the sequences represented by these points, without also choosing the (green) points "above them"?

I do not get what you mean here. Since you already have the binning by MaxBin2 for these points, why do you want to select them again, but only the "red" ones?

> Sorry for the many questions, but it would be good to make sure I do things right before proceeding.. :)

Sure, no problem. What I think is important to keep in mind in general with all those binning methods is that they can only be so good, and none is perfect. In the end, I think that there is a certain limit imposed by the data that is fed in, and high-quality genomes - and I mean really high quality here :) - will require new approaches following the binning, e.g., culturing of isolates guided by the information extracted from the bins ;)

Hope this helps.

claczny commented 6 years ago

> Where do I find the different versions ("devel" and "master")?

These are different branches of the repository.

evensannesriiser commented 6 years ago

I found the devel version, and still had the same problem..

Thank you for taking the time to give me such thorough feedback! I'll try binning with a longer sequence cutoff (the manual way), and maybe check out BusyBee as well.

Have a nice day!

Even :)