Very fragmented and highly redundant assembly

jaworskicoline commented 7 years ago

Hello, I have made several attempts using PBcR_v6 and the associated mhap_v6.spec spec file. I even tried not using the modified MHAP algorithm, and using PBcR instead of PBcR_v6, but I always get the same problem. I am using D. melanogaster data as a test, and my assemblies all have roughly the same characteristics: sequence #: 457605; total length: 5942206709; max length: 34021; N50: 17463; N90: 8414. So I have a very low N50 for about 55X of PacBio data, but more importantly the total length comes out at >40 times the genome size. I suspect it might come from the options in the spec file, but I don't have a good sense of how to adjust them. Do you have any advice? Thank you very much. Best regards, Coline
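For readers who want to check figures like these, here is a minimal Python sketch of how an Nx statistic (N50, N90) and the redundancy ratio can be computed. It is not part of PBcR or kmer_validation; the function names `n_stat` and `redundancy` are hypothetical, and the genome size of roughly 140 Mb is an assumption made for illustration, not a value stated in this thread.

```python
def n_stat(lengths, fraction=0.5):
    """Return the Nx statistic for a list of contig lengths.

    Contigs are taken from longest to shortest; the result is the length
    of the contig at which the running sum first reaches `fraction` of
    the total assembly length (fraction=0.5 gives N50, 0.9 gives N90).
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("no contig lengths given")


def redundancy(total_length, genome_size):
    """Return how many times larger the assembly is than the genome."""
    return total_length / genome_size


# Total assembly length as reported in the comment above.
TOTAL_LENGTH = 5_942_206_709
# ASSUMPTION: approximate D. melanogaster genome size, used only for illustration.
ASSUMED_GENOME_SIZE = 140_000_000

fold = redundancy(TOTAL_LENGTH, ASSUMED_GENOME_SIZE)
print(f"assembly is about {fold:.1f}x the assumed genome size")
```

With a real assembly, `n_stat` would be fed the contig lengths read from the FASTA output; a ratio far above 1 suggests that many contigs are redundant copies of the same regions.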

bernardo1963 commented 7 years ago

Hi Coline, I have two suggestions: 1) Use the 95x Drosophila melanogaster dataset that I used in the paper. If you get an assembly similar to mine, then your PBcR installation is OK and you should look at your dataset. I am not particularly surprised by the total length you reported: your assembly is not OK (very fragmented for PacBio), so many contigs are probably garbage coming from poorly corrected reads, etc. 2) I am guessing that you produced the 55x dataset. As far as I know, PBcR requires higher coverage. I would suggest switching to Canu, which I think works OK with 60x. We are porting the k-mer validation to Canu, but it is not ready yet (it should be soon).

best, Bernardo

A. Bernardo Carvalho

Departamento de Genética Universidade Federal do Rio de Janeiro


jaworskicoline commented 7 years ago

Hello, thank you for your answer. Yes, I am using ~50x for D. melanogaster because I wanted to compare it with another method that performs well on lower-coverage PacBio data. From what you say, I guess I have to abandon it, at least until there is a version that works with Canu. Thank you very much. Best, Coline

bernardo1963 / kmer_validation

Very fragmented and highly redundant assembly #2