No scaffolding with arks-links

rimjhimroy commented 5 years ago

Hi,

I have produced a draft assembly of a ~1 Gbp plant genome with MaSuRCA from Illumina paired-end reads. I also have ~50X of 10x Genomics data, which I want to use to scaffold the draft genome.

I first used longranger basic (longranger v2.2.2) to produce the interleaved barcoded fastq.gz files, for which the stats are:

barcode_diversity,bc_on_whitelist,num_read_pairs
788519.937003,0.946642430348,164229013

The barcoded fastq.gz file given by longranger basic is in the following format:

@A00574:80:H7TYWDRXX:1:2102:10167:3098 BX:Z:AAACACCAGCTAACTC-1
GTGGGTGAGGCGATATAGGCGAGGGTTTTGGGTGGGTCAGACGGCCACACATACAGCTCATCCTTGGTGTTGCCACGGAGTAATCGTGCCCCCGTACTGAGATCCTTCACCTGAAACGATGTAGGGAAGAATTCAATCGATATGTTATTGTTGTTACAGAGGCGATAAACTGAGAGAAGGTTGCGATGTAGGTTAGGAACAACTAGCACATCATGTAAATGAATATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFF,F,FFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFF:FF::,FFFF,FFF:FFFFFFFFFFFFFFFFFF:FF,FF:FF:F
@A00574:80:H7TYWDRXX:1:2102:10167:3098 BX:Z:AAACACCAGCTAACTC-1
GCAAGCTACACCGTGGCAGCCGAGAGCACATCTTGCTTCTGGTCCTTCGCTTAATCCAACAAACTGGATATTGGACACATGGGCCACTCATCACTTGACAACAGACTTGAGTAATTTGGCGTTGCATCAACCATATACGAATGGCGACGAGGTTACTATAACTGATGGTACAGGTCTTGGGATCTCGTATACTGGTTCTGCCCTTCTCCCTACTCCCTCTGTTCATATTCATTTACATGATGTGCTAGTTG
+
F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:

I first used the default tigmint parameters and then the following parameters to run arks:

arks-make arks draft=mref.tigmint m=25-20000 z=1000 k=60 reads=lfq j=0.5 t=30 a=0.5

Here are some stats for (a) my original assembly, (b) the tigmint-split assembly, and (c) the final assembly from the arks-links pipeline:

n       n:100   n:N50  min    median mean   N50    max     sum
414463  414463  36099  300    877    2131   5201   535505  883.3e6 mref.fa
423326  422710  37849  100    881    2089   4948   535505  883.3e6 mref.tigmint.fa
423297  422681  37820  100    881    2089   4948   535505  883.3e6 mref.tigmint_c5_m25-20000_k60_r0.05_e30000_z1000_l5_a0.5.scaffolds.fa

I am not sure why I am not getting an improvement in scaffolding using my data.

Could you please let me know if I am doing something wrong and how I can improve the result?

Arks log file: run_arkslinks.txt

Thanks a lot,

Best, Rimjhim

lcoombe commented 5 years ago

Hi @rimjhimroy,

It looks like some joins were made in the assembly, but very few: the sequence count only drops from 423326 to 423297. I don't think this is the case, but double-check that you don't have an empty graph file (*original.gv).
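The empty-graph check can be scripted. The following is only a minimal sketch: it assumes the graph file is plain Graphviz DOT text in which each edge sits on its own line containing `--` or `->`, and the helper names `count_dot_edges` and `summarize_graphs` are hypothetical, not part of ARKS.

```python
from pathlib import Path


def count_dot_edges(path):
    """Count lines that look like DOT edges ('--' undirected, '->' directed)."""
    edges = 0
    with open(path) as handle:
        for line in handle:
            if "--" in line or "->" in line:
                edges += 1
    return edges


def summarize_graphs(directory="."):
    """Map each *original.gv file in `directory` to (size in bytes, edge-line count)."""
    return {
        p.name: (p.stat().st_size, count_dot_edges(p))
        for p in sorted(Path(directory).glob("*original.gv"))
    }


if __name__ == "__main__":
    # Assumption: the script is run from the arks-make output directory.
    for name, (size, edges) in summarize_graphs().items():
        print(f"{name}: {size} bytes, {edges} edge lines")
```

A file with zero bytes or zero edge lines would suggest that no barcode links passed the filters.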

I'd suggest sweeping on a few of the parameters. For example:

z=500,1000,3000
a=0.5,0.7,0.9
k=40,60,80,100
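A sweep over those values could be generated with something like the sketch below. It only prints command lines; the command template is copied from the original post, and the function name `sweep_commands` and its defaults are illustrative assumptions, not an ARKS feature.

```python
from itertools import product

# Values suggested in this thread
Z_VALUES = [500, 1000, 3000]
A_VALUES = [0.5, 0.7, 0.9]
K_VALUES = [40, 60, 80, 100]


def sweep_commands(draft="mref.tigmint", reads="lfq", m="25-20000", j=0.5, t=30):
    """Return one arks-make command line per (z, a, k) combination."""
    return [
        f"arks-make arks draft={draft} m={m} z={z} k={k} reads={reads} j={j} t={t} a={a}"
        for z, a, k in product(Z_VALUES, A_VALUES, K_VALUES)
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run or submitted.
    for cmd in sweep_commands():
        print(cmd)
```

This gives 3 x 3 x 4 = 36 runs. Comparing the N50 of each resulting scaffolds file against the tigmint baseline shows which combination helps, if any.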

You could also plot the distribution of the barcode multiplicity (the multiplicity file is an output of the arks-make pipeline) to double-check that your specified multiplicity range includes the bulk of your data.
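As a text-only alternative to a plot, the multiplicity check can be sketched as below. This assumes the multiplicity file is a CSV with one `barcode,multiplicity` row per barcode; check the real file layout first. The function names are hypothetical, and the 25-20000 bounds come from the `m=25-20000` setting in the original post.

```python
import csv
import sys
from collections import Counter


def load_multiplicities(path):
    """Read 'barcode,multiplicity' rows; rows without an integer 2nd field are skipped."""
    values = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) >= 2 and row[1].strip().isdigit():
                values.append(int(row[1]))
    return values


def fraction_in_range(values, low, high):
    """Fraction of barcodes whose multiplicity lies in [low, high]."""
    if not values:
        return 0.0
    return sum(low <= v <= high for v in values) / len(values)


def log10_histogram(values):
    """Count barcodes per decade: bin 0 is 1-9, bin 1 is 10-99, and so on."""
    return Counter(len(str(v)) - 1 for v in values if v > 0)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python multiplicity_check.py <multiplicity.csv>
    vals = load_multiplicities(sys.argv[1])
    print(f"barcodes: {len(vals)}")
    print(f"fraction with multiplicity in 25-20000: {fraction_in_range(vals, 25, 20000):.3f}")
    for decade, count in sorted(log10_histogram(vals).items()):
        print(f"10^{decade} to 10^{decade + 1} - 1: {count}")
```

If most barcodes fall outside the chosen range, adjusting `m` is probably worth trying before the other parameters.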

Hopefully one of those suggestions improves your resulting contiguity! If needed, you could also look at lowering c and l. I should note that ARKS works best with a more contiguous assembly, and your assembly isn't particularly contiguous right now (N50 of about 5 kbp). Have you considered assembling your Chromium data with Supernova? We've seen good results with running Supernova and then scaffolding with ARKS.

Hope that helps! Lauren

rimjhimroy commented 5 years ago

Hi Lauren

Thank you for your quick reply. I don't have an empty *original.gv file. Looking into the barcode multiplicity file, I find that most of my barcodes are in the range 5-1000.

I also have 350433 out of 1623642 barcodes (about 22%) with multiplicity = 2. My reads are 250 bp long on average. Does this sound concerning to you?

I am running Supernova on my data, but I am still waiting for it to finish after 20 days.

Thanks again, Rimjhim

lcoombe commented 5 years ago

Hi Rimjhim,

Good to know that the gv file isn't empty and your barcode multiplicity seems in the right ballpark.

It is expected that a large number of the reads will have barcodes with a low multiplicity. That's because the barcode is in read 1 (and is then clipped out by longranger basic), so it is possible to get base errors there. Those low-multiplicity barcodes are likely just due to such errors.

Your Chromium reads are 250 bp? Huh, I have only come across Chromium reads that are 2x150bp (128bp/150bp after longranger basic), and I was under the impression that this was standard for the 10x Genomics tools. Based on the 10x Genomics website, Supernova expects 2x150bp reads: https://support.10xgenomics.com/de-novo-assembly/sequencing/doc/specifications-sequencing-requirements-for-de-novo-assembly Was this a bespoke library construction process? Certainly 20 days is a very long time for Supernova. I'd expect a human-sized assembly to finish within a week at most.

Lauren

lcoombe commented 4 years ago

Closing this issue due to inactivity -- feel free to re-open if you still have questions.

bcgsc / arks

No scaffolding with arks-links #23