bcgsc / arcs

🌈Scaffold genome sequence assemblies using linked or long read sequencing data
GNU General Public License v3.0

ARKS fails to create any links with a highly repetitive input #154

Closed skagawa2 closed 2 years ago

skagawa2 commented 2 years ago

Hello! Thank you for creating this tool.

Upon trying this tool with a repetitive insect assembly (lots of satellite repeats, among others; 71% masked via RepeatModeler + RepeatMasker), this tool fails to create any links between contigs (empty graph output).

For reference, the assembly was created using hifiasm and PacBio HiFi reads (~80X coverage) and we recently sequenced TellSeq reads (~28X coverage).

The .out file from RepeatMasker seems to indicate that most contigs contain almost 20kb of various repetitive sequences at both ends, and we believe that this is the cause of the lack of links being created. ARKS worked for the example data provided and for a less repetitive genome that we compared against (not an insect) so we came to this conclusion. I have been looking through a few of the issues here to see if the input data is formatted incorrectly, but soft-masking and text wrapping do not seem to change the output. I also tested versions 1.2.1 and 1.2.3 with no difference either.
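For anyone wanting to quantify the repeat content at contig ends rather than eyeballing the .out file, here is a minimal sketch. It assumes the standard whitespace-separated RepeatMasker .out columns (query name, begin, end and "(left)" in fields 5 to 8), and the function names, the toy records and the 20 kb default window are illustrative choices, not part of ARCS or RepeatMasker.

```python
from collections import defaultdict

def parse_out_line(line):
    """Return (contig, begin, end, contig_length), or None for header/blank lines."""
    fields = line.split()
    if len(fields) < 8 or not fields[0].isdigit():
        return None
    begin, end = int(fields[5]), int(fields[6])
    left = int(fields[7].strip("()"))  # bases remaining after the hit
    return fields[4], begin, end, end + left

def covered(intervals, lo, hi):
    """Count bases in [lo, hi] (1-based, inclusive) covered by any interval."""
    total, reach = 0, lo - 1
    for begin, end in sorted(intervals):
        begin, end = max(begin, reach + 1), min(end, hi)
        if begin <= end:
            total += end - begin + 1
            reach = end
    return total

def end_repeat_fractions(lines, window=20000):
    """Map contig -> (head fraction, tail fraction) of repeat-annotated bases."""
    hits, lengths = defaultdict(list), {}
    for line in lines:
        parsed = parse_out_line(line)
        if parsed:
            contig, begin, end, length = parsed
            hits[contig].append((begin, end))
            lengths[contig] = length
    result = {}
    for contig, intervals in hits.items():
        length = lengths[contig]
        w = min(window, length)  # short contigs: window is the whole contig
        head = covered(intervals, 1, w) / w
        tail = covered(intervals, length - w + 1, length) / w
        result[contig] = (head, tail)
    return result

# Made-up records for a 1000 bp contig, only to show the input shape.
example = [
    "  900 10.0 0.0 0.0 ctg1     1   200 (800) + SatA Satellite 1 200 (0) 1",
    "  850 12.0 0.0 0.0 ctg1   150   300 (700) + SatB Satellite 1 151 (0) 2",
    "  700 15.0 0.0 0.0 ctg1   901  1000   (0) C SatA Satellite (0) 100 1 3",
]
print(end_repeat_fractions(example, window=400))  # {'ctg1': (0.75, 0.25)}
```

Overlapping annotations are counted once, so the fractions stay between 0 and 1. On a real file the lines could be read with `open(path)` and passed in directly, since header lines are skipped by the parser.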

From issue #151, the dictionary printout of the number of barcodes filtered seems to be important, so this is the output from one of the runs:

{ "All_barcodes_unfiltered":11262572, "All_barcodes_filtered":1135684, "Scaffold_end_barcodes":369548, "Min_barcode_reads_threshold":50, "Max_barcode_reads_threshold":10000 }
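To put those counts in proportion, here is a small sketch in plain Python with the values copied from the dictionary above. It assumes that "All_barcodes_filtered" means barcodes retained after the read-count thresholds, and the function name `barcode_fractions` is only illustrative.

```python
import json

# Barcode summary copied from the ARKS run above.
summary_text = """
{ "All_barcodes_unfiltered":11262572, "All_barcodes_filtered":1135684,
  "Scaffold_end_barcodes":369548, "Min_barcode_reads_threshold":50,
  "Max_barcode_reads_threshold":10000 }
"""

def barcode_fractions(summary):
    """Return (filtered / unfiltered, scaffold-end / filtered) as fractions."""
    kept = summary["All_barcodes_filtered"] / summary["All_barcodes_unfiltered"]
    at_ends = summary["Scaffold_end_barcodes"] / summary["All_barcodes_filtered"]
    return kept, at_ends

summary = json.loads(summary_text)
kept, at_ends = barcode_fractions(summary)
print(f"barcodes within the read-count thresholds: {kept:.1%}")
print(f"of those, barcodes seen at scaffold ends:   {at_ends:.1%}")
```

Under that reading, about a tenth of the barcodes fall inside the 50 to 10000 read window, and about a third of those are associated with scaffold ends.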

Here is the full log from just ARKS: arks.log

Would you happen to know of any scaffolders that might work with our data or happen to know what parameters we should change to get this tool to work? Or is the repetitive content too much and these kinds of genomes not able to be scaffolded using TellSeq data?

lcoombe commented 2 years ago

Hi @skagawa2,

This is a really interesting use case for ARKS! That's good that you successfully ran the test example (so we know that the issue isn't your installation). Thanks for providing your full log - I noticed this section which was interesting:

Stored read pairs: 466614
Skipped invalid read pairs: 16643
Skipped unpaired reads: 0
Skipped reads pairs without a good contig: 96019295
Total valid kmers: 685921800
Number invalid kmers: 32107783
Number of kmers found in ContigKmap: 4148260177
Number of kmers recorded in Ktrack: 242035114
Number of kmers found in ContigKmap but duplicate: 3906225063
Number of reads passing jaccard threshold: 1386008
Number of reads failing jaccard threshold: 189697328

In particular, a relatively small number of read pairs were stored (466,614) compared to the number skipped for lacking a good contig (96,019,295).
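As a rough check of that proportion, the counters can be pulled out of the log text and compared. This is only a sketch: it assumes the three read-pair counters together account for all input pairs, and `parse_counters` is a hypothetical helper, not something ARCS provides.

```python
# Counters copied from the arks.log excerpt above.
log_excerpt = """\
Stored read pairs: 466614
Skipped invalid read pairs: 16643
Skipped unpaired reads: 0
Skipped reads pairs without a good contig: 96019295
Number of kmers found in ContigKmap: 4148260177
Number of kmers found in ContigKmap but duplicate: 3906225063
"""

def parse_counters(text):
    """Turn 'label: integer' lines into a dict; other lines are ignored."""
    counters = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counters[label.strip()] = int(value)
    return counters

c = parse_counters(log_excerpt)

# Assumption: these three counters partition the input read pairs.
pairs_seen = (c["Stored read pairs"]
              + c["Skipped invalid read pairs"]
              + c["Skipped reads pairs without a good contig"])
stored_fraction = c["Stored read pairs"] / pairs_seen
duplicate_fraction = (c["Number of kmers found in ContigKmap but duplicate"]
                      / c["Number of kmers found in ContigKmap"])

print(f"read pairs stored: {stored_fraction:.2%}")
print(f"ContigKmap hits flagged duplicate: {duplicate_fraction:.1%}")
```

With these numbers, under half a percent of read pairs were stored, and roughly 94% of the k-mer hits in ContigKmap were flagged as duplicates, which fits the picture of highly repetitive contig ends.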

One thing that you could try is using ARCS in the traditional (i.e. not ARKS) mode, which uses read alignments. I'm wondering if we'll be able to get better hits of barcodes to the contig ends (as they're so repetitive) using the alignments rather than the kmer mapping.

For your TELL-Seq data, do you have a sense of the expected molecule size and barcode multiplicity? I ask because you could also increase the head/tail lengths of the contigs (-e), but as it's at 30kb right now, I wouldn't expect much of a difference if the molecule sizes are mostly in that range.

Thank you for your interest in ARCS! Lauren

skagawa2 commented 2 years ago

Here is the size histogram of the molecules used for TellSeq, after BluePippin size selection:

[image: molecule size histogram]

Here is a histogram of the barcode multiplicity file:

[image: barcode multiplicity histogram]

ARCS also generated the same result: an empty original.gv. The log is attached here: arcs.log (in this run, I also tried changing -e).

lcoombe commented 2 years ago

Hi @skagawa2,

Thanks for the update. Given the lower coverage of the TELL-Seq data, another thing we could try is lowering the c parameter, which is the number of read pairs required for a barcode hit to a contig end. With lower linked-read coverage, there may be too few read pairs per barcode to register hits at the contig ends. You could also try reducing the lower value for m; it might help to boost your signal a bit.

For c, I'd even try setting it as low as 2 as a test. That's quite low, so less specific for the barcode mappings, but it would at least help us see whether that tweak can achieve a non-empty graph file.
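One way to organise such a test is to enumerate a small grid of c and lower-m values and rerun with each. The sketch below only builds the option strings. The flag spellings `-c` and `-m` and the `min-max` form for m are assumptions to confirm against the tool's help output, the upper bound of 10000 is the Max_barcode_reads_threshold reported earlier in this thread, and `sweep_settings` is a hypothetical helper.

```python
from itertools import product

def sweep_settings(c_values, m_lower_values, m_upper=10000):
    """Yield (label, option list) for each combination of c and lower m bound."""
    for c, m_low in product(c_values, m_lower_values):
        if c < 1 or not 0 < m_low < m_upper:
            raise ValueError(f"unusable setting: c={c}, m={m_low}-{m_upper}")
        label = f"c{c}_m{m_low}-{m_upper}"
        # Flag spellings are an assumption; confirm with the tool's help text.
        yield label, ["-c", str(c), "-m", f"{m_low}-{m_upper}"]

# c=2 comes from the suggestion above and m lower bound 50 is the value already
# in use; the other values are arbitrary illustration points, not recommendations.
for label, options in sweep_settings(c_values=[2, 3, 5], m_lower_values=[20, 50]):
    print(label, " ".join(options))
```

Each label can double as an output prefix so the resulting graph files are easy to compare, starting with whether any of them is non-empty.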

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your interest in ARCS!