aidenlab / 3d-dna

3D de novo assembly (3D DNA) pipeline
MIT License

3d-dna run splits the long-reads scaffolds into smaller chunks #75

Closed apn83 closed 4 years ago

apn83 commented 4 years ago

Hi, I have scaffolds built from PacBio long-read contigs and scaffolded with 10X data. My initial assembly is good, with 260 scaffolds and an N50 of 2 Mb. I now have Hi-C data and am aiming for a chromosome-scale assembly. I ran Juicer and got the following stats:

Sequenced Read Pairs:  83,275,196
 Normal Paired: 70,033,713 (84.10%)
 Chimeric Paired: 6,431,410 (7.72%)
 Chimeric Ambiguous: 4,124,430 (4.95%)
 Unmapped: 2,685,643 (3.23%)
 Ligation Motif Present: 205,405 (0.25%)
Alignable (Normal+Chimeric Paired): 76,465,123 (91.82%)
Unique Reads: 60,032,690 (72.09%)
PCR Duplicates: 16,401,035 (19.69%)
Optical Duplicates: 31,398 (0.04%)
Library Complexity Estimate: 151,590,890
Intra-fragment Reads: 35,733,222 (42.91% / 59.52%)
Below MAPQ Threshold: 11,600,830 (13.93% / 19.32%)
Hi-C Contacts: 12,698,638 (15.25% / 21.15%)
 Ligation Motif Present: 104,068  (0.12% / 0.17%)
 3' Bias (Long Range): 49% - 51%
 Pair Type %(L-I-O-R): 23% - 28% - 27% - 23%
Inter-chromosomal: 4,365,736  (5.24% / 7.27%)
Intra-chromosomal: 8,332,902  (10.01% / 13.88%)
Short Range (<20Kb): 8,212,061  (9.86% / 13.68%)
Long Range (>20Kb): 120,810  (0.15% / 0.20%)
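
For readers following the numbers: each pair of percentages in the stats above appears to be the same count expressed relative to two denominators. The sketch below assumes the first is the sequenced read pairs and the second is the unique reads, an assumption that reproduces the reported figures. The variable names are illustrative only and are not part of Juicer.

```python
# Counts copied from the Juicer stats above.
sequenced = 83_275_196      # Sequenced Read Pairs
unique = 60_032_690         # Unique Reads
hic_contacts = 12_698_638   # Hi-C Contacts
long_range = 120_810        # Long Range (>20Kb)


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumption: first percentage is relative to sequenced pairs,
# second is relative to unique reads.
hic_of_sequenced = pct(hic_contacts, sequenced)
hic_of_unique = pct(hic_contacts, unique)
long_of_sequenced = pct(long_range, sequenced)
long_of_unique = pct(long_range, unique)

print(f"Hi-C Contacts: {hic_of_sequenced}% / {hic_of_unique}%")
print(f"Long Range:    {long_of_sequenced}% / {long_of_unique}%")
```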

I then ran the 3d-dna pipeline, mainly aiming for superscaffolds:

bash $apath/run-asm-pipeline.sh $draft/scaffolds.Protein.masked.fasta $mnd/merged_nodups.txt

The final FASTA contains 1455 sequences, whereas SALSA2 gave me 230 sequences. Next, I ran only run-liger-scaffolder.sh, which produced an .asm file with a single line (I guess this is the order of the scaffolds). How do I proceed from here?

The cookbook mentions a problem with using long-read scaffolds as input. Is there any workaround? Kindly help, and thanks in advance!
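
The sequence counts quoted above (1455 versus 230) can be checked by counting header lines in a FASTA file. This is a minimal sketch, not part of 3d-dna or SALSA2; the function name and the demo records are made up for illustration.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory example (hypothetical records) so the sketch runs as-is.
demo = io.StringIO(">scaffold_1\nACGT\nACGT\n>scaffold_2\nGGCC\n")
n_records = count_fasta_records(demo)
print(n_records)
```

For a real assembly, the same function can be given an open file handle, for example `with open(path) as fh: count_fasta_records(fh)`, where `path` is a placeholder for the output FASTA.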

dudcha commented 4 years ago

This is answered on the forum, closing since not a bug. Thanks, -Olga