aidenlab / 3d-dna

3D de novo assembly (3D DNA) pipeline
MIT License
207 stars 55 forks source link

3d-dna degrades assembly? #74

Open melop opened 4 years ago

melop commented 4 years ago

Dear Olga,

I have now tried different parameters on my dataset. I started with a chromosome-level assembly built from linkage map. The default parameters severely degraded this assembly, which already appears to be pretty good on a coarse resolution (see iteration 0). I then changed the editor and polisher parameters to make it less aggressive in misjoin correction. To be specific, I used:

run-asm-pipeline.sh -m haploid -r 3 -q 1 -g 1000 --editor-coarse-stringency 50 --editor-repeat-coverage 5 --editor-coarse-resolution 250000 --editor-coarse-region 500000 --editor-fine-resolution 5000 --polisher-coarse-resolution 250000 --polisher-coarse-region 500000 --polisher-coarse-stringency 50 --polisher-fine-resolution 5000 --stage polish --build-gapped-map $Contigs $MNF

But still, as the iteration increases, there seems to be newly introduced misassemblies, including inserting some small scaffolds and inverting others. In the final polishing step, 2 chromosomes were split into pieces (the chromosome number is 19).

Are there other parameters I should tune?

Thank you!! nfz.pdf

dudcha commented 4 years ago

Hey melop,

Sorry not sure I follow everything that is going on. You are saying you've started with a draft that was scaffolded with linkage data. Is this what you call "0 iteration" pic? I.e. 0 is the visualization of the draft? Or this is 3d-dna round 0 on top of those? Are the small dots throughout the map in fact small misassemblies? I don't see annotations until split which makes me question what is going on here: 3d-dna by default produces a genome-wide scaffold all the way until split..

Also, since this is not a matter of bugs, perhaps it would make sense to move this discussion to the forum?

Best, Olga

melop commented 4 years ago

Hi Olga,

Iteration 0 should correspond to the output of step.0, which has been through one round of scaffolding but without before any correction. Therefore the 19 chromosomes are just what we have fed into the program. In this run I am using the default mapq cutoff = 1. This genome is highly repetitive so much so that 40% of reads are not reliably mapped. So it’s unclear if those dots are simply an artifact of mismapped repeats. but they could be small bits of misassemlies. If you follow the behavior of 3d-dna, there seems to be some unexpected behaviors, such as inserting scaffolds or mis oriented scaffolds, effectively breaking up the assembly. This seems to be particularly clear after the polishing step. the split algorithm (I displayed the blue boxes) also doesn’t seem to identify the chromosome boundaries. Another thing is that the estimated genome size is 1.5 Gb, but we are only able to assemble about 1Gb of sequences, which suggests that many repeats or gaps are contracted. So the HiC signal across these region would appear to drop down. Ray

dudcha commented 4 years ago

Ray,

My questions remain the same as is the request to move to forum. These are not discussions of bugs. Please share your out and err streams for this to be at all informative. I cannot tell from your images if .assemblies are loaded or not. I don't know what you mean by 'feeding 19 chromosomes into the program'. If you are running with default mapq cutoff these are not mismapped reads. You can zoom in on those questionable regions and figure out if they are misjoins: this is what 3d-dna dumps maps for. Use .wig and .bed files associated with each step to understand what gets annotated as misassemblies/chromosome boundaries etc. and why: too little data (tracks are unsaturated), coverage biases (too much gets annotated as repeats) etc.

Hope this helps, Olga

melop commented 4 years ago

Hi Olga, I think it's easier to just write you an email. I couldn't seem to find your email address, could you let me know? Thanks! What I meant was that the scaffold fasta we provided as the input to 3d-dna already contains 19 long scaffolds representing the 19 chromosomes of the 1.5 GB genome. These scaffolds were obtained by combining data from short PE reads, mate pairs reads, 10x genomics linked reads and genetic maps. We were indeed hoping to use 3d-dna to correct assembly errors. If I run 3d-dna with its default parameters, these previously assembled chromosomes are broken into thousands of debris. I have then tested a range of parameters by lowering the editor stringency and increasing the editor resolution. This seems to help to reduce the number of suspected break points (I did inspect the bed and wig tracks for each step during the process). But even after doing this, if you look at the contact map in the last steps, sometimes you see two pieces that should have been joined disconnected and sometimes two blocks inverted. So I am wondering what prevented 3d-dna from (re)joining those scaffolds in the correct orientation. From reading the manual it appears that the default mapq cutoff is 1 in 3d-dna? What is the distribution of mapping quality in the previous genomes that were successfully analyzed in 3d-dna? In our case we lose more than half of the data if we use a higher cutoff with map=15. But perhaps you meant even with potentially mis-mapped reads the signal should still prevail the noise? What happens if the restriction enzyme preferentially cut in certain repeats? Indeed we can see clear enrichments of certain sequences in the HiC library as shown by fastQC, which is already visible on a tapestation run. I see that SALSA for example normalizes the link counts based on the number of restriction sites found on scaffolds. Does 3d-dna have this option ?

Thank you for your help!

Best Regards Ray

dudcha notifications@github.com 于 2020年1月18日周六 上午12:23写道:

Ray,

My questions remain the same as is the request to move to forum. These are not discussions of bugs. Please share your out and err streams for this to be at all informative. I cannot tell from your images if .assemblies are loaded or not. I don't know what you mean by 'feeding 19 chromosomes into the program'. If you are running with default mapq cutoff these are not mismapped reads. You can zoom in on those questionable regions and figure out if they are misjoins: this is what 3d-dna dumps maps for. Use .wig and .bed files associated with each step to understand what gets annotated as misassemblies/chromosome boundaries etc. and why: too little data (tracks are unsaturated), coverage biases (too much gets annotated as repeats) etc.

Hope this helps, Olga

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/theaidenlab/3d-dna/issues/74?email_source=notifications&email_token=AAMVRQN6NOROGGQL4SZ2UH3Q6I4VJA5CNFSM4KHVR4F2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJJIEKQ#issuecomment-575832618, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMVRQOGF4V3GWMEA2EUSRTQ6I4VJANCNFSM4KHVR4FQ .