aidenlab / 3d-dna

3D de novo assembly (3D DNA) pipeline
MIT License

Unable to reproduce AaegL4 assembly #1

Closed skoren closed 6 years ago

skoren commented 7 years ago

I've downloaded and run the software and can successfully reproduce the NA12878 assembly. However, I can't reproduce the AaegL4 assembly from the paper using the publicly available data. I ran with the command:

sh 3d-dna/run-pipeline.sh -m diploid -t 10000 -s 9 -c 3 AaegL2.fasta GSE95797_AaegL2.mnd.txt

The md5s for the inputs are:

aeb2c78b41c8fc2039756d42dabe2805 AaegL2.fasta
eda1ae0aba0cf4b231c12b49de0df89e GSE95797_AaegL2
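For anyone wanting to confirm they have the same inputs, here is a minimal sketch of a checksum comparison. The helper name `md5_matches` is made up for illustration, and it assumes GNU coreutils `md5sum` is on the PATH.

```shell
# Hypothetical helper (not part of 3d-dna).
# Succeeds when the MD5 digest of file $2 equals the expected hex string $1.
md5_matches() {
  local actual
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(md5sum "$2" | cut -d' ' -f1)"
  [ "$actual" = "$1" ]
}
```

With the digest listed above, a check would look like `md5_matches aeb2c78b41c8fc2039756d42dabe2805 AaegL2.fasta && echo "AaegL2.fasta matches"`.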

As far as I can tell, the pipeline runs without error, but I end up with one unsplit scaffold of 1.1 Gbp and all the rest <2 Mb, rather than the expected 3 large scaffolds. I am guessing something is failing in the Rabl splitting code, but I didn't see any obvious errors in the output. Let me know if there are intermediate files or program output I can share to help diagnose this issue.
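A scaffold size distribution like the one described can be read off an output FASTA with a short helper. This is only a sketch: `fasta_lengths` is an illustrative name, and the file you pass it is whichever final FASTA your run produced (no specific output name is assumed here).

```shell
# Hypothetical helper (not part of 3d-dna).
# Prints "<length><TAB><name>" for every record in FASTA file $1, longest first.
fasta_lengths() {
  awk '
    /^>/ {
      # Header line: flush the previous record, then start a new one.
      if (name != "") print len "\t" name
      name = substr($1, 2)   # first word of the header, without ">"
      len = 0
      next
    }
    { len += length($0) }    # sequence line: accumulate bases
    END { if (name != "") print len "\t" name }
  ' "$1" | sort -k1,1nr
}
```

Piping the result through `head -n 5` shows the largest scaffolds, which is enough to tell one 1.1 Gbp scaffold apart from three large ones.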

dudcha commented 7 years ago

Hi Sergey,

Perhaps you can share the stdout? I can compare to the one we ran before pushing and try to find a divergence point.

Best, Olga
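One way to look for a divergence point between two saved logs is to report the first line at which they differ. The sketch below assumes both logs are plain-text files; `first_divergence` is an illustrative name, not something shipped with the pipeline.

```shell
# Hypothetical helper (not part of 3d-dna).
# Prints the number of the first line at which files $1 and $2 differ.
# Prints nothing when the files are identical or one is a prefix of the other.
first_divergence() {
  # In the C locale cmp reports "<a> <b> differ: byte N, line M"; keep only M.
  LC_ALL=C cmp "$1" "$2" 2>/dev/null | sed -n 's/.* line \([0-9][0-9]*\)$/\1/p'
}
```

Logs that contain timestamps or absolute paths will differ on the first such line, so those would need to be stripped before the comparison is meaningful.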

skoren commented 7 years ago

Sure, output attached. run.out.gz

dudcha commented 7 years ago

Sergey,

Something appears to go differently for you around step 8. Can you share the asm.7.cprops and asm.7.asm?

Best, Olga

skoren commented 7 years ago

Here is everything with the name *.7.* except the hic file, since that was >800 MB. step7.tar.gz

dudcha commented 7 years ago

Hi Sergey,

It seems the megascaffold output in step 7 is inverted in your run compared to what we have in our history, which leads to the downstream differences. I will need some time to investigate what could cause this and whether I can replicate it. Could you tell me your sort, awk, and parallel versions? Thank you,

Olga

skoren commented 7 years ago

$ sort --version
sort (GNU coreutils) 8.27
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and Paul Eggert.
$ parallel --version
GNU parallel 20150722
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015 Ole Tange
and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --bibtex'.
$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.

It seems my awk is older than the version listed in the README, so perhaps that is the issue. I can try updating awk and see what happens.

skoren commented 7 years ago

I re-ran the assembly with awk 4.0.2 and now get the same assembly as in the paper, so it seems something that changed between awk 3.1.7 and 4.0.2 is causing the difference in the assembly.
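A pre-flight check along these lines could catch an old awk before a long run. This is a sketch under stated assumptions: the 4.0.2 threshold is simply the version that worked in this thread (not a value quoted from the README), the function names are illustrative, and `sort -V` is a GNU coreutils extension.

```shell
# Assumption: threshold taken from this thread, where 3.1.7 diverged and 4.0.2 matched.
MIN_GAWK="4.0.2"

# Succeeds when version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extracts "3.1.7" from a banner line such as "GNU Awk 3.1.7".
gawk_version_from_banner() {
  printf '%s\n' "$1" | sed -n 's/^GNU Awk \([0-9][0-9.]*\).*/\1/p'
}

# Checks the awk found on PATH; returns 0 if new enough, 1 if too old,
# 2 if it is not GNU Awk or the banner could not be parsed.
check_gawk() {
  local banner version
  banner="$(awk --version 2>/dev/null | head -n 1)"
  version="$(gawk_version_from_banner "$banner")"
  if [ -z "$version" ]; then
    echo "awk is not GNU Awk, or its version could not be parsed" >&2
    return 2
  fi
  if version_ge "$version" "$MIN_GAWK"; then
    echo "GNU Awk $version: ok"
  else
    echo "GNU Awk $version is older than $MIN_GAWK" >&2
    return 1
  fi
}
```

Calling `check_gawk || exit 1` before `run-pipeline.sh` would stop a run that uses the older awk.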

dudcha commented 7 years ago

Sergey, Thank you for letting us know. With best wishes, Olga