bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss
Other
310 stars 107 forks source link

Using abyss-scaffold to scaffold a DISCOVAR de novo assembly #121

Closed mebbert closed 6 years ago

mebbert commented 8 years ago

Hi,

I really appreciate your assistance. I'm trying to use instructions you provided on how to "trick" Abyss into scaffolding a Discovar de novo assembly, as posted here: Using the ABySS pipeline to scaffold a DISCOVAR de novo assembly. I have tried several approaches, so I hope you can follow what I'm doing. Please let me know if I can clarify anything!

I first tried using DDN's a.lines.fasta as the '-6' file, but AdjList complained about 'N' being an invalid character. That makes sense, since that's DDN's scaffold. After that, used DDN's a.fasta. I also used the awk statement provided at the BioStars post to change the fasta headers to match Abyss's requirements: awk '/^>/{print ">" ++i;next}{print}' < ../a.final/a.fasta > a-6.fa.

I then ran AdjList -v -k31 -m50 --dot a-6.fa > a-6.dot (where -k is an arbitrary value since it's tough to know what the best value is). After a short run, AdjList failed with the following error:

error: unexpected character: '>' AdjList: Sequence.cpp:103: uint8_t baseToCode(char): Assertion `false' failed.

I interrogated a-6.fa to see if there were any aberrant '>' characters within the file, but I didn't find any. I then found that DDN also gives out a .dot file, so I gave up on running AdjList on a-6.fa. I'm not certain if it's appropriate, but I used DDN's a.lines file for a-6.dot. Well, I never got far enough to see if that would satisfy Abyss's needs. When I ran Abyss with the -n parameter along with the -o a-6.fa -o a-6.dot, Abyss still wants to run AdjList on a-1.fa.

Questions:

  1. Any idea why AdjList would complain about a > character?
  2. How do I keep Abyss from trying to run AdjList on a-1.fa since I don't have it?
  3. Do you know if I can use DDN's a.lines as the .dot file?
  4. Any recommendation on the -k parameter?

I appreciate any advice you can provide. Take care!

Mark

sjackman commented 8 years ago

Hi, Mark.

Any recommendation on the -k parameter?

I believe (I forget where I read this) that DISCOVARdenovo internally uses k=200, so try AdjList -k200 -m50.

error: unexpected character: '>' AdjList: Sequence.cpp:103: uint8_t baseToCode(char): Assertion `false' failed.

Any idea why AdjList would complain about a > character?

To troubleshoot this error, try:

abyss-tofastq a-6.fa | wc

I then found that DDN also gives out a .dot file, so I gave up on running AdjList on a-6.fa. I'm not certain if it's appropriate, but I used DDN's a.lines file for a-6.dot. Do you know if I can use DDN's a.lines as the .dot file?

DISCOVARdenovo 52488 did not give me a dot file. Are you certain the dot file came from DDN?

For me, the a.lines file is a binary blob, not a GraphViz .dot file.

When I ran Abyss with the -n parameter along with the -o a-6.fa -o a-6.dot, Abyss still wants to run AdjList on a-1.fa. How do I keep Abyss from trying to run AdjList on a-1.fa since I don't have it?

It's surprisingly frustrating trying to get make to start from that stage and ignore previous stages. Try this command:

https://github.com/sjackman/giab/blob/master/discovardenovo/Makefile#L67-L70

mebbert commented 8 years ago

I really appreciate the feed back!

I originally tried -k200 but AdjList gave the following error, suggesting the max is 96:

AdjList: ../Common/Kmer.h:49: static void Kmer::setLength(unsigned int): Assertion `length <= 96' failed.

I guess I'll try 96.

The command abyss-tofastq a-6.fa | wc fails with the same error:

error: unexpected character: '>' abyss-tofastq: Sequence.cpp:103: uint8_t baseToCode(char): Assertion `false' failed.

Regarding the .dot file, I haven't ever looked at one. DDN's FAQ states the following:

DISCOVAR generates dot files which may be viewed with Graphviz.

I wasn't sure which file, but the description of a.lines sounded like it could be. :) I went through all of the output and didn't see anything else. After looking up a dot file example, I'm not sure DDN actually provides it. Thanks!

Thanks for pointing to me to that command. It looks like it depends on some "intermediate" make file. I'm not sure what exactly that should look like. Am I missing something?

Thanks again!

sjackman commented 8 years ago

I originally tried -k200 but AdjList gave the following error, suggesting the max is 96:

You'll need to recompile ABySS using configure --enable-maxk=256. You can also do this using Homebrew / Linuxbrew with brew install homebrew/science/abyss --with-maxk=256

sjackman commented 8 years ago

Thanks for pointing to me to that command. It looks like it depends on some "intermediate" make file.

Here you go: https://github.com/sjackman/giab/blob/master/discovardenovo/intermediate.mk

It has the same effect as adding the -o option to abyss-pe (which is actually make) for all the named intermediate files.

mebbert commented 8 years ago

Thanks again! I noticed in your makefile that you tried using BESST, too. What are your impressions on that scaffolder?

sjackman commented 8 years ago

I was pretty happy with it. I found it easy to run too. Check out the ABySS 2.0 paper Figure 2B and Table 2 for the results on this one genome with ABySS-Scaffold, LINKS and BESST. http://biorxiv.org/content/early/2016/08/07/068338

mebbert commented 8 years ago

Interesting! I'd like to do a simple comparison between ABySS 2.0 scaffolder and BESST on our genome. I'm happy to share the data with you and get your feedback, if you're interested. Based on your paper, it looks like ABySS 2.0 is easier to use only the scaffolder? Is that accurate? As I mentioned, we have a DDN assembly.

sjackman commented 8 years ago

Yes, please do try both scaffolders on your genome. Be sure to use the option -S200-20000 to try a range of values for S and pick the one that gives the best N50. I'd like to add a feature to do the same for optimizing N.

No, unfortunately. It's not any easier to run ABySS-Scaffold standalone in ABySS 2.0. I would like to make that happen though.

mebbert commented 8 years ago

Oh, bummer. I'd really like to try it, but it has been a loooong time since I've dealt with makefiles. I gave up on your previous instructions. :-/ Also, I couldn't figure the problem with the > character with AdjList. If you're willing to help me figure it out, I'd love to give it a shot.

sjackman commented 8 years ago

If AdjList isn't working for you, you can use abyss-todot to create a fake overlap graph that contains only the contigs, but not their overlaps.

Try the following shell script. It takes as input a FASTA file of contigs named assembly-contigs.fa and a file of mate-pair reads named mp6k_1.fq mp6k_2.fq and produces a FASTA file named assembly-scaffolds.fa.

abyss-todot -k200 assembly-contigs.fa >assembly-contigs.dot
abyss-map -j4 -l50 mp6k_1.fq mp6k_2.fq assembly-contigs.fa \
    | abyss-fixmate -l50 -h mp6k-6.hist \
    | sort -snk3 -k4 \
    | gzip >mp6k-6.sam.gz
gunzip -c mp6k-6.sam.gz \
    | DistanceEst --dot --mean -j4 -k200 -l50 -s200-20000 -n1 -o mp6k-6.dist.dot mp6k-6.hist
abyss-scaffold -k200 -s200 -n10 -g assembly-6.path.dot assembly-6.dot mp6k-6.dist.dot >assembly-6.path
MergeContigs -k200 -o assembly-scaffolds.fa assembly-contigs.fa assembly-6.dot assembly-6.path

DistanceEst also identifies contigs that actually overlap (have a negative distance), but it won't merge the overlapping sequence unless the two contigs also has an overlap edge in the overlap graph (which it won't if the overlap graph is empty)

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Leo-alves commented 6 years ago

Hi there, I've come across the "unexpected characater: " error. After a few days trying to understand/solve it I could find some tricky things, but still no fixing.

(paired end reads, same order, headers end with /1 /2 and all reads are paired) mpirun --oversubscribe -np 12 abyss-pe -C out_abyss name=nana k=32 in='~/read1.250bp.fastq.gz ~/read2.250bp.fastq.gz' [...] Assembled 6445199 k-mer in 3283 contigs. Removed 34889148 k-mer. The signal-to-noise ratio (SNR) is -7.24406 dB. Pruning tips shorter than 8 bp... Pruning tips shorter than 32 bp... AdjList -k32 -m50 --dot nana-1.fa >nana-1.dot error: unexpected character: ' (my comment: depending on the input reads it prints '>') AdjList: Sequence.cpp:103: uint8_t baseToCode(char): Assertion false failed. make: [nana-1.dot] Aborted make: Deleting file `nana-1.dot' [...]

I've realized the -1.fa files are sometimes binary (because of a bug?). Adjlist calls the (binary) file with the following command: Adjlist -k32 -m50 --dot nana-1.fa >nana-1.path Looking into other ...-1.fa files that I have from previous attempts to run Abyss (with different input reads) I could have found they are NOT binaries.

grep ">" bla-1.fa '>0 201 276 '>1 233 340

However, grep ">" nana-1.fa Binary file nana-1.fa matches

I then went deep into the binary file and have found two things: 1) if I cat or open it with VI and them :wq! it just turn into text file and never come back to binary again!! grep ">" nana-1.fa Binary file nana-1.fa matches cat nana-1.fa '>0 34 235 ACCTACTTCCGCAACGACTACAAGAACAAGATCG '>1 1036 69086 CCATCCGCCGGCCCAGCTCTACCAGTTGGACGTC and so on... then again grep ">" nana-1.fa '>0 34 235 '>1 1036 69086

Interestingly, for a few lines the headers are merged to the sequence, like that: ... '>443 1591 GGGTTCGGGCTTTCGTCTCGA... and so on ...

2) Opening with vi brought me that: in those lines the headers are text but the sequence is PART or FULLY binary. vi nana-1.fa ... '>443 1591 ^@^@^@^@^@^@^@^@^@^@^@^@.. and so on ... (I am not printing the part binary because it's large) Problem is that the next sequence is not >444 but >1118 instead, which means I am loosing hundreads of sequences.

I then cut the binary parts of the nana-1.fa file and run again Adjlist and this time it worked: Adjlist -k32 -m50 -v --dot nana-1.fa Reading `nana-1.fa'... Finding overlaps of exactly k-1 bp... V=2838 E=1736 E/V=0.612 Degree: ██_ 01234 0: 47% 1: 44% 2-4: 8.3% 5+: 0% max: 3 digraph adj { graph [k=32] edge [d=-31] "0+" [l=34 C=235] "0-" [l=34 C=235] "1+" [l=1036 C=69086] ... and so on

However, running abyss-pe again returned another error.

make: Entering directory ~abyss-2.0.3/out_abyss AdjList -k32 -m50 --dot nana-1.fa >nana-1.dot make: Entering directory ~/abyss-2.0.3/out_abyss AdjList -k32 -m50 --dot nana-1.fa >nana-1.dot make: Entering directory ~/abyss-2.0.3/out_abyss AdjList -k32 -m50 --dot nana-1.fa >nana-1.dot make: Entering directory ~/abyss-2.0.3/out_abyss AdjList -k32 -m50 --dot nana-1.fa >nana-1.dot make: Entering directory ~/abyss-2.0.3/out_abyss abyss-filtergraph --dot -k32 -g nana-2.dot1 nana-1.dot nana-1.fa >nana-1.path make: Entering directory ~/abyss-2.0.3/out_abyss abyss-filtergraph --dot -k32 -g nana-2.dot1 nana-1.dot nana-1.fa >nana-1.path make: Entering directory ~/abyss-2.0.3/out_abyss abyss-filtergraph --dot -k32 -g nana-2.dot1 nana-1.dot nana-1.fa >nana-1.path abyss-filtergraph: ../Graph/AdjIO.h:106: std::istream& read_adj(std::istream&, ContigGraph&, BetterEP) [with Graph = DirectedGraph<ContigProperties, Distance>; BetterEP = DisallowParallelEdges; std::istream = std::basic_istream]: Assertion num_vertices(g) > 0 failed. make: [nana-1.path] Aborted make: Deleting file nana-1.path make: Leaving directory ~/abyss-2.0.3/out_abyss abyss-filtergraph: ../Graph/AdjIO.h:106: std::istream& read_adj(std::istream&, ContigGraph&, BetterEP) [with Graph = DirectedGraph<ContigProperties, Distance>; BetterEP = DisallowParallelEdges; std::istream = std::basic_istream]: Assertion num_vertices(g) > 0 failed.

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

make: [nana-1.path] Aborted make: Deleting file nana-1.dot make: Deleting file nana-1.dot make: [nana-1.dot] Terminated make: Deleting file nana-1.dot make: [nana-1.dot] Terminated make: Deleting file nana-1.dot make: [nana-1.dot] Terminated make: [nana-2.dot1] Terminated make: [nana-1.path] Terminated make: *** [nana-1.path] Terminated make: Leaving directory ~/abyss-2.0.3/out_abyss make: Entering directory ~/abyss-2.0.3/out_abyss abyss-filtergraph --dot -k32 -g nana-2.dot1 nana-1.dot nana-1.fa >nana-1.path make: Entering directory ~/abyss-2.0.3/out_abyss abyss-filtergraph --dot -k32 -g nana-2.dot1 nana-1.dot nana-1.fa >nana-1.path

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[36561,1],5] Exit code: 2

Well, I am loosing contigs from the .fa file and don't know how to fix it. I am still returning errors indeed. Could you be so kind as to help me fixing it?

benvvalk commented 6 years ago

@Leo-alves I just want to check... Is your problem actually related to the subject of this ticket (scaffolding DISCOVAR assemblies with ABYSS-Scaffold)? If not, can you please create a new ticket with an appropriate subject line?

One issue I notice with your command line is that you are not supposed to mpirun directly with abyss-pe -- the abyss-pe script already calls mpirun internally. To enable an MPI run of abyss-pe, you should add the np parameter to your abyss-pe command to indicate the number of MPI processes to use (e.g. np=32).

If you have multiple processes all writing to the same output files (due to nested invocations of mpirun), that would explain the garbled binary content in your files.

Leo-alves commented 6 years ago

Dear @benvvalk, thanks for replying. Yes, the problem may be involved with the previous subject of this thread: "AdjList unexpected character: '>' ". I am moving to a new one. Regarding the command line, sorry but I don't think I clearly understand. Should I not call mpirun --oversubscribe --np 12 abyss-pe... but do mpirun --oversubscribe abyss-pe np=12 ?

benvvalk commented 6 years ago

@Leo-alves

I mean that you should run it like:

abyss-pe -C out_abyss np=12 name=nana k=32 in='~/read1.250bp.fastq.gz ~/read2.250bp.fastq.gz'

instead of:

mpirun --oversubscribe -np 12 abyss-pe -C out_abyss name=nana k=32 in='~/read1.250bp.fastq.gz ~/read2.250bp.fastq.gz'

Please see https://github.com/bcgsc/abyss#parallel-processing for a further discussion.

Leo-alves commented 6 years ago

Thank you @benvvalk. Got it now! I am giving up of opening a new thread for this particular problem because I was running the command the wrong way and that is the why I've got the error.

benvvalk commented 6 years ago

@Leo-alves Sure, no problem. Good luck!

verofumero commented 6 years ago

Dear Ben @benvvalk,

I am trying to do scaffolding with contigs from a DDN assembly I did last year (I am also re-assembling my reads with Abbys). My question is which are the input files I need? Do I need to have mate-pair sequencing as input? or, can I use the DDN-contigs and do scaffolding of reads against them? Another option, can I use scaffolds from an incomplete reference genome (same species than mine) which is online available?

Very thanks,

Vero.-

sjackman commented 6 years ago

You can scaffold using abyss-scaffold with your Discovar de novo assembly and ideally mate-pair reads (be sure to run nxTrim if they're Nextera reads). If you have 10x Genomics Chromium then use ARCS. If you have Oxford Nanopore, then use LINKS. I'm not terribly familiar with reference-based scaffolding, but we have a tool RAILS and Cobbler for that purpose. https://github.com/bcgsc/arcs https://github.com/bcgsc/links https://github.com/bcgsc/rails

benvvalk commented 6 years ago

Thanks, @sjackman!

verofumero commented 6 years ago

Thanks a lot! I will check references,

Vero.-

verofumero commented 6 years ago

Hello! I am trying abyss-map and getting this error:

abyss-map -v 298_1.fastq 298_2.fastq 298.fa Reading `298.fa'... Assertion failed: (c == '>'), function index, file ../DataLayer/FastaIndex.h, line 83. Abort trap: 6

What does it means? Can you help me, please? Thanks!

sjackman commented 6 years ago

The FASTA file must not have line breaks. Sorry for the cryptic error message. Try…

seqtk seq 298.fa >298.seqtk.fa

See https://github.com/lh3/seqtk

verofumero commented 6 years ago

Thans Shaun! I am happy having your replies!

I am having another issue, since runnig abyss-map seems to go good but then I can not find any output files. I was reading some old posts like this:

Hi Nat,

The abyss-map algorithm finds the longest perfectly matching common substring of the query sequence and the target sequence. In other words, it's a local alignment requiring 100% identity. BWA and bowtie are both glocal aligners, meaning the entirety of the query sequence must be matched. abyss-map is good at mapping to small contigs (possibly shorter than the read length) and mapping reads that overhang past the end of the contig.

The SAM output by abyss-map should be standard SAM. If you find any issue, please report it to me. Note that the sequence and quality columns are both set to * by default, as ABySS does not require this information. This can be changed as a compile-time option.

The target sequence must be smaller than 6 exabytes.

Cheers, Shaun ...and it seems to be that a .SAM file is generated by abyss... what can be the problem in my case? Do you have any idea? Storage available space maybe?

Thanks again,

Vero.-

Dra. María Verónica Fumero Área Micología Depto. Microbiología e Inmunología Facultad de Cs. Exactas, FQ y Naturales Universidad Nacional de Rio Cuarto Tel.: +54 (0358) 4676 429 Ruta Nacional 36, Km 601 C.P.: 5800, Rio Cuarto, Cordoba, Argentina

On Wed, Jun 20, 2018 at 3:39 PM, Shaun Jackman notifications@github.com wrote:

The FASTA file must not have line breaks. Sorry for the cryptic error message. Try…

seqtk seq 298.fa >298.seqtk.fa

See https://github.com/lh3/seqtk

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bcgsc/abyss/issues/121#issuecomment-398854215, or mute the thread https://github.com/notifications/unsubscribe-auth/Amf7zUBj9gcDWdOXkn-FzKO_CyY093sPks5t-pb5gaJpZM4JfXgq .

sjackman commented 6 years ago

The SAM output is written to standard output (stdout).

abyss-map -v 298_1.fastq 298_2.fastq 298.fa | gzip >298.sam.gz

I recommend using pigz, which is parallelized, rather than gzip.