bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss

assembling human genome, get very low N50 and short contigs <1kb #304

Closed ghost closed 4 years ago

ghost commented 5 years ago

Build

ABySS and its dependencies (openmpi, boost, sparsehash) were installed with conda install. Note that conda can apparently build ABySS silently without satisfying all requirements (e.g. openmpi was missing in my first install attempt), so I'm not sure whether anything else is missing. I don't have sudo privileges on the server, so sudo apt install was not an option.

Assembly error

I tried to assemble a paired-end Illumina HiSeq 2000 WGS library of human sample NA12878 (specifically, the fastq files retrieved from SRR622457) on a Slurm node where 32 GB of memory and 8 CPUs were allocated for the job. The command line was abyss-pe np=8 name=NA12878_SRR622457_1.fastq.gz k=96 in='/lab/usrname/project/SR_fastqs/NA12878_SRR622457_1.fastq.gz /lab/usrname/project/SR_fastqs/NA12878_SRR622457_2.fastq.gz'.

The assembly finished; the logs are in stdout and stderr. The file list looks like this:

    (base) usrname@sever: /lab/usrname/project/SR_fastqs > ls -lh | grep NA12878_SRR622457
    -rw-r--r-- 1 usrname lab 48G Oct 7 12:52 NA12878_SRR622457_1.fastq.gz
    -rw-r--r-- 1 usrname lab 518M Oct 20 23:49 NA12878_SRR622457_1.fastq.gz-1.dot
    -rw-r--r-- 1 usrname lab 2.3G Oct 20 23:47 NA12878_SRR622457_1.fastq.gz-1.fa
    -rw-r--r-- 1 usrname lab 0 Oct 20 23:49 NA12878_SRR622457_1.fastq.gz-1.path
    -rw-r--r-- 1 usrname lab 473M Oct 20 23:51 NA12878_SRR622457_1.fastq.gz-2.dot
    -rw-r--r-- 1 usrname lab 473M Oct 20 23:50 NA12878_SRR622457_1.fastq.gz-2.dot1
    -rw-r--r-- 1 usrname lab 2.2G Oct 20 23:51 NA12878_SRR622457_1.fastq.gz-2.fa
    -rw-r--r-- 1 usrname lab 83K Oct 20 23:52 NA12878_SRR622457_1.fastq.gz-2.path
    -rw-r--r-- 1 usrname lab 680K Oct 21 13:57 NA12878_SRR622457_1.fastq.gz-3.dist
    -rw-r--r-- 1 usrname lab 472M Oct 20 23:52 NA12878_SRR622457_1.fastq.gz-3.dot
    -rw-r--r-- 1 usrname lab 2.2G Oct 20 23:53 NA12878_SRR622457_1.fastq.gz-3.fa
    -rw-r--r-- 1 usrname lab 235M Oct 21 13:59 NA12878_SRR622457_1.fastq.gz-3.fa.fai
    -rw-r--r-- 1 usrname lab 18K Oct 21 10:19 NA12878_SRR622457_1.fastq.gz-3.hist
    -rw-r--r-- 1 usrname lab 473M Oct 21 13:58 NA12878_SRR622457_1.fastq.gz-4.dot
    -rw-r--r-- 1 usrname lab 485K Oct 21 13:58 NA12878_SRR622457_1.fastq.gz-4.fa
    -rw-r--r-- 1 usrname lab 46K Oct 21 13:59 NA12878_SRR622457_1.fastq.gz-4.fa.fai
    -rw-r--r-- 1 usrname lab 108K Oct 21 13:59 NA12878_SRR622457_1.fastq.gz-4.path1
    -rw-r--r-- 1 usrname lab 102K Oct 21 13:59 NA12878_SRR622457_1.fastq.gz-4.path2
    -rw-r--r-- 1 usrname lab 101K Oct 21 14:00 NA12878_SRR622457_1.fastq.gz-4.path3
    -rw-r--r-- 1 usrname lab 473M Oct 21 14:02 NA12878_SRR622457_1.fastq.gz-5.dot
    -rw-r--r-- 1 usrname lab 0 Oct 21 14:02 NA12878_SRR622457_1.fastq.gz-5.fa
    -rw-r--r-- 1 usrname lab 101K Oct 21 14:02 NA12878_SRR622457_1.fastq.gz-5.path
    -rw-r--r-- 1 usrname lab 0 Oct 21 14:05 NA12878_SRR622457_1.fastq.gz-6.dist.dot
    -rw-r--r-- 1 usrname lab 472M Oct 21 14:05 NA12878_SRR622457_1.fastq.gz-6.dot
    -rw-r--r-- 1 usrname lab 2.2G Oct 21 14:04 NA12878_SRR622457_1.fastq.gz-6.fa
    -rw-r--r-- 1 usrname lab 5.4M Oct 20 23:47 NA12878_SRR622457_1.fastq.gz-bubbles.fa
    lrwxrwxrwx 1 usrname lab 34 Oct 21 14:05 NA12878_SRR622457_1.fastq.gz-contigs.dot -> NA12878_SRR622457_1.fastq.gz-6.dot
    lrwxrwxrwx 1 usrname lab 33 Oct 21 14:04 NA12878_SRR622457_1.fastq.gz-contigs.fa -> NA12878_SRR622457_1.fastq.gz-6.fa
    -rw-r--r-- 1 usrname lab 487K Oct 20 23:53 NA12878_SRR622457_1.fastq.gz-indel.fa
    lrwxrwxrwx 1 usrname lab 33 Oct 20 23:53 NA12878_SRR622457_1.fastq.gz-unitigs.fa -> NA12878_SRR622457_1.fastq.gz-3.fa
    -rw-r--r-- 1 usrname lab 67G Oct 7 13:24 NA12878_SRR622457_2.fastq.gz

However, the sequences in the contigs.fa and unitigs.fa files were shorter than expected, with an average length of only a few hundred bp. abyss-fac NA12878_SRR622457_1.fastq.gz-unitigs.fa gives:

    n        n:500   L50     min  N75  N50  N25  E-size  max    sum
    8122750  258436  109074  500  540  599  703  679     16424  161.7e6
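For readers unfamiliar with these columns, here is a minimal Python sketch of how N50 and L50 are conventionally defined. It assumes, as the n:500 and min columns suggest, that sequences shorter than 500 bp are ignored. The function name and the toy lengths are made up for illustration, and this is not abyss-fac's actual code.

```python
def n50_l50(lengths, min_length=500):
    """Return (N50, L50) over sequences that are at least min_length bp long.

    N50 is the length of the sequence at which the running total, taken from
    longest to shortest, first reaches half of the summed length. L50 is how
    many sequences were needed to get there.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if 2 * running >= total:
            return length, count
    return 0, 0  # nothing passed the length threshold


# Toy example with made-up lengths; the 100 bp sequence is ignored.
toy_lengths = [100, 500, 600, 700, 900, 1300]
print(n50_l50(toy_lengths))  # (900, 2)
```

For scale, the sum column (161.7e6 bp) is roughly 5% of a human genome of about 3.1 Gbp, so most of the assembled bases here sit in sequences shorter than 500 bp.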

What could the problem be, and how do I fix it? Thanks!

jwcodee commented 5 years ago

I think the issue is that you are running ABYSS-P with 32 GB of RAM. In the log file, I see Loaded 3566136800 k-mer. At least 143 GB of RAM is required. I suggest running abyss-pe in Bloom filter mode and specifying a Bloom filter size of 25 GB (to leave room for other processes) with the B parameter. I also think you aren't scaffolding your reads.

Try this command: abyss-pe j=8 name=NA12878_SRR622457 k=96 B=25G lib='pe' pe='/lab/usrname/project/SR_fastqs/NA12878_SRR622457_1.fastq.gz /lab/usrname/project/SR_fastqs/NA12878_SRR622457_2.fastq.gz'.

If possible, I would also suggest running on a machine with more memory. The least memory I've run ABySS with is 40 GB.
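The two numbers in that log line can be put side by side with the 32 GB allocation from the original report. The Python sketch below is only back-of-the-envelope arithmetic: it assumes 1 GB means 10^9 bytes, and the function name is hypothetical. It says nothing about how ABySS lays out memory internally.

```python
def kmer_memory_summary(kmers_loaded, required_gb, allocated_gb):
    """Return (bytes per k-mer implied by the log, required/allocated ratio)."""
    bytes_per_kmer = required_gb * 1e9 / kmers_loaded  # assumes 1 GB = 1e9 bytes
    shortfall_ratio = required_gb / allocated_gb
    return bytes_per_kmer, shortfall_ratio


# Values quoted in this thread: k-mers loaded, RAM required, RAM allocated.
per_kmer, ratio = kmer_memory_summary(3_566_136_800, 143, 32)
print(f"about {per_kmer:.1f} bytes per k-mer, {ratio:.1f}x the allocated memory")
```

Under these assumptions the log implies about 40 bytes per k-mer, and the job asked for roughly 4.5 times the memory it was given.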

ghost commented 5 years ago

@jowong4 Thanks, I overlooked that line. Some of my jobs assembling other samples were killed, presumably because of insufficient memory (signal 9), so I thought this one was just a benign warning.

An extra question regarding the command you suggested: are lib='pe' pe='...' and in='...' interchangeable for paired short-read data? With the in='...' flag, ABySS was happy with the naming format of my fastq files, where the reads in the first file were:

    @SRR622457.1 1/1
    @SRR622457.2 2/1
    @SRR622457.3 3/1
    @SRR622457.4 4/1
    @SRR622457.5 5/1

and in the second file:

    @SRR622457.1 1/2
    @SRR622457.2 2/2
    @SRR622457.3 3/2
    @SRR622457.4 4/2
    @SRR622457.5 5/2

But lib='pe' pe='...' appears to be stricter and throws abyss-fixmate: error: all reads are mateless. I have a bunch of samples to process, so if ABySS could infer the pairing information with in='...', I'd prefer not to rename all the reads. (I'm trying the Bloom filter plus more memory with in='...' and will keep you posted.)
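If renaming does turn out to be necessary, one option is to fold the trailing 1/1 or 1/2 comment into the read name as a /1 or /2 suffix. The Python sketch below assumes that this suffix convention is what the pairing step wants, which this thread does not confirm; the function names are hypothetical. It only touches every fourth line, because FASTQ quality lines can also start with @.

```python
import re

# Matches headers such as "@SRR622457.1 1/1": name, whitespace, index/mate.
_HEADER = re.compile(r"^@(\S+)\s+\d+/([12])$")


def normalize_header(header):
    """Turn '@SRR622457.1 1/1' into '@SRR622457.1/1'; leave other headers as-is."""
    text = header.rstrip("\n")
    match = _HEADER.match(text)
    if match is None:
        return text
    return "@{}/{}".format(match.group(1), match.group(2))


def rewrite_fastq(lines):
    """Yield FASTQ lines with only the header (every 4th line) rewritten."""
    for index, line in enumerate(lines):
        text = line.rstrip("\n")
        yield normalize_header(text) if index % 4 == 0 else text


# Toy record: the quality line deliberately starts with '@' and ends in '1/1'.
record = ["@SRR622457.1 1/1", "ACGT", "+", "@III 1/1"]
print(list(rewrite_fastq(record)))
```

For real data the same generator could be fed from gzip.open(path, "rt") and written back line by line, so the files never need to fit in memory.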

For scaffolding, I assume it needs extra data if the Illumina paired-end data themselves do not count as 'long-distance mate-pair libraries'. Although I'm working on NA12878 and a few benchmark-style samples, we expect to use ABySS on datasets that don't come with such extra data.

mmokrejs commented 5 years ago

@arandomlettuce Either run your fastq files through a tool that double-checks that the reads are in the same order in both files and moves singletons out; then the tool (or you) can edit the read names so that they are the same in both mates. Or, and this is what I recommend, re-fetch the data from the NCBI SRA database with fastq-dump (from sratoolkit) using the --origfmt option. Most likely you will get the usual read names, provided the original submitter did not mangle them.
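The ordering check described in the first option could look roughly like the Python sketch below. It is not any particular tool's implementation: the function names are hypothetical, and it assumes that a read's identity is the first whitespace-delimited token of the header with any trailing /1 or /2 removed.

```python
from itertools import zip_longest


def read_id(header):
    """Return the read name without '@', comment, or a trailing /1 or /2."""
    token = header[1:].split()[0]
    if token.endswith(("/1", "/2")):
        token = token[:-2]
    return token


def first_mismatch(headers1, headers2):
    """Return the index of the first unpaired record, or None if all pair up.

    A mismatch is either a differing read name at the same position or one
    file running out of records before the other (a singleton at the end).
    """
    for index, (h1, h2) in enumerate(zip_longest(headers1, headers2)):
        if h1 is None or h2 is None or read_id(h1) != read_id(h2):
            return index
    return None


# Header lines only, in the format quoted earlier in this thread.
mates1 = ["@SRR622457.1 1/1", "@SRR622457.2 2/1"]
mates2 = ["@SRR622457.1 1/2", "@SRR622457.2 2/2"]
print(first_mismatch(mates1, mates2))  # None
```

This only reports where pairing first breaks; moving singletons out, as suggested above, would need a second pass.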

ghost commented 5 years ago

@mmokrejs Thank you, that is indeed required. I found this morning that although abyss-pe didn't complain about the naming at startup with the in='...' flag, the job aborted at a later stage (with a .fastq.gz-6.dot file not found error). I will fix the names.

ghost commented 4 years ago

I'm closing this for now, as it's mostly a problem with my configuration, and we decided to prioritize our compute resources for other work in the meantime (and may get back to this much later). Thanks for the help!