bcgsc / LINKS

⛓ Long Interval Nucleotide K-mer Scaffolder
GNU General Public License v3.0

working with 2 Gb genome; stuck on bloom filter being built #20

Closed devonorourke closed 6 years ago

devonorourke commented 6 years ago

Hi Rene, I've tried submitting a LINKS job using a few different parameters (including your recent suggestion of modifying the -t and -d parameters), but unfortunately my observations remain the same: the program has been stuck on building the Bloom filter for at least 24 hours. Perhaps it needs even longer before any additional messages are written to the .log file?

I have access to a variety of machines on our computing cluster with anywhere from 128 GB to 995 GB of RAM. I've been using the more readily available smaller machines to run LINKS so far, but I don't get the sense that the issue is memory at the moment.

I'm working with a 2 Gb genome and trying to close up an existing assembly with Nanopore reads. I have about 30 Gb of Nanopore data thus far. The two commands I've tried, both of which left me in the same holding pattern at the Bloom filter step, were:

(default parameters)

LINKS -f {my.fasta} -s {myreadslist}

then, following your suggestions as well as modifying the k-mer size:

LINKS -f {my.fasta} -s {myreadslist} -t 50 -d 5000 -k23

Perhaps I need to provide more memory? I've attached a typical .log file below. Thank you for any information you can provide that might help me get past the Bloom filter step.

Running: /mnt/lustre/software/linuxbrew/colsa/bin/LINKS [v1.8.6]
-f mylu.fa
-s readsfile
-m 
-d 5000
-k 23
-e 0.1
-l 5
-a 0.3
-t 50
-o 0
-z 500
-b mylu
-r 
-p 0.001
-x 0

----------------- Verifying files -----------------

Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_71710fab0f2bebd2a96ab42e073d0292ec138392.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_fa6f064a9f2c29d5f8dcdd435c5cec74e8bf3a04.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_8a1d68cce34fae5417a223dfd574649ed98d966a.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_6642b51f73438e475c0b83bd2fd69e99a632038e.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_f77e1c24e730f2e852318bbe697ec6c8420a1a85.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_76a6fa268897bf1a1be7a57f36b1fa6d4b6e9c24.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_4062f55aba7613d771243721850b42921b756677.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_93db73a2dc228333583e0ebe7acc862ca86ef30f.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_00f228b639289da986865a2d08a3a26f0d8aa592.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run1/workspace/pass/fastq_runid_58b24ffcdefabc7d926abbfc0522bb0c7800989f.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_5e30f9268bf9cb4bfd5563130c37f4027d287302.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_6ddb3559f7f102736fccb5ad6f4ec8e0d3bbe70d.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_f8b124142fd75e2fc0fcd28fe9e9d4bd944b0d02.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_d049807b20a620d75ac2de2c114428519c6c1942.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_0cb08f98dc433c79323d84e38d2f88c86a13e2f9.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_96d81b473fa5f5ca9e18c0ee474cb65ebb4d0ba7.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_781736dad98b46b7f47e23a496400586cdc75822.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run2/workspace/pass/fastq_runid_afa5caddbf6e48ff1c9dd653cdc72ee99e4b7b30.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_9b35438e109c2ee6882da220324f3d410586b708.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_ff1e252c580268da42ea57ecb730cfcc39503e87.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_1e4c06d79f4dd9b73d2ed6c270bc31dc5d28e964.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_c279c1faf060eb3dab57a9236d0c086bb03e8840.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_e41283aab878f4ab890838b5799bc5a93fb19531.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_72905c435ba6356f56ec14b83d0e428ad3976a56.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_7bce09f51eff449030d198d1fd8431b66e42f17a.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_1ea55c258d63be75be732bcf3f7fa9d0c4f48668.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_c2b5f2811d94cd980a537281e83f2fecb0afd047.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_4413d6ee11a8d8e92f21b8601659b5012a1c1e81.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_06ec875936967e6a407fed3a8723a65fb98a681f.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_5f12e1c0c79f2f9b0d0c33ad82b7947c5d2fba7e.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_40dffa7bdb63e62b02bf16e3c2d32b409e749061.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_272b541d7da883ee8680cdbcc19e9eef569cb82e.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_de629cff38f9c6be60a7caa350020c4bbd1915c1.fastq...ok
Checking /mnt/lustre/macmaneslab/devon/nanoPore/data/reads/mylu/bat13/albout/run3/workspace/pass/fastq_runid_b5c1afe9d8c9f013ecae46acd988f7945a4fb035.fastq...ok
Checking sequence target file mylu.fa...ok

=>Reading contig/sequence assembly file : Mon Jun 18 11:42:15 EDT 2018
Building a Bloom filter using 23-mers derived from sequences in -f mylu.fa...

warrenlr commented 6 years ago

Building the Bloom filter is usually the fast part...

I suggest you use the writeBloom.pl utility in the tools folder. This should give you some information about the health of the Bloom::Filter module you compiled (and you can try running tests on your system using smaller genomes). Keep me posted. Rene
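
For a quick smoke test, something like the following could work, assuming writeBloom.pl takes the same -f/-k/-p options as LINKS itself (running the script with no arguments should print its actual usage); small_test.fa is just a placeholder:

# Hypothetical sanity check of the Bloom::Filter build on a small FASTA.
cd tools/
./writeBloom.pl -f small_test.fa -k 23 -p 0.001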

devonorourke commented 6 years ago

It looks like the program installed by our admin generates this error when I run the writeBloom.pl script:

-bash: ./writeBloom.pl: /home/martink/perl/5.10.0/bin/perl: bad interpreter: No such file or directory

Perhaps that's an indication that the compilation wasn't successful?

warrenlr commented 6 years ago

The above is caused by the script looking for an interpreter that doesn't exist. This was my error; I had forgotten to replace the first line.

In commit ed8ad09 I replaced the shebang line with /usr/bin/env perl. Hopefully it points to the same perl version as the one used to build Bloom::Filter.
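
Until the installed copy is updated, a local workaround is to point the script at whatever perl is on your PATH (a sketch, run from the LINKS tools directory):

# Run the script through the system perl explicitly, bypassing the broken shebang:
perl ./writeBloom.pl
# Or patch the first line of a local copy, mirroring the fix in commit ed8ad09:
sed -i '1s|^#!.*|#!/usr/bin/env perl|' writeBloom.pl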

devonorourke commented 6 years ago

Hi Rene, while I'm waiting on my system administrator to edit the shebang, I ran a local install of v1.8.5 and it completed without issue. I went from 11,654 scaffolds to 11,638; I'm surprised that was all that was tightened up, but perhaps I can now go back and determine which settings are best to modify. Thanks

warrenlr commented 6 years ago

The power of LINKS is in scaffolding iteratively, using various distance constraints.

1) Running LINKS multiple times (best when working with large long-read datasets and/or very large genomes)

e.g.

nohup ./runIterativeLINKS.sh 17 beluga.fa 5 0.3 &

./LINKS -f $2 -s ont.fof -b links1 -d 1000 -t 10 -k $1 -l $3 -a $4
./LINKS -f links1.scaffolds.fa -s ont.fof -b links2 -d 2500 -t 5 -k $1 -l $3 -a $4 -o 1 -r links1.bloom
./LINKS -f links2.scaffolds.fa -s ont.fof -b links3 -d 5000 -t 5 -k $1 -l $3 -a $4 -o 2 -r links1.bloom
./LINKS -f links3.scaffolds.fa -s ont.fof -b links4 -d 7500 -t 4 -k $1 -l $3 -a $4 -o 3 -r links1.bloom
./LINKS -f links4.scaffolds.fa -s ont.fof -b links5 -d 10000 -t 4 -k $1 -l $3 -a $4 -o 4 -r links1.bloom
./LINKS -f links5.scaffolds.fa -s ont.fof -b links6 -d 12500 -t 3 -k $1 -l $3 -a $4 -o 5 -r links1.bloom
./LINKS -f links6.scaffolds.fa -s ont.fof -b links7 -d 15000 -t 3 -k $1 -l $3 -a $4 -o 6 -r links1.bloom
./LINKS -f links7.scaffolds.fa -s ont.fof -b links8 -d 30000 -t 2 -k $1 -l $3 -a $4 -o 7 -r links1.bloom
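
For readers following along, here is a hypothetical skeleton of runIterativeLINKS.sh, inferred from the example invocation and the eight calls above (not the actual script):

#!/bin/bash
# Hypothetical skeleton only; the positional arguments are inferred from the
# example invocation (./runIterativeLINKS.sh 17 beluga.fa 5 0.3).
k=$1       # k-mer size passed to -k (17 in the example)
draft=$2   # draft assembly FASTA for the first pass (-f), e.g. beluga.fa
l=$3       # value passed to -l (5)
a=$4       # value passed to -a (0.3)
./LINKS -f "$draft" -s ont.fof -b links1 -d 1000 -t 10 -k "$k" -l "$l" -a "$a"
# ...the remaining seven passes follow the pattern above: each scaffolds the
# previous links<N>.scaffolds.fa with a larger -d, a smaller -t, an
# incremented -o, and -r links1.bloom to reuse the first pass's Bloom filter.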

2) Running LINKS iteratively in a single command (works best when genomes are small / RAM is not limiting)

./LINKS -f $2 -s ont.fof -b links1 -d 1000,2500,5000,7500,10000,12500,15000,30000 -t 10,5,5,4,4,3,3,2 -k $1 -l $3 -a $4

*The value of -t determines the number of k-mer pairs extracted and, incidentally, the RAM used; this will vary from one dataset to the next. I also recommend correcting the ONT reads if you can: it will allow you to choose higher k values and increase specificity. If not, no big deal (but you are stuck choosing k 15-17).

devonorourke commented 6 years ago

Hi again René,

I've gained access to a computer with a lot of memory and recently ran a LINKS job following what I think emulates the parameters you suggested above in the beluga.fa example. Perhaps I'm mistaken about some critical parameters, but I believe I've set up the job as you described...

The job failed despite my allocating 1.2 TB of memory, which makes me think I'm doing something wrong right at the outset - do most LINKS jobs require this much memory? The job failed during scaffolding, after building the initial Bloom filter. The command was:

Running: /pylon5/mc3bg6p/devono/software/links_v1.8.5/LINKS [v1.8.5]
-f /pylon5/mc3bg6p/devono/alignments/mylu/mylu_bwa/mylu.fa
-s /pylon5/mc3bg6p/devono/assembly/links/mylu/mylu.fof
-m 
-d 1000
-k 15
-e 0.1
-l 5
-a 0.3
-t 10
-o 0
-z 500
-b mylu1000
-r 
-p 0.001
-x 0

The job stopped after about 12 hours. The tail of the .log file suggested that the issue was exceeding my 1.2 TB memory allocation. There appear to be many, many k-mers extracted, which has me wondering whether I should be altering something in the command above.

Extracted 7481580 15-mer pairs, from all 5008903 sequences provided in /pylon5/mc3bg6p/devono/assembly/links/mylu/mylu.fof

Extracted 2556027163 15-mer pairs overall. This is the set that will be used for scaffolding

=>Reading sequence contigs (to scaffold), tracking k-mer positions: Wed Jul  4 21:11:58 EDT 2018
Contigs (>= 500 bp) processed k=15:
418slurmstepd: error: Job 3427612 exceeded memory limit (1258291556 > 1258291200), being killed
slurmstepd: error: Exceeded job memory limit

This was to be the first of several iterations, as you've suggested. I wonder whether the first pass, given the parameters used above, is the most memory-intensive of all? This was my overall intention:

LINKSPATH=/pylon5/mc3bg6p/devono/software/links_v1.8.5/LINKS
INFASTA=/pylon5/mc3bg6p/devono/alignments/mylu/mylu_bwa/mylu.fa
FOFPATH=/pylon5/mc3bg6p/devono/assembly/links/mylu/mylu.fof

$LINKSPATH -f $INFASTA -s $FOFPATH -b mylu1000 -d 1000 -t 10 -k 15 -l 5 -a 0.3
$LINKSPATH -f mylu1000.scaffolds.fa -s $FOFPATH -b mylu2500 -d 2500 -t 5 -k 15 -l 5 -a 0.3 -r mylu1000.bloom
$LINKSPATH -f mylu2500.scaffolds.fa -s $FOFPATH -b mylu5000 -d 5000 -t 5 -k 15 -l 5 -a 0.3 -r mylu1000.bloom
$LINKSPATH -f mylu5000.scaffolds.fa -s $FOFPATH -b mylu7500 -d 7500 -t 4 -k 15 -l 5 -a 0.3 -r mylu1000.bloom
$LINKSPATH -f mylu7500.scaffolds.fa -s $FOFPATH -b mylu10000 -d 10000 -t 4 -k 15 -l 5 -a 0.3 -r mylu1000.bloom
$LINKSPATH -f mylu10000.scaffolds.fa -s $FOFPATH -b mylu12500 -d 12500 -t 3 -k 15 -l 5 -a 0.3 -r mylu1000.bloom
$LINKSPATH -f mylu12500.scaffolds.fa -s $FOFPATH -b mylu15000 -d 15000 -t 3 -k 15 -l 5 -a 0.3 -r mylu1000.bloom
$LINKSPATH -f mylu15000.scaffolds.fa -s $FOFPATH -b mylu30000 -d 30000 -t 2 -k 15 -l 5 -a 0.3 -r mylu1000.bloom

Any suggestions on how to modify parameters to reduce memory usage would be greatly appreciated. My advisor suspects that the issue is the number of unique k-mers, and that this is likely due to the high error rate in the Nanopore data itself. Perhaps this speaks to your recommendation of trying to error-correct these reads before running LINKS. I'm curious whether you've used LINKS with much ONT data before, and if so, whether there are suggested best practices to follow before running the program.

Thanks René!

warrenlr commented 6 years ago

You'll have to stay under 1 TB.

This is the tricky part, because it varies based on long read quality, coverage and the size of your genome.

To achieve this, you will have to tune -t. Yes, you seem to have a lot of ONT reads, hence the large set of extracted k-mer pairs (2,556,027,163).

There are guidelines in the readme on how to curb memory usage; essentially, start off with a high step size (-t) at short distances.

I recommend using -t 150 with -d 1000 to begin. Monitor the memory usage and adjust -t accordingly. As -d increases, -t must decrease (otherwise you'll end up with too few pairs for scaffolding over larger k-mer distances). Also, you might have to adjust -k. The sweet spot will be somewhere around k 15-19 for your 2 Gb genome (assuming raw ONT reads).
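
Concretely, a first pass along those lines could look like this (a sketch only: -b is an arbitrary base name, and the file names echo the earlier run):

./LINKS -f mylu.fa -s mylu.fof -b mylu1 -d 1000 -t 150 -k 15 -l 5 -a 0.3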

devonorourke commented 6 years ago

Thanks René,

The latest run using -t 150 and -d 1000 completed; this resulted in merging just one pair of contigs among around 70,000. I'm wondering where to go from here.

The job extracted about 171M 15-mer pairs overall; in addition, there were about 500k unique k-mer pairs. What surprises me is that there was only one instance in which a pair of contigs could be joined: of the 72,784 contigs, the *assembly_correspondence.tsv file identifies only one overlap:

LINKS_scaffold_ID   LINKS_contig_ID original_name   orientation(f=forward/r=reverse)    number_of_links links_ratio gap_or_overlap(-)
scaffold65692   54658   contig_54657    f   8   0   -80
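
As a hedged way to tally joins like this, assuming the correspondence file is tab-separated with the header shown above (number_of_links in column 5), something along these lines would count rows that record a join:

# Count data rows with a positive number_of_links; the glob stands in for the run's actual .tsv.
awk -F'\t' 'NR > 1 && $5 ~ /^[0-9]+$/ && $5+0 > 0' *assembly_correspondence.tsv | wc -l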

Perhaps you have a suggestion for what the next parameter changes should be. In your previous post you mentioned this decision might be informed by the memory consumption of the first run. Unfortunately, I haven't found a programmatic way to monitor memory usage on this cluster as the run progresses; however, I can pull together a summary report for the job. Here's a sample:

       JobID    JobName    Elapsed     MaxRSS     MinCPU     AveCPU    AveDiskRead  MaxDiskRead   AveDiskWrite MaxDiskWrite  AveVMSize  MaxVMSize     ReqMem   TotalCPU 
------------ ---------- ---------- ---------- ---------- ---------- -------------- ------------ -------------- ------------ ---------- ---------- ---------- ---------- 
3496821         LUlinks   06:29:11                                                                                                                     840Gn   06:29:52 
3496821.bat+      batch   06:29:11      6832K   00:00:06   00:00:06         38.42M       38.42M         37.98M       37.98M    390808K    390808K      840Gn  00:06.123 
3496821.ext+     extern   06:29:12      1200K   00:00:00   00:00:00          0.00M        0.00M              0            0    107904K    140452K      840Gn  00:00.002 
3496821.0         LINKS   06:29:04 233672268K   06:28:20   06:28:20      73874.25M    73874.25M       9046.68M     9046.68M          0 234208308K      840Gn   06:29:46

The bloom filter itself isn't particularly large:

Bloom filter specs
elements=677791916
FPR=0.001
size (bits)=9745012672
hash functions=9
...
=>Writing Bloom filter to disk (mylu1.bloom): Fri Jul  6 01:33:23 EDT 2018
Storing filter. Filter is 1218126584 bytes.
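
Those specs are consistent with the standard Bloom filter sizing formula (general background, not something stated in this thread):

# bits = -n * ln(p) / (ln 2)^2, with n = 677,791,916 elements and p = 0.001
awk 'BEGIN { n=677791916; p=0.001; m=-n*log(p)/(log(2)^2); printf "%.0f bits (~%.0f bytes)\n", m, m/8 }'
# prints roughly 9.74e9 bits (~1.22e9 bytes), matching the 9745012672 bits / 1218126584 bytes above.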

And the reads themselves shouldn't change in terms of their total memory allocation moving forward (assuming I use the same reads):

=>Reading long reads, building hash table: Fri Jul  6 01:33:24 EDT 2018
Reads processed k=15, dist=1000, offset=0 nt, sliding step=150 nt:
... (a lot of "Reads processed from file" lines) ...
Reads processed from file 46/46, /pylon5/mc3bg6p/devono/reads/nanopore/mylu/albout/run4/fastq_runid_e51c6e4c8f8ae0451a5b38a5123c82bde7f5b36a.fastq:
5008903
Extracted 501246 15-mer pairs, from all 5008903 sequences provided in /pylon5/mc3bg6p/devono/assembly/links/mylu/mylu.fof
Extracted 171359519 15-mer pairs overall. This is the set that will be used for scaffolding

I'm still unclear on how to relate my measured total memory consumption (~230 Mb, I think) to the processes that actually consume it (presumably the extracted k-mers and the ONT data?). I'm curious what recommendations you might make for the next iteration, and I'm wondering how altering the -t and -d parameters will change the next iteration's memory usage. Do they scale linearly?

Thanks for your consideration

warrenlr commented 6 years ago

Glad it's working in your hands. FYI, the Bloom filter only contains your draft assembly k-mers, so it is not expected to be very large.

For monitoring resource usage (time and peak memory) you may use: /usr/bin/time -v -o mylu1000.time ./LINKS -f XX blah
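
On a Slurm cluster you can also query the scheduler's accounting layer; a small sketch using the job ID from the table above (sstat reports on steps of a running job, sacct on completed ones):

# Peak memory of a running job step:
sstat -j 3496821 --format=JobID,MaxRSS,MaxVMSize
# The same job after completion:
sacct -j 3496821 --format=JobID,Elapsed,MaxRSS,ReqMem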

Based on what you've told me (-t 10 exceeded memory and -t 150 uses 230 GB, not MB), you may wish to start at -d 1000 with -t 100 or -t 75. You'll just have to try. Then, halve -t for each doubling of -d (that's a very rough heuristic); it depends a lot on your read set size and length distribution (you may wish to run abyss-fac on your ONT reads).

Note: we rarely start at -d 1000 for large genomes (unless we have low read coverage).
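
Putting that heuristic into a sketch (values are purely illustrative; pick the starting -d/-t and the number of passes based on the note above and your observed memory usage):

#!/bin/bash
# Hypothetical iterative schedule: -d doubles each pass while -t is halved,
# per the rough rule above. Names mirror the earlier mylu runs.
DRAFT=mylu.fa    # draft assembly for the first pass (-f)
FOF=mylu.fof     # file-of-filenames listing the ONT reads (-s)
DIST=(1000 2000 4000 8000 16000 32000)
STEP=(100 50 25 12 6 3)

PREV=$DRAFT
for i in "${!DIST[@]}"; do
  BASE=mylu_d${DIST[$i]}
  EXTRA=""
  # Later passes reuse the first pass's Bloom filter (-r) and offset the
  # k-mer pair extraction (-o), as in the beluga example earlier in the thread.
  [ "$i" -gt 0 ] && EXTRA="-o $i -r mylu_d${DIST[0]}.bloom"
  ./LINKS -f "$PREV" -s "$FOF" -b "$BASE" -d "${DIST[$i]}" -t "${STEP[$i]}" \
          -k 15 -l 5 -a 0.3 $EXTRA
  PREV=${BASE}.scaffolds.fa
done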

Now that you have it working, it is only a matter of parameterizing (-d/-t first, then -k, then -a/-l). I'll be away for 3 weeks starting today; good luck!