DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

HISAT-3N indexing with SNPs never seems to have enough memory? #332

Open aleighbrown opened 2 years ago

aleighbrown commented 2 years ago
.........
COUNT NUMBER IN EACH BIN: 22
FINISHED FIRST ROUND: 38
1 3017904516
0 0
0 0
0 0
FINISHED RECURSIVE SORTS: 1142
SORT NODES: 1202
MERGE, UPDATE RANK: 156
Generation 4 (3017904516 -> 3017902501 nodes, 42907013 ranks)
Out of memory while constructing suffix array.  Please try using a smaller
number of blocks by specifying a smaller --bmax or a larger --bmaxdivn
Total time for call to driver() for forward index: 01:46:41
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -p 4 --bmaxdivn 128 --base-change T,C --snp /SAN/vyplab/vyplab_reference_genomes/hisat-3n/wtc11_vcf.snp --3N /SAN/vyplab/vyplab_reference_genomes/sequence/human/gencode/GRCh38.primary_assembly.genome.fa /SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp 
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.1.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.2.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.3.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.4.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.5.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.6.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.7.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.8.ht2l" file written during aborted indexing attempt.

Below is my submission script

#!/bin/bash
#Submit to the cluster, give it a unique name
#$ -S /bin/bash

#$ -cwd
#$ -V
#$ -l h_vmem=70G,h_rt=72:00:00,tmem=70G
#$ -pe smp 4

# join stdout and stderr output
#$ -j y
#$ -R y

/SAN/vyplab/alb_projects/tools/hisat-3n/hisat-3n-build -p 4 \
--bmaxdivn 128 \
--large-index --base-change T,C \
--snp /SAN/vyplab/vyplab_reference_genomes/hisat-3n/wtc11_vcf.snp \
/SAN/vyplab/vyplab_reference_genomes/sequence/human/gencode/GRCh38.primary_assembly.genome.fa  /SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp

At this point I've given it 280G of memory and I keep increasing the --bmaxdivn but without success.

/SAN/vyplab/vyplab_reference_genomes/hisat-3n/wtc11_vcf.snp is a SNP file of the called germline variants of the cell line we're using, which I generated with the included script.

It has 5,026,503 lines.

I wonder if it's not really been designed to deal with that many SNPs?

imzhangyun commented 2 years ago

Hello @aleighbrown ,

From this line: #$ -l h_vmem=70G,h_rt=72:00:00,tmem=70G, it looks like you limited the memory usage to 70GB. I recommend setting the memory limit to 250GB.

Leo

aleighbrown commented 2 years ago

Hi Leo,

It's 4 x 70 actually: the line below it (#$ -pe smp 4) requests 4 slots.
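
The 4 x 70 arithmetic can be sketched as below. This assumes the cluster applies h_vmem per slot, which is a common Grid Engine setup but depends on the site configuration; the helper name total_h_vmem_gb is made up for illustration, and it only understands values with a G suffix.

```shell
# Hypothetical helper: multiply the h_vmem value from a "#$ -l" resource
# list by the slot count from "#$ -pe smp". Assumes h_vmem is per slot.
total_h_vmem_gb() {
    # $1 = resource list, $2 = number of slots
    printf '%s\n' "$1" | tr ',' '\n' |
        awk -F= -v slots="$2" '$1 == "h_vmem" { sub(/G$/, "", $2); print $2 * slots }'
}

total_h_vmem_gb 'h_vmem=70G,h_rt=72:00:00,tmem=70G' 4   # prints 280
```

Whether a single process may actually use the full total is a separate question that only the cluster's configuration can answer.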

On Tue, Oct 26, 2021, 7:58 PM Yun (Leo) Zhang wrote:

Hello @aleighbrown ,

From this line: #$ -l h_vmem=70G,h_rt=72:00:00,tmem=70G, it looks like you limited the memory usage to 70GB. I recommend setting the memory limit to 250GB.

Leo


imzhangyun commented 2 years ago

Hello @aleighbrown ,

I am not very familiar with the job submission system you are using, so I am not sure whether HISAT-3N can access the whole 280GB (4 * 70GB) of memory. Is it possible to provide the memory usage information for this build?

Leo
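
One way to collect that memory usage information, as a rough sketch: if GNU time is available on the node, running the build under "/usr/bin/time -v -o build.time.log" writes a report that includes a "Maximum resident set size (kbytes)" line. The log file name and the helper peak_rss_gb below are assumptions made for illustration. On Grid Engine, "qacct -j" with the job ID usually reports a maxvmem figure as well, if accounting is enabled.

```shell
# Assumed usage (not from the thread): wrap the existing build command, e.g.
#   /usr/bin/time -v -o build.time.log <hisat-3n-build command from the job script>
#
# Hypothetical helper: read a GNU time -v report on stdin and print the
# peak resident set size in GB (the report gives it in kilobytes).
peak_rss_gb() {
    awk -F': ' '/Maximum resident set size/ { printf "%.1f\n", $2 / 1024 / 1024 }'
}

# Demo with a made-up report line; 268435456 kB is exactly 256 GB.
printf '\tMaximum resident set size (kbytes): 268435456\n' | peak_rss_gb   # prints 256.0
```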

aleighbrown commented 2 years ago

Hi there Leo,

Still seems to error out at the same spot. See output below.

Do you think I should sample down the SNPs I'm inputting? There are a lot of intergenic SNPs.

Error message:

MERGE, UPDATE RANK: 242
Generation 4 (3017904516 -> 3017902501 nodes, 42907013 ranks)
Out of memory while constructing suffix array.  Please try using a smaller
number of blocks by specifying a smaller --bmax or a larger --bmaxdivn
Total time for call to driver() for forward index: 05:42:55
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 --bmaxdivn 246 --base-change T,C --snp /SAN/vyplab/vyplab_reference_genomes/hisat-3n/wtc11_vcf.snp --3N /SAN/vyplab/vyplab_reference_genomes/sequence/human/gencode/GRCh38.primary_assembly.genome.fa /SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp 
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.1.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.2.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.3.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.4.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.5.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.6.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.7.ht2l" file written during aborted indexing attempt.
Deleting "/SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp.3n.CT.8.ht2l" file written during aborted indexing attempt.

Submission script

(base) [annbrown@morecambe2 ~]$ cat hisat3index.sh
#!/bin/bash
#Submit to the cluster, give it a unique name
#$ -S /bin/bash

#$ -cwd
#$ -V
#$ -l h_vmem=250G,h_rt=72:00:00,tmem=250G

# join stdout and stderr output
#$ -j y
#$ -R y

/SAN/vyplab/alb_projects/tools/hisat-3n/hisat-3n-build  \
--bmaxdivn 246 --large-index --base-change T,C --snp /SAN/vyplab/vyplab_reference_genomes/hisat-3n/wtc11_vcf.snp \
/SAN/vyplab/vyplab_reference_genomes/sequence/human/gencode/GRCh38.primary_assembly.genome.fa  /SAN/vyplab/vyplab_reference_genomes/hisat-3n/human/with_snp

I'm trying again with 350 GB, but it might take a while before it gets through the queue and I find out whether it succeeds or not.

imzhangyun commented 2 years ago

Hello @aleighbrown ,

I am really sorry about this problem. Could you downsample the SNPs and try again?

Best, Leo

aleighbrown commented 2 years ago

No worries Leo - thanks for the quick replies

I removed all the intergenic SNPs, which brought me down to 3,193,923 SNPs (which may still be too many; we shall see).
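
If that is still too many, a further even thinning of the .snp file could be sketched as below. It keeps one record out of every N, which spreads the reduction along the file rather than dropping a contiguous chunk. The helper name thin_snps, the demo records and the output file name are made up for illustration, and this is not the only sensible way to downsample.

```shell
# Hypothetical helper: keep every N-th record of a .snp file read on stdin
# (records 1, N+1, 2N+1, ...), one SNP per line.
thin_snps() {
    # $1 = keep one record out of every N
    awk -v n="$1" '(NR - 1) % n == 0'
}

# Demo on four made-up records (ID, type, chromosome, position, alternative base).
printf 'snp1\tsingle\tchr1\t100\tC\nsnp2\tsingle\tchr1\t105\tT\nsnp3\tsingle\tchr1\t110\tG\nsnp4\tsingle\tchr1\t115\tA\n' |
    thin_snps 2   # prints the snp1 and snp3 records
```

On the real file this would be something like "thin_snps 2 < wtc11_vcf.snp > wtc11_vcf.thinned.snp", with the thinned file then passed to --snp.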

What does HISAT-3N do with the SNP information? Does it use it for the Yf tag (e.g. so that a T>C SNP isn't called as a conversion with --base-change T,C)?

imzhangyun commented 2 years ago

Hello @aleighbrown ,

When I was testing HISAT-3N, I built the human graph index with 14,000,000 SNPs using 256GB of memory. The memory usage depends on two things: the number of SNPs and the complexity of the graph. In your case, I guess the graph you are building is very complicated in a small region and consumes a lot of memory.

HISAT-3N uses the SNP information during alignment to make the alignment more accurate. The Yf tag is calculated without SNP information.

Best, Leo

aleighbrown commented 2 years ago

Ah so the Yf tag would include SNPs?

Thanks Leo - I'll try a few more tweaks, we'll see if upping memory does anything as well.

Are there any other flags I could try to play around with?

imzhangyun commented 2 years ago

Let me give an example of the Yf tag calculation:

For --base-change T,C:

                      /C\ (SNP information for the first T)
Reference sequence: ACGTT
Read sequence:      ACGCT

In this case, HISAT-3N outputs Yf:i:1, because we don't know whether the C at position 4 (1-based) is a conversion or a known SNP. HISAT-3N calculates the Yf tag by comparing the read sequence with the reference sequence, without using SNP information.
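
The comparison described above can be sketched as a small counting function. This is only an illustration of the idea, not HISAT-3N's actual code: it assumes an ungapped alignment of equal-length strings on the same strand as the reference, and the name count_conversions is made up.

```shell
# Hypothetical helper: count positions where the reference has the "from"
# base and the read has the "to" base. SNP information is deliberately
# ignored, as in the Yf description above.
count_conversions() {
    # $1 = reference, $2 = read, $3 = from base, $4 = to base
    awk -v ref="$1" -v seq="$2" -v from="$3" -v to="$4" 'BEGIN {
        n = 0
        for (i = 1; i <= length(ref); i++)
            if (substr(ref, i, 1) == from && substr(seq, i, 1) == to)
                n++
        print n
    }'
}

count_conversions ACGTT ACGCT T C   # prints 1, matching Yf:i:1 in the example
```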

Sorry, there are no other flags you can change to reduce memory usage during graph index building.

Leo