AnnabelPerry / Polly

GNU General Public License v3.0

Segmentation Fault in Grace Cluster Slurm Jobs #7

Closed AnnabelPerry closed 3 years ago

AnnabelPerry commented 3 years ago

I'm battling more supercomputer demons this morning - I'm getting segmentation faults when I run jobs to collect runtime info from the MicroGenotyper() function:


 *** caught segfault ***
address 0x202568ca527, cause 'memory not mapped'

Traceback:
 1: MicroGenotyper(bams, "/scratch/user/annabelperry/PollyRuntimes/InputFiles/Edited_Birch_Lookup_Table.csv",     scaffold_vector, output_names)
An irrecoverable exception occurred. R is aborting now ...
/sw/hprc/sw/R_tamu/bin/Rscript: line 75: 102073 Segmentation fault      (core dumped) ${EBROOTR}/bin/Rscript ${ARGS[@]}
rm: cannot remove 'aligned_SRR6511793.bam': No such file or directory

When I run the seff command on the Slurm job, it shows that the job did not use all of the requested memory:

[annabelperry@grace1 5]$ seff 719533 
Job ID: 719533
Cluster: grace
User/Group: annabelperry/annabelperry
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 80
CPU Utilized: 01:55:19
CPU Efficiency: 1.20% of 6-16:21:20 core-walltime
Job Wall-clock time: 02:00:16
Memory Utilized: 1.86 TB
Memory Efficiency: 65.04% of 2.86 TB

Since I am just trying to collect the script's runtime, I delete the input and output immediately after running (to save space). The input (aligned_SRR6511793.bam) is in the /scratch/user/annabelperry/PollyRuntimes/InputFiles/ directory, as shown in the R script, but when I called rm in the job script I forgot to include the full path, which is why you see a "No such file or directory" error. That error is not the cause of the segmentation fault, though: the segfault occurs while the R script is still running, and the R script itself uses the correct path (see below).

library("Polly")

setwd("/scratch/user/annabelperry/PollyRuntimes/MicroGenotyper")

scaffold_vector <- c("ScyDAA6_1508_HRSCAF_1794", "ScyDAA6_1196_HRSCAF_1406",
                     "ScyDAA6_5987_HRSCAF_6712", "ScyDAA6_8_HRSCAF_51",
                     "ScyDAA6_1107_HRSCAF_1306", "ScyDAA6_2393_HRSCAF_2888",
                     "ScyDAA6_1592_HRSCAF_1896", "ScyDAA6_1439_HRSCAF_1708",
                     "ScyDAA6_1854_HRSCAF_2213", "ScyDAA6_10_HRSCAF_60",
                     "ScyDAA6_11_HRSCAF_73", "ScyDAA6_695_HRSCAF_847",
                     "ScyDAA6_1934_HRSCAF_2318", "ScyDAA6_5078_HRSCAF_5686",
                     "ScyDAA6_5984_HRSCAF_6694", "ScyDAA6_2469_HRSCAF_2980",
                     "ScyDAA6_1473_HRSCAF_1750", "ScyDAA6_5983_HRSCAF_6649",
                     "ScyDAA6_1859_HRSCAF_2221", "ScyDAA6_2_HRSCAF_26",
                     "ScyDAA6_7_HRSCAF_50", "ScyDAA6_2113_HRSCAF_2539",
                     "ScyDAA6_2188_HRSCAF_2635", "ScyDAA6_932_HRSCAF_1100")

bams <- c("/scratch/user/annabelperry/PollyRuntimes/InputFiles/aligned_SRR6511793.bam")

output_names <- c("MGR-F4.csv")

ptm <- proc.time()

MicroGenotyper(bams, "/scratch/user/annabelperry/PollyRuntimes/InputFiles/Edited_Birch_Lookup_Table.csv", scaffold_vector, output_names)

MicroGenotyperRunTime <- proc.time() - ptm

cat("\nRuntime for Microgenotyper on Bam File 4: \n")

print(MicroGenotyperRunTime)

The sysadmins are shutting Grace down for maintenance all day tomorrow. Hopefully this is one of the issues they're going to fix.

Originally posted by @AnnabelPerry in https://github.com/AnnabelPerry/Polly/issues/4#issuecomment-901096273

eddelbuettel commented 3 years ago

Right at the top you have rm: cannot remove 'aligned_SRR6511793.bam': No such file or directory

What happens when you don't do that?

(In any event, it seems like you transferred over into the territory of 'random bugs during development and use' rather than the nastier 'omg I cannot build this thing' ...)

Both require similar debugging skills, I find. Decompose and decompose and decompose ... into smaller and smaller tasks until the issue screams at you.

AnnabelPerry commented 3 years ago

Originally, I ran the job without the rm aligned_SRR6511793.bam line and got the same segmentation fault. Since I was running multiple jobs at once, each with very large inputs and outputs, I quickly ran out of space on the cluster. I thought the segmentation fault might be due to this lack of space, so I added the rm aligned_SRR6511793.bam line to free up space after each job completed. However, I forgot to add the full path to aligned_SRR6511793.bam, which is why I got that error in addition to the segmentation fault. I'm going to try running with the correct path and see if that fixes the issue. The job took ~2 hours to reach the segfault last time, so it'll probably be about that long before I have an answer.

AnnabelPerry commented 3 years ago

Here are my old job specs (I'm adding the directory and re-running right now). I'm running this with the largest possible Grace specs since I'm dealing with a whole genome:

#!/bin/bash
#SBATCH --export=NONE
#SBATCH --get-user-env=L
#SBATCH --job-name=MGTest4
#SBATCH --time=10:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=80    
#SBATCH --mem=2929G
#SBATCH --partition=bigmem
#SBATCH --output=OUT_MGTest4.%j
#SBATCH --error=ERROR_MGTest4.%j
#SBATCH --mail-type=ALL
#SBATCH --mail-user=annabelperry@tamu.edu

module load GCC/10.2.0
module load OpenMPI/4.0.5
module load R_tamu/4.0.3
module load HTSlib/1.11

cd /scratch/user/annabelperry/PollyRuntimes/MicroGenotyper

Rscript MGTest4.R

rm aligned_SRR6511793.bam

rm MGR-F4.csv

eddelbuettel commented 3 years ago

You will need to allow me to step away from this as it no longer has anything to do with Rcpp, but briefly:

  1. Replace an unconditional rm file with rm -f file -- the 'force' mode makes it not complain

  2. Alternatively, test for the file: test -f aligned_SRR6511793.bam && rm aligned_SRR6511793.bam

  3. You said you didn't know MPI. You are using it here via OpenMPI/4.0.5 (and HTSlib may use it...)
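The two cleanup variants above can be shown side by side in a throwaway sketch (the temp directory and file here are stand-ins for the real paths, not part of the actual job script):

```shell
# Stand-in setup: a scratch directory with a dummy input file.
tmpdir=$(mktemp -d)
touch "$tmpdir/aligned_SRR6511793.bam"

# Variant 1: rm -f exits 0 and prints nothing even if the file is gone.
rm -f "$tmpdir/aligned_SRR6511793.bam"
rm -f "$tmpdir/aligned_SRR6511793.bam"   # second call: no error, still exit 0

# Variant 2: only attempt the removal if the file actually exists.
touch "$tmpdir/aligned_SRR6511793.bam"
test -f "$tmpdir/aligned_SRR6511793.bam" && rm "$tmpdir/aligned_SRR6511793.bam"

echo "cleanup ok"
```

Either variant keeps the "No such file or directory" noise out of the job's error log, which makes the real failure (the segfault) easier to spot.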

AnnabelPerry commented 3 years ago

Ok, thank you for all of your help! I'll see if the supercomputer people can help solve this. Also, I've written the introduction and materials & methods for the paper. My PI has already reviewed them. Once I have these bugs worked out and the runtimes collected, I'll finish the results, discussion, and abstract, then we're going to submit the paper to Molecular Ecology Resources. I can send them your way if you'd like to see them before we send it off.

eddelbuettel commented 3 years ago

If there is anything in the paper on the package design or deployment or build I can help with (as you offered a contributor role) I would be more than happy to look it over in pdf or, if the paper is in latex or markdown, source form. Word is trickier.

eddelbuettel commented 3 years ago

(To a first approximation, a seg fault is often a logic fault. You may need smaller and smaller code examples to see it. Hard to tell. Ideally, also check on your laptop if you can, but if the sequence data is too large ... that is tricky. The Bioconductor people are nice and friendly and have an open Slack; I hang out there for some things...)

AnnabelPerry commented 3 years ago

Ok, the paper is in Word right now. I don't have anything in the paper on how the R package is built - I mostly talk about how the functions work. My PI and I were thinking of putting you in as second author, since you've helped me more than anyone else has.

As far as the segfault goes, I've tested the raw C++ code on very small input files to manually check that the outputs were correct (I've saved very detailed methods on exactly how I did this), and I've also tested the C++ code on the same input files I'm using here (I've been working on this program for a year now). I actually have the final output from when I ran the original C++ program on all my input files. I also ran MicroGenotyper() (the function I'm testing here) successfully on very small input files on Terra and manually checked the output. The only differences with these job runs are that 1. I'm using Grace rather than Terra and 2. I'm using the R-wrapped functions on large inputs rather than the C++ functions on large inputs or the R-wrapped functions on small inputs.

The sysadmins said they're working on "storage issues" in Grace tomorrow. I'm thinking these weird memory bugs may be sorted out after that, but I'm still going to sit here and try to troubleshoot for at least a little while today (I'm at my parents' house in rural Texas, so the other option is to stare at the cows...). Thanks for the tips! I'll check out the Bioconductor Slack.

eddelbuettel commented 3 years ago

(I'm on Linux so pdf may work; else OpenOffice can render.) I still think it is very generous of you two to put me on which is why I should at least read it too :)

eddelbuettel commented 3 years ago

Yeah, debugging gets harder once it gets input-dependent. But maybe between laptop, Terra and Grace you can triage?

The other part, which I can (and should, in the 'that is how it is done' spirit) show you, is how to add simple unit tests to the package. They would run on GitHub Actions too if we want to turn that on. Is there a 'small-sized' bam file we could either embed in the package or download as needed?

AnnabelPerry commented 3 years ago

Ok! I can convert the paper to PDF. And of course - I'm really grateful for the time and effort you've invested in my project. My PI moved to Italy this summer (he accepted a job there), so it's been very stressful for me to finish this project without any assistance (especially since I had an REU at a different university earlier in the summer so I had <0 time to work on wrapping Polly haha).

The smallest bam file I have for our model organism is 8.92 GB. If I can find an open-access scaffolded genome for a species with a smaller genome (like C. elegans or something), I could create a smaller bam using a file from the NCBI SRA. Would this be helpful for people downloading the package?

Just as an update, 3 of the 19 jobs I ran using MicroGenotyper() did work (I ran one on each of my 19 bam files because I plan on making a "file size vs. runtime" graph for this function). These three inputs were 10.92 GB, 12.96 GB, and 13.03 GB in size. My bam files range from 8.92 GB to 22.20 GB. Since jobs running on smaller or comparably sized inputs still failed, I don't think there was truly an issue with memory allocation. I think something may be messed up behind the scenes on Grace, or there may have been corruption during file transfer.
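A minimal sketch of that kind of size-vs-outcome triage: list each input BAM with its size (or flag it as missing) so failures can be lined up against input size. The filenames and sizes below are placeholders, not the real 8.92-22.20 GB inputs:

```shell
# Stand-in setup: one present file and one deliberately missing file.
tmpdir=$(mktemp -d)
printf 'dummy' > "$tmpdir/aligned_present.bam"   # 5-byte stand-in for a real BAM

# Report each candidate input: its size in bytes, or "missing".
for f in "$tmpdir/aligned_present.bam" "$tmpdir/aligned_absent.bam"; do
  if [ -f "$f" ]; then
    printf '%s\t%s bytes\n' "${f##*/}" "$(wc -c < "$f" | tr -d ' ')"
  else
    printf '%s\tmissing\n' "${f##*/}"
  fi
done
```

Feeding the real 19 BAM paths through a loop like this (and noting which jobs segfaulted) would make a size pattern, or the lack of one, obvious at a glance.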

eddelbuettel commented 3 years ago

Damn, those are big files, and I concur. Local admin help may be best.

Even 8 MB would be too big for a package (but could go into a data package). I was hoping for 8 (or 80) KB (compressed). No more.

AnnabelPerry commented 3 years ago

Haha alright - I've emailed the admins. Hopefully this issue will at least be on their radar for maintenance tomorrow

AnnabelPerry commented 3 years ago

Got it fixed! I'm checking for other issues now