Closed: carlahurt closed this issue 3 years ago
Hello. This is almost certainly a disk space issue. I'm positive that you ran out of disk space right at the beginning of the aligning step. Please be sure that you have plenty of free disk space, like hundreds of GB or more, as much as you can get. ipyrad creates numerous temporary files (most of which are cleaned up after each step), but we assume that disk space is essentially unlimited. If your assembly is especially large then this is even more true. Unfortunately there isn't a way to restart within a given step, so you'll have to run step 3 again.

The clustering step can be impacted by very long reads, paired-end data, and also very noisy data. I would verify with fastqc that your data does not contain a significant amount of low-quality bases, and if it does I would trim the reads (during step 2) to remove as much of this as possible. This will speed up the clustering step.
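As a quick sanity check before relaunching step 3, you can verify free space on the filesystem that holds the project directory. A minimal sketch, assuming a Linux system with GNU df; `PROJECT_DIR` is a placeholder for whatever your params file names as the project_dir:

```shell
#!/bin/sh
# Minimal free-space check before re-running ipyrad step 3.
# PROJECT_DIR is a placeholder; point it at the project_dir from your params file.
PROJECT_DIR="${PROJECT_DIR:-.}"

# Available space, in whole GiB, on the filesystem holding the project directory.
avail_gb=$(df -P -BG "$PROJECT_DIR" | awk 'NR==2 {gsub(/G/, "", $4); print $4}')
echo "Free space under $PROJECT_DIR: ${avail_gb} GB"

# Hundreds of GB are recommended above; warn well below that.
if [ "$avail_gb" -lt 100 ]; then
    echo "WARNING: under 100 GB free; step 3 may fail mid-alignment" >&2
fi
```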
Hey, Isaac, Dr. Hurt’s HPC admin here. Where are the temporary files stored by default: /tmp or elsewhere? If /tmp, is there a way to override that, via environment variable or parameter?
Hello Mike, all the temporary files are created within the project_dir, which is specified in the params file for a given assembly. We don't touch the filesystem outside of this directory.
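For reference, the project_dir is parameter [1] near the top of the params file; the paths and assembly name below are illustrative, not taken from Carla's actual params file:

```
------- ipyrad params file (v.0.9.*)-------------------------------------------
barb1                   ## [0] [assembly_name]: Assembly name. Used to name output directories
/scratch/churt/barb1    ## [1] [project_dir]: Project dir (made in curdir if not present)
```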
In that case, we should have some dozen or so TB free there. Can look more closely on Tuesday or later.
If it's not disk it could also be some other resource if there are quotas, for example I have seen issues with quotas on max number of files which could cause a similar kind of behavior.
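A quick way to check for file-count (inode) limits on the filesystem holding the project directory. A sketch, assuming Linux with GNU df; the `quota` tool may not be installed on every node, and `PROJECT_DIR` is again a placeholder:

```shell
#!/bin/sh
# Check inode usage and any per-user quotas on the project filesystem.
PROJECT_DIR="${PROJECT_DIR:-.}"  # placeholder for the params-file project_dir

# Percentage of inodes in use on that filesystem (e.g. "3%").
inode_use=$(df -P -i "$PROJECT_DIR" | awk 'NR==2 {print $5}')
echo "Inode usage under $PROJECT_DIR: $inode_use"

# Per-user block/inode quotas, if the quota tools are installed.
quota -s 2>/dev/null || echo "quota tool not installed or no quotas reported"
```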
Far as I know, we have quota reporting, but no limits on space, file count, inodes, etc. Can verify later and let you know, thanks.
Hi Mike,
Thank you for trying to figure this out. Let me know if I need to delete some files.
Carla
Carla Hurt Associate Professor of Biology Tennessee Tech University (931) 372-3143 https://sites.tntech.edu/hurtlab
From: Mike Renfro. Sent: Sunday, July 4, 2021 4:45:02 PM. To: dereneaton/ipyrad. Subject: Re: [dereneaton/ipyrad] Crash at the end of step 3 (#448)
From the file server's quota reports:
User quota on /mnt/xfs1 (/dev/mapper/VGMD1-LVMD1)

                      Blocks                            Inodes
User ID     Used   Soft  Hard  Warn/Grace       Used    Soft  Hard  Warn/Grace
---------- --------------------------------- ---------------------------------
...
churt       2.8T      0     0  00 [------]     307.4k      0     0  00 [------]
...
So there shouldn't be any file size or file count limits in place.
Some strategic googling leads me to believe that the error messages we are seeing are SLURM red herrings:
https://github.com/E3SM-Project/E3SM/issues/3138
https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04725.html
https://www.lstsrv.ncep.noaa.gov/pipermail/ncep.list.fv3-announce/2020-September/000410.html
Looking back at the output you originally sent, it seems possible that these messages are internal SLURM noise and that the ipyrad assembly actually completed successfully.
Is it possible that step 6 actually ran to completion? Did you try running step 7?
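If step 6 did complete, step 7 can be launched on its own with the standard ipyrad CLI (`-p` selects the params file, `-s` the steps to run). A sketch, guarded so it degrades gracefully when ipyrad isn't on the PATH:

```shell
#!/bin/sh
# Run only step 7 against the existing assembly; earlier steps are not re-run.
# params-barb1.txt is the params file attached earlier in this thread.
if command -v ipyrad >/dev/null 2>&1; then
    ipyrad -p params-barb1.txt -s 7
    status="ran step 7"
else
    status="skipped: ipyrad not on PATH (activate its conda environment first)"
fi
echo "$status"
```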
You're right - it worked!!! Thank you for your help - I apologize for the unnecessary drama. I assumed that with the two pages of error messages it couldn't have worked.
Very good. This is my favorite kind of problem.... the one that solves itself ;)
Hello, I am running a denovo assembly on a very large salamander GBS dataset. Step 3 took a very long time (12 days), and then something appeared to have gone wrong during the aligning step. This caused it to crash in step 6. Any ideas on what went wrong? Also, is it possible to pick this back up after "chunking clusters" in step 3 and skip the long wait? I posted the errors in two snips - the messages were too long to fit in a single screenshot. I'm also attaching my params file for reference.
params-barb1.txt
Thank you, Carla