Errors seen across multiple samples related to memory

slsevilla commented 1 year ago

Problem: Multiple errors occurred across a significant number of samples in the latest sample run.

Example errors from log /data/khanlab2/processed_DATA/ngs_pipeline_SJ031111=SJEPD031111_D1=20220911_20221116_151637.log :

Error executing rule FUSION_CATCHER on cluster (jobid: 33, external: 52717158, jobscript: /gpfs/gsfs10/users/khanlab2/processed_DATA/.snakemake/tmp.yznfwbmd/FUSION_CATCHER.33). For error details see the cluster log and the log files of the involved rule(s).
Error executing rule mixcr_RNASeq on cluster (jobid: 46, external: 52717129, jobscript: /gpfs/gsfs10/users/khanlab2/processed_DATA/.snakemake/tmp.yznfwbmd/mixcr_RNASeq.46). For error details see the cluster log and the log files of the involved rule(s).
Error executing rule FUSION_CATCHER on cluster (jobid: 33, external: 52732657, jobscript: /gpfs/gsfs10/users/khanlab2/processed_DATA/.snakemake/tmp.yznfwbmd/FUSION_CATCHER.33). For error details see the cluster log and the log files of the involved rule(s).
Error executing rule mixcr_RNASeq on cluster (jobid: 46, external: 52735119, jobscript: /gpfs/gsfs10/users/khanlab2/processed_DATA/.snakemake/tmp.yznfwbmd/mixcr_RNASeq.46). For error details see the cluster log and the log files of the involved rule(s).
Error executing rule arriba on cluster (jobid: 34, external: 52717235, jobscript: /gpfs/gsfs10/users/khanlab2/processed_DATA/.snakemake/tmp.yznfwbmd/arriba.34). For error details see the cluster log and the log files of the involved rule(s).
Exiting because a job execution failed. Look above for error message

Review of one error log log/FUSION_CATCHER.52732657.e

Error message:

tr: write error: Disk quota exceeded

Solution: It appears that the errors in this project are due to disk space issues. Because we are attempting to move analysis to a new location (related to the problem with Biowulf, #12), this is a larger concern. We are not using scratch space effectively, and we keep intermediate files that no downstream analysis uses, which leaves a large pipeline footprint per sample. We will need to determine a course of action to handle reprocessing of existing samples, plus new samples coming through the pipeline, more effectively.
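As a starting point for deciding which intermediate files to drop, here is a minimal Python sketch that totals the on-disk footprint of one sample directory by file extension. The function names and the idea of grouping by extension are assumptions for illustration, not part of the pipeline. Note that `os.path.splitext` only sees the last suffix, so `x.fastq.gz` is counted under `.gz`.

```python
import os
from collections import defaultdict


def footprint_by_extension(root):
    """Return {extension: total_bytes} for all regular files under root."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Skip symlinks so linked references are not counted twice.
                continue
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(path)
    return dict(totals)


def report(root, top=10):
    """Print the `top` largest extensions under root, biggest first."""
    totals = footprint_by_extension(root)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    for ext, size in ranked[:top]:
        print(f"{ext:>10}  {size / 1e9:10.3f} GB")
```

Calling `report(sample_dir)` on a processed sample directory (the path is whatever your run uses) would show which file types dominate the footprint, and so which ones are candidates for scratch space or cleanup.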

kopardev commented 1 year ago

I have a few questions/observations:

tr: write error: ... What is tr? Does it mean that the error occurred while running the tr command, or is it simply an abbreviation for trace or something?

checkquota returns this

% checkquota|grep -i "khan\|clin"
/data(Clinomics):          22.0 TB    31.0 TB   71.03%   602950 32000000    1.88%
/data(khanlab):           218.4 TB   221.0 TB   98.80% 22798400 32000000   71.25%
/data(khanlab2):           93.2 TB   117.0 TB   79.65%  3320539 31457280   10.56%
/data(khanlab3):          155.9 TB   200.0 TB   77.97%  4860086 32000000   15.19%

which suggests we have lots of space under khanlab2... so why Disk quota exceeded?
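To make this check repeatable, here is a rough Python sketch that parses rows like the ones above and flags mounts close to either limit. It assumes the columns are used space, space quota, space percent, file count, file limit, and file percent, which matches the ratios in the numbers shown; the header row was filtered out by the grep. The function names and the 95% threshold are hypothetical.

```python
import re

# One checkquota row: mount, used, quota, space %, files, file limit, file %.
# The column meanings are an assumption inferred from the pasted output.
ROW = re.compile(
    r"^(?P<mount>\S+):\s+"
    r"(?P<used>[\d.]+)\s+(?P<used_unit>\w+)\s+"
    r"(?P<quota>[\d.]+)\s+(?P<quota_unit>\w+)\s+"
    r"(?P<space_pct>[\d.]+)%\s+"
    r"(?P<files>\d+)\s+(?P<file_limit>\d+)\s+"
    r"(?P<files_pct>[\d.]+)%\s*$"
)

SAMPLE = """\
/data(Clinomics):          22.0 TB    31.0 TB   71.03%   602950 32000000    1.88%
/data(khanlab):           218.4 TB   221.0 TB   98.80% 22798400 32000000   71.25%
/data(khanlab2):           93.2 TB   117.0 TB   79.65%  3320539 31457280   10.56%
/data(khanlab3):          155.9 TB   200.0 TB   77.97%  4860086 32000000   15.19%
"""


def parse_rows(text):
    """Parse quota rows into dicts; lines that do not match are ignored."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "mount": m["mount"],
                "space_pct": float(m["space_pct"]),
                "files_pct": float(m["files_pct"]),
            })
    return rows


def near_limit(rows, threshold=95.0):
    """Return mounts whose space or file-count usage is at or above threshold."""
    return [
        r["mount"] for r in rows
        if max(r["space_pct"], r["files_pct"]) >= threshold
    ]
```

With the rows pasted above, `near_limit(parse_rows(SAMPLE))` flags only `/data(khanlab)`. In this snapshot khanlab2 is not near either the space or the file-count limit.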

slsevilla commented 1 year ago

I talked with Xinyu this morning, and it was a memory issue (perhaps he had deleted files between the errors occurring and you running checkquota). He has a list of the affected projects and will delete these runs and restart them.

CCRGeneticsBranch / khanlab_ngs_pipeline

Errors seen across multiple samples related to memory #13