Closed: FedericaBrando closed this issue 5 months ago
As of now:
There is a problem when making the region file: the step that splits the mutations to annotate across n-1 cores gets stuck; it has been like this for 4 days.
This is part of the job sent to the cluster (in this case: node bbgn013, 56 cores, 16 GB of memory):
...
+ name=53
+ singularity exec ../containers_24/vep.simg vep -i /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split.51 -o /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split_51.vep.out --assembly GRCh38 --no_stats --cache --offline --symbol --protein --tab --canonical --mane --numbers --no_headers --dir ../datasets_24/vep
+ singularity exec ../containers_24/vep.simg vep -i /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split.52 -o /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split_52.vep.out --assembly GRCh38 --no_stats --cache --offline --symbol --protein --tab --canonical --mane --numbers --no_headers --dir ../datasets_24/vep
+ wait
+ singularity exec ../containers_24/vep.simg vep -i /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split.53 -o /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split_53.vep.out --assembly GRCh38 --no_stats --cache --offline --symbol --protein --tab --canonical --mane --numbers --no_headers --dir ../datasets_24/vep
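For context, here is a minimal sketch of what this splitting-and-annotation step presumably does (reconstructed from the trace above, not the actual pipeline script; the chunking flags and file names are assumptions):

```bash
#!/usr/bin/env bash
# Sketch reconstructed from the trace above (assumed, not the real script):
# split the mutations file into n-1 chunks and run one vep per chunk in parallel.
set -euo pipefail

ncores=$(nproc)              # 56 on bbgn013
nchunks=$((ncores - 1))
tmpdir=$(mktemp -d)

# Split by lines into equal chunks: split.00, split.01, ...
split -d --number=l/"${nchunks}" muts.tsv "${tmpdir}/split."

for f in "${tmpdir}"/split.*; do
    name=${f##*.}
    singularity exec ../containers_24/vep.simg vep \
        -i "$f" -o "${tmpdir}/split_${name}.vep.out" \
        --assembly GRCh38 --no_stats --cache --offline --symbol --protein \
        --tab --canonical --mane --numbers --no_headers \
        --dir ../datasets_24/vep &
done
wait    # block until all background vep processes finish
```

Each vep runs as a background job, so all n-1 chunks are annotated concurrently and share the node's 16 GB of memory.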
When I look into the tmp folder, the files do not change size:
❯ ll /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/ | head
total 19G
drwx------ 2 fbrando nlopezb_g 4.0K Mar 29 23:36 ./
drwx------ 6 fbrando root 94 Mar 29 23:35 ../
-rw-rw---- 1 fbrando nlopezb_g 5.6G Mar 29 23:16 mutations.tsv
-rw-rw---- 1 fbrando nlopezb_g 5.6G Mar 29 23:35 muts.tsv
-rw-rw---- 1 fbrando nlopezb_g 101M Mar 29 23:35 split.00
-rw-rw---- 1 fbrando nlopezb_g 19M Mar 29 23:37 split_00.vep.out
-rw-rw---- 1 fbrando nlopezb_g 103M Mar 29 23:36 split.01
-rw-rw---- 1 fbrando nlopezb_g 33M Mar 29 23:38 split_01.vep.out
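To confirm the outputs are really stalled rather than just slow, one can watch the sizes for a while:

```bash
# Re-list the split outputs every 60 s; a stalled file keeps the same size and mtime.
watch -n 60 'ls -lh /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split_*.vep.out'
```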
If I run the command in an interactive session with 1 core and 6 GB of memory, after a couple of hours I have a split_00.vep.out file of 4.0 GB:
❯ ll split_00.vep.out
-rw-rw---- 1 fbrando nlopezb_g 4.0G Apr 2 18:55 split_00.vep.out
The processes on the bbgn013 node are in the Running state. It's a bit weird.
ping @migrau
I can see the swap memory at 100% on that node. Maybe because you "only" requested 16 GB of RAM, it started using swap, and that is slowing down all the r/w processes? Not sure, but since you requested all the CPUs on that node, you could request all the RAM too. In your test example the memory-per-CPU ratio was much higher than 56 cpus / 16 GB.
16000 MB / 56 cpus is about 285 MB per vep command, which is probably not optimal.
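For reference, a few standard commands that show this kind of swap thrashing on a node (nothing pipeline-specific here):

```bash
free -h                            # the Swap line shows how much swap is in use
vmstat 1 5                         # si/so columns: pages swapped in/out per second
ps -C perl -o pid,state,rss,cmd    # vep runs under perl; RSS = resident memory per process
```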
What was the solution, Fede?
As you pointed out, I was requesting too little memory, so I allocated 200 GB and it was done in a few hours! Silly me 🥲
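For the record, the fix is simply to match the memory request to the CPU request. A sketch assuming a Slurm-style submission (the scheduler, flags, and script name here are assumptions, not the actual submit command):

```bash
# Assumed Slurm-style request: 56 cores plus enough RAM for ~55 parallel vep
# processes, instead of the original 16 GB total. run_vep_split.sh is a
# placeholder name for the actual job script.
sbatch --nodes=1 --cpus-per-task=56 --mem=200G run_vep_split.sh
```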
We found some genes missing from the CDS_BIOMART dataset that should have been there.
The region dataset was rebuilt, and as of now IntOGen is running again.
The region dataset contains all expected genes. The IntOGen run completed and is stored here:
/workspace/datasets/intogen/runs/v2024/20240409_ALL/
We decided to update the regions (to include the 25 bp splice sites) for all methods once we have more clarity on the impact of this change on the pipeline.
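When we get to it, padding each region by 25 bp on both sides could look like this (a sketch assuming the regions are in BED format and bedtools is available; the real regions format may differ):

```bash
# Hypothetical sketch: extend every CDS interval by 25 bp to capture splice
# sites, then re-sort and merge overlaps. File names are placeholders.
bedtools slop -i cds.regions.bed -g genome.chrom.sizes -b 25 \
    | sort -k1,1 -k2,2n \
    | bedtools merge -i stdin > cds.splice25.regions.bed
```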
New issue for implementing a possible solution to the regions problem here:
To build the region dataset in IntOGen 23 we use different steps:

1. `ENSEMBL_TRANSCRIPT`
   a. This file is used in the PARSEDVEP step to filter VEP results.
2. `BIOMART_CDS` → `CDS_FILT_BIOMART`
3. `CDS_FILT_BIOMART` is then used to build several things:
   a. refCDS for dNdScv
   b. cds.regions.gz for the IntOGen pipeline [ComputeProfile, OncodriveFML, OncodriveCLUSTL, SMregions]
   c. canonical.regions.gz for the BoostDM connection

In IntOGen v2024, we want to update to MANE. This will change the above-mentioned steps (see the filtering sketch at the end of this section):

1. `ENSEMBL_TRANSCRIPT` should not be used anymore, neither for building the regions dataset nor for the pipeline.
2. `BIOMART_CDS`. This file is a dependency used to build several things:
   a. refCDS for dNdScv
   b. cds.regions.gz
   c. canonical.regions.gz

Problems:
`ENSEMBL_TRANSCRIPT`
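As referenced above, a minimal sketch of what switching the filter from ENSEMBL_TRANSCRIPT to MANE could look like on the VEP output (the column index is a placeholder; with --tab the position of MANE_SELECT depends on the requested fields):

```bash
# Hypothetical sketch: keep only rows with a non-empty MANE_SELECT annotation
# instead of filtering on ENSEMBL_TRANSCRIPT IDs. $18 is a placeholder index.
awk -F'\t' '$18 != "" && $18 != "-"' split_00.vep.out > split_00.mane.out
```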