bbglab / intogen-plus

a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients.
https://www.intogen.org/search

IntOGen-BoostDM connection | redefine regions #20

Closed FedericaBrando closed 5 months ago

FedericaBrando commented 5 months ago

To build the region dataset in IntOGen 23 we use several steps:

  1. Query the Ensembl database to retrieve the canonical transcripts. ENSEMBL_TRANSCRIPT
    a. This file is used in the PARSEDVEP step to filter VEP results.
  2. Query BioMart to get the exon regions. BIOMART_CDS
  3. Filter BIOMART_CDS with the list of ENSEMBL_TRANSCRIPT. CDS_FILT_BIOMART (see the sketch after this list)
  4. This CDS_FILT_BIOMART is then used to build several things:
    a. refCDS for dNdScv
    b. cds.regions.gz for the IntOGen pipeline [ComputeProfile, OncodriveFML, OncodriveCLUSTL, SMregions]
    c. canonical.regions.gz for the BoostDM connection
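
For reference, a minimal sketch of the filtering in step 3, assuming ENSEMBL_TRANSCRIPT and BIOMART_CDS have already been downloaded as TSV files. File names and column headers below are placeholders, not the actual pipeline paths.

import pandas as pd

# ENSEMBL_TRANSCRIPT: one canonical transcript ID per line (hypothetical file name)
canonical = set(
    pd.read_csv("ensembl_canonical_transcripts.tsv", sep="\t", header=None)[0]
)

# BIOMART_CDS: BioMart export with per-exon CDS coordinates (hypothetical columns)
cds = pd.read_csv("biomart_cds.tsv.gz", sep="\t")

# CDS_FILT_BIOMART: keep only rows whose transcript is in the canonical list
cds_filt = cds[cds["Transcript stable ID"].isin(canonical)]
cds_filt.to_csv("cds_filt_biomart.tsv.gz", sep="\t", index=False)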

In IntOGen v2024, we want to update to MANE. This will change the above-mentioned steps:

  1. ENSEMBL_TRANSCRIPT should not be used anymore, neither for building the regions dataset nor for the pipeline.
  2. Add a MANE_Select filter inside the BioMart query when building BIOMART_CDS. This file is a dependency used to build several things (see the sketch below):
    a. refCDS for dNdScv
    b. cds.regions.gz
    c. canonical.regions.gz
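
One possible way to apply the MANE restriction on the export itself, assuming the BioMart query is extended with a MANE Select attribute; the column name below is a guess and may differ in the real export.

import pandas as pd

cds = pd.read_csv("biomart_cds.tsv.gz", sep="\t")

# Keep only rows that carry a MANE Select annotation (non-empty value)
mane_cds = cds[cds["RefSeq match transcript (MANE Select)"].notna()]
mane_cds.to_csv("biomart_cds_mane.tsv.gz", sep="\t", index=False)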

Problems:

FedericaBrando commented 5 months ago

As of now:

FedericaBrando commented 5 months ago

There is a problem when making the region file:

The step that splits the mutations to annotate across n-1 cores gets stuck; it has been like this for 4 days.

This is part of the jobs sent to the cluster (in this case: bbgn013; 56 cores; 16 GB of memory); a rough sketch of the fan-out follows the trace:

...
+ name=53
+ singularity exec ../containers_24/vep.simg vep -i /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split.51 -o /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split_51.vep.out --assembly GRCh38 --no_stats --cache --offline --symbol --protein --tab --canonical --mane --numbers --no_headers --dir ../datasets_24/vep
+ singularity exec ../containers_24/vep.simg vep -i /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split.52 -o /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split_52.vep.out --assembly GRCh38 --no_stats --cache --offline --symbol --protein --tab --canonical --mane --numbers --no_headers --dir ../datasets_24/vep
+ wait
+ singularity exec ../containers_24/vep.simg vep -i /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split.53 -o /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/split_53.vep.out --assembly GRCh38 --no_stats --cache --offline --symbol --protein --tab --canonical --mane --numbers --no_headers --dir ../datasets_24/vep
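
The fan-out of this step looks roughly like the sketch below. This is illustrative only: the chunking scheme, chunk count, and file names are guesses based on the trace above, not the actual pipeline code.

import subprocess
from itertools import cycle

N_CHUNKS = 55  # cores - 1 on a 56-core node

# Split the mutation file into N_CHUNKS chunk files (round-robin by line)
outs = [open(f"split.{i:02d}", "w") for i in range(N_CHUNKS)]
with open("muts.tsv") as fh:
    for line, out in zip(fh, cycle(outs)):
        out.write(line)
for out in outs:
    out.close()

# One VEP process per chunk (same flags as in the trace), then wait for all of them
procs = []
for i in range(N_CHUNKS):
    procs.append(subprocess.Popen(
        ["singularity", "exec", "../containers_24/vep.simg", "vep",
         "-i", f"split.{i:02d}", "-o", f"split_{i:02d}.vep.out",
         "--assembly", "GRCh38", "--no_stats", "--cache", "--offline",
         "--symbol", "--protein", "--tab", "--canonical", "--mane",
         "--numbers", "--no_headers", "--dir", "../datasets_24/vep"]))
for p in procs:
    p.wait()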

When I look into the tmp folder, the files do not change size:

❯ ll /tmp/jobs/fbrando/9477495/tmp.wjmuTgxTjU/ | head
total 19G
drwx------ 2 fbrando nlopezb_g 4.0K Mar 29 23:36 ./
drwx------ 6 fbrando root        94 Mar 29 23:35 ../
-rw-rw---- 1 fbrando nlopezb_g 5.6G Mar 29 23:16 mutations.tsv
-rw-rw---- 1 fbrando nlopezb_g 5.6G Mar 29 23:35 muts.tsv
-rw-rw---- 1 fbrando nlopezb_g 101M Mar 29 23:35 split.00
-rw-rw---- 1 fbrando nlopezb_g  19M Mar 29 23:37 split_00.vep.out
-rw-rw---- 1 fbrando nlopezb_g 103M Mar 29 23:36 split.01
-rw-rw---- 1 fbrando nlopezb_g  33M Mar 29 23:38 split_01.vep.out

If I run the command in an interactive session (1 core, 6 GB of memory), after a couple of hours I have a split_00.vep.out file of 4.0 GB:

❯ ll split_00.vep.out
-rw-rw---- 1 fbrando nlopezb_g 4.0G Apr  2 18:55 split_00.vep.out

The processes on the bbgn013 node are in the Running state. It's a bit weird.

ping to @migrau

migrau commented 5 months ago

I can see the swap memory at 100% on that node. Maybe because you "only" requested 16 GB of RAM, it started using swap and that is slowing down all the r/w processes? Not sure, but since you requested all the CPUs in that node, you could request all the RAM too. In your test example the memory-per-CPU ratio was very high compared to 56 CPUs / 16 GB.

migrau commented 5 months ago

[image]

16000 MB / 56 CPUs is about 285 MB per vep command, which is probably not optimal.
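
The same back-of-the-envelope check as a short snippet (values taken from this job):

# Memory available per VEP process when one process runs per core
total_mem_mb = 16000  # memory requested for the job
n_vep_procs = 56      # one VEP process per core
print(f"{total_mem_mb / n_vep_procs:.1f} MB per VEP process")  # ~285.7 MB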

migrau commented 5 months ago

What was the solution, Fede?

FedericaBrando commented 5 months ago

As you pointed out, I was using too little memory, so I allocated 200 GB and it was done in a few hours! Silly me 🥲

FedericaBrando commented 5 months ago

We found some genes missing from CDS_BIOMART that should have been there.

The region dataset was rebuilt, and as of now IntOGen is running again.

FedericaBrando commented 5 months ago

The region dataset contains all expected genes. The IntOGen run completed and is stored here: /workspace/datasets/intogen/runs/v2024/20240409_ALL/

We decided to update the regions (to include 25 bp splice sites) for all methods once we have more clarity on the impact of this change on the pipeline.
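
For when that change is made, a minimal sketch of padding each region by 25 bp on both sides; the column names for cds.regions.gz are assumptions, not the actual header.

import pandas as pd

PAD = 25  # splice-site padding in bp

regions = pd.read_csv("cds.regions.gz", sep="\t")
# Extend every interval by PAD on both sides, never going below position 1
regions["START"] = (regions["START"] - PAD).clip(lower=1)
regions["END"] = regions["END"] + PAD
regions.to_csv("cds.regions.splice.gz", sep="\t", index=False)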

FedericaBrando commented 5 months ago

New issue for implementing a possible solution to the regions problem here: