BBGwiki | add mappable genome info

msguixe commented 3 months ago

Datasets > Other data

Create a new subfolder for mappable genome

Create new section to explain the mappable genome

The mappable genome was created by Claudia Arnedo in the Hotspots project. This notebook from the hotspots repo explains how the mappable genome is created, which files are used for filtering, and where they were obtained.
All the files regarding the mappable genome are stored here: /workspace/projects/hartwig/hotspots/hotspotfinder/
I have e-mailed Claudia to arrange a meeting so she can explain in more detail how all the files were created and which ones are the most up to date.

msguixe commented 3 months ago

Update after discussing with Claudia Arnedo:

There are 3 main steps for generating the mappable genome:

A first step performed by Loris with the GEM Mapper tool, described in this paper from 2012. The resulting file can be found here: /workspace/projects/pileup_mappability/hg38/hg38_100bp.coverage.regions.gz
A second step to filter out blacklisted regions (aka problematic regions) from the ENCODE project. This file is downloaded from the ENCODE project website or from the UCSC database. Here is an explanation of what this file is and what it includes. From here we can download the most up-to-date file.
A third step to filter out common polymorphisms using gnomAD. This step may be omitted depending on our project's objectives. Claudia's hotspots project needed a very clean mappable genome, so she had to avoid any possible germline contamination.
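
To make the filtering in steps 2 and 3 concrete, here is a minimal sketch of subtracting one set of intervals from another. It is only an illustration, not the code used in Claudia's notebook: the function name `subtract_intervals` and the example coordinates are made up, and it assumes 0-based, half-open BED-style intervals. In practice a tool such as `bedtools subtract` would likely do the same job on the real files.

```python
from collections import defaultdict


def subtract_intervals(regions, blacklist):
    """Remove blacklisted spans from regions.

    Both inputs are iterables of (chrom, start, end) tuples with 0-based,
    half-open coordinates (BED convention, assumed here).
    Returns the surviving intervals as a sorted list.
    """
    # Group blacklist intervals per chromosome and sort them by start.
    black_by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        black_by_chrom[chrom].append((start, end))
    for spans in black_by_chrom.values():
        spans.sort()

    kept = []
    for chrom, start, end in sorted(regions):
        cursor = start  # leftmost position not yet kept or discarded
        for b_start, b_end in black_by_chrom.get(chrom, []):
            if b_end <= cursor:
                continue  # blacklist span ends before the part still pending
            if b_start >= end:
                break  # sorted by start, so no later span can overlap
            if b_start > cursor:
                kept.append((chrom, cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            kept.append((chrom, cursor, end))
    return kept


# Hypothetical toy coordinates, not taken from the real files.
mappable = [("chr1", 100, 500), ("chr1", 700, 900), ("chr2", 0, 50)]
blacklisted = [("chr1", 200, 300), ("chr1", 450, 750)]
filtered = subtract_intervals(mappable, blacklisted)
```

The same function could cover step 3 by passing each common gnomAD variant as a 1-bp interval in place of the blacklist, assuming the variants have first been converted to the same coordinate convention.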

The input file generated in step 1, the downloaded ENCODE blacklisted regions file (not the latest) and the gnomAD 3.0 file are in Claudia's hotspots repo here: /workspace/projects/hartwig/hotspots/hotspotfinder/2022_06/mappable_genome/data/inputs/. In this folder there is a file, paths.txt, with the paths to the original files.

Questions:

Should we update the methodology to generate the file in step 1? I found this method: GenMap. Other suggestions? @FerriolCalvet @rblancomi
Should we create a folder in here: /workspace/projects/genomic_regions/ where we store and maintain the mappable genome along with its input files? To discuss with everyone on the next hackathon.

FerriolCalvet commented 3 months ago

Thanks Monica for the details on this!

I do not know who is using it or how critical it is for the different projects. For example I don't know what decision we would make if this overlaps a gene that we have in one of our panels. (probably discard it in the next design?) We could probably also use it for tagging potentially artifactual mutations.

But I think it makes sense to regenerate it using this new tool, since it can also account for unique mapping when allowing at least one mismatch, which is an important case we are interested in. I have no idea about other tools.
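
As a rough illustration of what "unique mapping with at least one mismatch" means, here is a toy sketch of k-mer mappability. It is not how GEM or GenMap work internally: it is a brute-force version that assumes Hamming distance only, looks at the forward strand only, and uses made-up names (`kmer_mappability`, `max_mismatches`). The score of a position is 1 divided by the number of places where its k-mer occurs with at most the allowed number of mismatches, so 1.0 means uniquely mappable.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def kmer_mappability(seq, k, max_mismatches=0):
    """Toy mappability track for seq (forward strand only, assumption).

    For every k-mer start position, count how many k-mers in seq match it
    with at most max_mismatches substitutions (the k-mer itself included),
    and return 1 / count per position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    scores = []
    for query in kmers:
        hits = sum(1 for other in kmers if hamming(query, other) <= max_mismatches)
        scores.append(1.0 / hits)
    return scores


toy_seq = "ACGTACGA"  # hypothetical sequence
exact = kmer_mappability(toy_seq, k=4, max_mismatches=0)
one_mismatch = kmer_mappability(toy_seq, k=4, max_mismatches=1)
```

In this toy sequence every 4-mer is unique when no mismatch is allowed, but the first and last 4-mers (ACGT and ACGA) differ by a single base, so both stop being unique once one mismatch is tolerated. That is the kind of region an exact-match track would keep and a mismatch-aware track would flag.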

For point two, I see the value in having everything in the genomic_regions folder (or whichever folder we decide to put all the reference genome information in) to make it accessible for everyone. But yes, let's discuss it in the next hackathon.

Thanks again Monica!

msguixe commented 3 months ago

Thanks Ferriol.

Definitely we need to talk about ref genome information in the cluster. We have several folders:
/workspace/projects/pileup_mappability/: mappable genome with GEM mapper
/workspace/datasets/mappability_blacklists/: blacklisted regions from ENCODE
/workspace/datasets/genomes/: reference genomes
/workspace/projects/genomic_regions/: genomic regions annotations
...and possibly others that I am missing.

For now, I will generate a mappable genome with an updated blacklisted regions file and save it here: /workspace/projects/pileup_mappability/hg38/hg38_100bp.coverage.regions.filtered_blacklisted.gz
The updated blacklisted regions file will be downloaded here: /workspace/datasets/mappability_blacklists/hg38/20240807/

We can further discuss the organisation of these folders in the hackathon.

msguixe commented 3 months ago

There are two other issues related to this: #77 and #93.

I have compared the trinucleotide counts of the mappable genome (with updated ENCODE blacklisted regions and without filtering the SNPs from gnomAD) with the counts of the genome used in COSMIC (the one used to calculate the signatures). Overall, the two are very similar. We could scale the matrices before running any signature tool, if needed.
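
For reference, a minimal sketch of how such a comparison and scaling could be done. It is not the code behind the comparison above: the function names are made up, contexts are collapsed to the 32 pyrimidine-centred trinucleotides (an assumption, following the usual signature convention), and the scaling shown (ratio of context frequencies) is only one possible normalisation.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_context(tri):
    """Collapse a trinucleotide to its pyrimidine-centred form."""
    if tri[1] in "CT":
        return tri
    return tri.translate(COMPLEMENT)[::-1]  # reverse complement


def count_trinucleotides(seq, regions=None):
    """Count canonical trinucleotide contexts centred inside regions.

    regions is a list of (start, end) pairs, 0-based and half-open, on a
    single sequence; None means the whole sequence.
    """
    seq = seq.upper()
    if regions is None:
        regions = [(0, len(seq))]
    counts = Counter()
    for start, end in regions:
        # Each position is the centre of one context, so the flanking
        # bases may fall just outside the region itself.
        for pos in range(max(start, 1), min(end, len(seq) - 1)):
            tri = seq[pos - 1:pos + 2]
            if set(tri) <= set("ACGT"):  # skip contexts containing N etc.
                counts[canonical_context(tri)] += 1
    return counts


def scaling_factors(target_counts, reference_counts):
    """Per-context factor: reference frequency / target frequency."""
    target_total = sum(target_counts.values())
    reference_total = sum(reference_counts.values())
    factors = {}
    for context, ref_n in reference_counts.items():
        tgt_n = target_counts.get(context, 0)
        if tgt_n == 0:
            continue  # context absent from the target, no factor defined
        factors[context] = (ref_n / reference_total) / (tgt_n / target_total)
    return factors


toy_genome = "ACGTACGT"  # hypothetical sequence
whole = count_trinucleotides(toy_genome)
mappable_only = count_trinucleotides(toy_genome, regions=[(0, 4)])
factors = scaling_factors(mappable_only, whole)
```

Multiplying the per-context mutation counts from the mappable genome by these factors would express them relative to the reference composition. Factors close to 1 for every context would match the observation that the two genomes are very similar.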

This comparison could be shown in the bbgwiki mappable genome page.

@koszulordie

bbglab / bbgwiki

BBGwiki | add mappable genome info #171