Open msguixe opened 3 months ago
Update after discussing with Claudia Arnedo:
There are 3 main steps for generating the mappable genome:
/workspace/projects/pileup_mappability/hg38/hg38_100bp.coverage.regions.gz
The input file generated in step 1, the downloaded ENCODE blacklisted regions file (not the latest) and the gnomAD 3.0 file are in Claudia's hotspots repo here: /workspace/projects/hartwig/hotspots/hotspotfinder/2022_06/mappable_genome/data/inputs/
In this folder there is a file paths.txt with the paths to the original files.
Questions:
/workspace/projects/genomic_regions/ where we store and maintain the mappable genome along with its input files? To discuss with everyone at the next hackathon.
Thanks Monica for the details on this!
I do not know who is using it or how critical it is for the different projects. For example, I don't know what decision we would make if this overlaps a gene that we have in one of our panels (probably discard it in the next design?). We could probably also use it for tagging potentially artifactual mutations.
But I think it makes sense to regenerate it using this new tool, since we can also account for unique mapping with at least one mismatch, which is an important case for us. I have no idea about other tools.
For point two, I see the value in having everything in the genomic_regions folder (or whichever folder we decide to put all the reference genome information in) to make it accessible to everyone. But yes, let's discuss it at the next hackathon.
Thanks again Monica!
Thanks Ferriol.
Definitely we need to talk about ref genome information in the cluster. We have several folders:
/workspace/projects/pileup_mappability/: mappable genome with GEM mapper
/workspace/datasets/mappability_blacklists/: blacklisted regions from ENCODE
/workspace/datasets/genomes/: reference genomes
/workspace/projects/genomic_regions/: genomic regions annotations
...and possibly others that I am missing.
For now, I will generate a mappable genome with an updated blacklisted regions file and save it here:
/workspace/projects/pileup_mappability/hg38/hg38_100bp.coverage.regions.filtered_blacklisted.gz
The updated blacklisted regions file will be downloaded here:
/workspace/datasets/mappability_blacklists/hg38/20240807/
We can further discuss the organisation of these folders in the hackathon.
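For reference, the blacklist filtering step amounts to subtracting the blacklisted intervals from the mappable regions. The following is only a minimal sketch of that idea, not the actual pipeline: the function name `subtract_intervals` is hypothetical, and it assumes both files have been parsed into (chrom, start, end) tuples with BED-style 0-based, half-open coordinates.

```python
from collections import defaultdict


def subtract_intervals(regions, blacklist):
    """Remove blacklisted spans from regions.

    Both inputs are iterables of (chrom, start, end) tuples with
    0-based, half-open coordinates (assumed BED-like convention).
    """
    # Group blacklist spans by chromosome and sort them by start.
    by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        by_chrom[chrom].append((start, end))
    for spans in by_chrom.values():
        spans.sort()

    kept = []
    for chrom, start, end in regions:
        cursor = start  # first position not yet emitted or discarded
        for b_start, b_end in by_chrom.get(chrom, []):
            if b_end <= cursor:
                continue  # span ends before the part still to process
            if b_start >= end:
                break  # spans are sorted, so none of the rest overlap
            if b_start > cursor:
                kept.append((chrom, cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            kept.append((chrom, cursor, end))
    return kept


mappable = [("chr1", 0, 100), ("chr2", 10, 20)]
blacklisted = [("chr1", 10, 20), ("chr1", 50, 60), ("chr2", 0, 100)]
filtered = subtract_intervals(mappable, blacklisted)
```

This scans every blacklist span of a chromosome for each region, which is fine for illustration; on genome-scale files a dedicated interval tool would be the more practical choice.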
There are two other issues related to this: #77 and #93.
I have compared the trinucleotide counts of the mappable genome (with updated ENCODE blacklisted regions and without filtering the SNPs from gnomAD) with the counts of the genome used in COSMIC (the one used to calculate the signatures). Overall, they are very similar. We could scale the matrices before running any signature tool, if needed.
This comparison could be shown in the bbgwiki mappable genome page.
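To illustrate the comparison and the possible scaling, here is a minimal sketch that counts trinucleotides in a sequence and derives per-context scaling factors from two count tables. The function names are hypothetical, and scaling by the ratio of relative frequencies is an assumption about how the matrices would be adjusted, not a description of what was actually run. Contexts are also not collapsed to the pyrimidine-centred form that signature tools usually expect.

```python
from collections import Counter


def trinucleotide_counts(sequence):
    """Count overlapping trinucleotides made only of A, C, G and T."""
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if set(tri) <= valid:  # skip windows containing N or other symbols
            counts[tri] += 1
    return counts


def scaling_factors(source_counts, target_counts):
    """Ratio of target to source relative frequency for each context.

    Multiplying a mutation count matrix row by its factor would rescale
    it from the source genome composition to the target one (assumed
    usage). Contexts absent from either table are skipped.
    """
    src_total = sum(source_counts.values())
    tgt_total = sum(target_counts.values())
    factors = {}
    for tri, n in source_counts.items():
        if n == 0 or target_counts.get(tri, 0) == 0:
            continue
        factors[tri] = (target_counts[tri] / tgt_total) / (n / src_total)
    return factors


toy_counts = trinucleotide_counts("ACGTNACG")
toy_factors = scaling_factors({"AAA": 1, "CCC": 1}, {"AAA": 3, "CCC": 1})
```

Factors close to 1 for every context would support the observation that the two genomes have very similar trinucleotide composition.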
@koszulordie
Datasets > Other data
Create new section to explain the mappable genome
/workspace/projects/hartwig/hotspots/hotspotfinder/