Closed brianpenghe closed 6 years ago
For human data (hg38), those files are inside the tar file that you can download from here: https://www.encodeproject.org/files/ENCFF967OQF/. For mouse data (mm10), the equivalent tar file can be downloaded from here: https://www.encodeproject.org/files/ENCFF026MOH/.
For hg38, after you extract the files from the tar file, you'll have the following files, which you can use with any hg38 input data:
center_sites.starch (a "center sites" file)
chrom_sizes.bed (the lengths of the chromosomes in hg38)
mappable_target.starch (the mappable regions of hg38, minus the ENCODE blacklist)
The blacklist file and the mappable regions file without the blacklist subtracted from it are included too.
NOTE 1: These starch files were made using an older version of starch. If you run into any issues as a result of this, simply uncompress and recompress the file(s) using the latest version of starch, e.g.:
unstarch existing.starch | starch - > new.starch ; mv new.starch existing.starch
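If several archives need recompressing, the one-liner above can be wrapped in a loop. The following is only a sketch: the function name recompress_starch and the temporary file names are made up for illustration, and it assumes unstarch and starch from a current BEDOPS install are on your PATH. Unlike the one-liner, it replaces the original only if both steps succeed.

```shell
# Hypothetical helper (not part of BEDOPS or this repository):
# recompress each given starch archive in place, keeping the original on failure.
recompress_starch() {
    for f in "$@"; do
        tmp_bed="$f.tmp.bed"        # uncompressed BED (illustrative name)
        tmp_starch="$f.tmp.starch"  # freshly compressed archive (illustrative name)
        # Run the two steps separately so a failure in either one is detected.
        if unstarch "$f" > "$tmp_bed" && starch "$tmp_bed" > "$tmp_starch"; then
            mv -- "$tmp_starch" "$f"
        else
            echo "recompress failed for $f; original left untouched" >&2
            rm -f -- "$tmp_starch"
        fi
        rm -f -- "$tmp_bed"
    done
}
```

For example: recompress_starch center_sites.starch mappable_target.starch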
NOTE 2: The names of these files don't include the genome build. You might want to rename them for clarity, e.g. mv chrom_sizes.bed chrom_sizes.hg38.bed. That might come in handy when new genome builds are released, or if you process data from multiple organisms.
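To rename all of the extracted files in one go, something like the following could work. It is a minimal sketch: add_build_tag is a hypothetical helper name, and it assumes each file has a single extension such as .bed or .starch. Files that are missing or already tagged are skipped.

```shell
# Hypothetical helper: insert a genome-build tag before each file's extension,
# e.g. chrom_sizes.bed -> chrom_sizes.hg38.bed
add_build_tag() {
    build="$1"
    shift
    for f in "$@"; do
        [ -e "$f" ] || continue           # skip files that are not present
        name="${f##*/}"                   # file name without any directory part
        case "$name" in
            *."$build".*) continue ;;     # already tagged with this build
            *.*) ;;                       # has an extension; go on to rename
            *) continue ;;                # no extension; leave it alone
        esac
        mv -- "$f" "${f%.*}.$build.${f##*.}"
    done
}

# Run from the directory where the tar file was extracted.
add_build_tag hg38 center_sites.starch chrom_sizes.bed mappable_target.starch
```

Use mm10 as the first argument for the mouse files.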
I was trying to run extractCenterSites.sh to get started, but I guess I need a MAPPABLE-REGIONS file. Do you know how I can generate that file? Or is there a place to download mappable-region files with the ENCODE blacklist regions subtracted?
Thanks