elsasserlab / labcode

Utils to perform frequent data analyses in the lab.
GNU General Public License v3.0

Supported reference genome chrom sizes as internal data #81

Closed by cnluzon 3 years ago

cnluzon commented 3 years ago

Sequence lengths for the reference genomes mm9, mm10, hg38 and hg38_latest are now stored as internal data within the package.

mm9 worked fine, but for hg38 the values imported from the BSgenome.Hsapiens.UCSC.hg38 package did not exactly match the ones in the hg38.fa file to which I mapped the datasets. This generated a flood of warning messages each time bins were called with the genome = hg38 parameter.

The reason is that the UCSC download for the hg38 genome has the main reference in the root directory, plus a separate latest/ directory that holds another reference. I do not expect the differences between the two to produce drastically different results (they seem to be mostly contig fixes that were not present in the hg38 version we mapped to).
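To illustrate the kind of mismatch described above, here is a minimal sketch of a chrom-size comparison between two sources. It is written in Python purely for illustration and is not code from this package; the function name `diff_chrom_sizes` and all sequence names and lengths are hypothetical toy values, not real genome data.

```python
def diff_chrom_sizes(package_sizes, fasta_sizes):
    """Return {name: (package_len, fasta_len)} for every sequence whose
    length differs, or that is present in only one of the two sources."""
    mismatches = {}
    for name in sorted(set(package_sizes) | set(fasta_sizes)):
        pkg_len = package_sizes.get(name)  # None if the sequence is absent
        fa_len = fasta_sizes.get(name)
        if pkg_len != fa_len:
            mismatches[name] = (pkg_len, fa_len)
    return mismatches


# Toy values only (hypothetical): two shared chromosomes plus one contig
# that exists in each source but not in the other.
package_sizes = {"chrA": 1000, "chrB": 500, "contig_fix1": 42}
fasta_sizes = {"chrA": 1000, "chrB": 500, "contig_old1": 40}

mismatches = diff_chrom_sizes(package_sizes, fasta_sizes)
print(mismatches)
```

An empty result would mean both sources agree on every sequence; any entry points at a sequence that could trigger a warning when the two are combined.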

As of now, all the datasets we have generated for human correspond to the hg38 parameter (not the latest patch). If we remap to the latest version, we can switch to hg38_latest and compare.

Two extra advantages of this implementation: 1) Some package dependencies have been removed (BSgenome.Hsapiens.UCSC.hg38, BSgenome.Mmusculus.UCSC.mm9), making installation a bit lighter. 2) We can be sure the references we map to match the ones we get within the package. Otherwise, updates in these dependencies could make this fail again in the future.
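The idea of keeping the supported genomes as internal data can be sketched as a fixed lookup table with an explicit error for anything unsupported. This is a Python illustration under stated assumptions, not the package's actual implementation: the names `_CHROM_SIZES` and `get_chrom_sizes` are hypothetical, and the lengths are placeholders rather than real values.

```python
# Placeholder lengths only (hypothetical). A real table would hold the
# lengths taken from the same FASTA files the datasets were mapped to.
_CHROM_SIZES = {
    "mm9": {"chrA": 1000},
    "mm10": {"chrA": 1100},
    "hg38": {"chrA": 2000},
    "hg38_latest": {"chrA": 2000, "contig_fix1": 42},
}


def get_chrom_sizes(genome):
    """Return a copy of the stored sequence lengths for a supported genome."""
    try:
        # Copy so callers cannot modify the internal table by accident.
        return dict(_CHROM_SIZES[genome])
    except KeyError:
        supported = ", ".join(sorted(_CHROM_SIZES))
        raise ValueError(
            f"Unsupported genome {genome!r}; choose one of: {supported}"
        ) from None


sizes = get_chrom_sizes("hg38")
print(sizes)
```

Because the table ships with the code, the lengths only change when someone edits them deliberately, which is the property point 2) relies on.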

Fixes #70