Sequence lengths for the reference genomes `mm9`, `mm10`, `hg38` and `hg38_latest` are now stored internally within the package.
`mm9` was working fine, but for `hg38` the sequence lengths imported from the `BSgenome.Hsapiens.UCSC.hg38` package did not exactly match those in the `hg38.fa` file to which I mapped the datasets. This generated a huge number of warning messages each time bins were called with the `genome = hg38` parameter.
The reason is that the UCSC download area for the `hg38` genome has the main reference in the root directory and a separate `latest/` directory that holds another reference. I believe the differences between the two will not produce drastically different results (they seem to be mostly contig fixes that are not present in the `hg38` version we mapped to).
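To see which contigs differ between two sources of sequence lengths, a comparison along the following lines can help. This is only a minimal sketch: the function name `compare_seqlengths` and all values are hypothetical placeholders, not real `hg38` lengths and not code from this package.

```r
# Hypothetical sequence lengths for illustration only (not real hg38 values).
# In practice, one vector could come from the dependency package and the other
# from the mapped FASTA, for example from a samtools faidx index (assumption):
#   fai <- read.table("hg38.fa.fai"); fasta_lengths <- setNames(fai$V2, fai$V1)
pkg_lengths   <- c(chr1 = 1000L, chr2 = 800L, chrUn_A = 50L)
fasta_lengths <- c(chr1 = 1000L, chr2 = 800L, chrUn_A = 55L, chrUn_B = 20L)

# Report sequences present in only one source, and shared sequences
# whose lengths disagree.
compare_seqlengths <- function(a, b) {
  shared <- intersect(names(a), names(b))
  list(
    only_in_a      = setdiff(names(a), names(b)),
    only_in_b      = setdiff(names(b), names(a)),
    length_differs = shared[a[shared] != b[shared]]
  )
}

diffs <- compare_seqlengths(pkg_lengths, fasta_lengths)
print(diffs)
```

Any entry in `only_in_a`, `only_in_b` or `length_differs` points to a sequence that would trigger a mismatch warning.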
As of now, all the human datasets we have generated correspond to the `hg38` parameter (not the latest patch). If we remap to the latest version, we can switch to `hg38_latest` and compare.
Two extra advantages of this implementation: 1) Some package dependencies have been removed (`BSgenome.Hsapiens.UCSC.hg38`, `BSgenome.Mmusculus.UCSC.mm9`), making installation a bit lighter. 2) We can be sure the references we map to match the ones stored within the package. Otherwise, updates to these dependencies could make this fail again in the future.
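For readers unfamiliar with the idea, storing the lengths internally amounts to a lookup table keyed by genome name. The sketch below is an assumption about the general shape, not the package's actual code: `.seqlengths` and `get_seqlengths` are hypothetical names and the values are placeholders. In a real R package such a table would typically be shipped as internal data (for example in `R/sysdata.rda`).

```r
# Hypothetical internal table; lengths are placeholders, not real genome values.
.seqlengths <- list(
  mm9         = c(chr1 = 100L, chr2 = 90L),
  mm10        = c(chr1 = 101L, chr2 = 91L),
  hg38        = c(chr1 = 200L, chr2 = 180L),
  hg38_latest = c(chr1 = 200L, chr2 = 181L)
)

# Return the stored sequence lengths for a genome, failing loudly
# if the genome name is not one of the stored references.
get_seqlengths <- function(genome) {
  if (!genome %in% names(.seqlengths)) {
    stop(
      "Unsupported genome: ", genome,
      ". Available: ", paste(names(.seqlengths), collapse = ", ")
    )
  }
  .seqlengths[[genome]]
}

hg38_lengths <- get_seqlengths("hg38")
```

Because the table lives inside the package, its values change only when the package itself is updated, which is what keeps them in step with the mapped reference.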
Fixes #70