Open alexg9010 opened 6 years ago
The "effective genome size" looks like a research topic of its own. There is a database of genome sizes: http://www.genomesize.com/results.php?page=1

The effective genome size is the mappable proportion of the genome, which depends on the read length and the mapping strategy. There doesn't seem to be a good consensus either: here they claim the value for human should be about 2.45 Gb, while MACS2 uses 2.7 Gb: https://www.nature.com/articles/nbt.1518/tables/1

There is also this paper, "Fast Computation and Applications of Genome Mappability", which presents a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches, and shows that mappability varies greatly between species and gene classes: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377

This is how deepTools deals with the problem: https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html That again covers only a number of organisms and is not really generic.
From version v0.0.18 onward, the genome size for peak calling is inferred from the provided genome file instead of using the defaults available in MACS. This changes the background normalization, so the called peaks will differ.
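To make the idea of inferring the size from the genome file concrete, here is a minimal sketch. It is not the pipeline's actual implementation: the function name `genome_size_from_fasta` and the `count_n` switch are hypothetical, and counting non-N bases is only one rough approximation of the effective genome size (it ignores read length and mappability).

```python
import io

def genome_size_from_fasta(handle, count_n=False):
    """Sum the sequence lengths in a FASTA stream.

    By default N bases are skipped, as a crude stand-in for the
    unmappable part of the genome (an assumption of this sketch).
    """
    total = 0
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        if count_n:
            total += len(line)
        else:
            total += len(line) - line.upper().count("N")
    return total

# Tiny in-memory genome with two records.
demo = io.StringIO(">chr1\nACGTNNNN\nacgt\n>chr2\nNNAC\n")
print(genome_size_from_fasta(demo))  # 10
```

With a real genome one would pass an open file handle instead of the `io.StringIO` demo. Removing or masking contigs in the FASTA changes the returned number, which is exactly why the peaks can change.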
This was introduced to reduce the number of arguments the user has to fill in, but it could be a source of variation if users provide different genomes for different runs on the same fastq files.
Yes, but if the user sets `hs` or `mm` as the value, then it will always be the same … Anyway, in general, using the size of the provided genome is a better way of controlling it. I just need to document somewhere that the results could differ if the user decides to change or reduce the genome.
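The difference between the two modes can be sketched as follows. The shortcut values are the ones MACS2 documents for its `-g`/`--gsize` option; the helper `resolve_gsize` is hypothetical and only illustrates that a shortcut always resolves to the same number, whereas an explicit size follows whatever genome was provided.

```python
# Shortcut values that MACS2 accepts for -g/--gsize.
MACS2_GSIZE_SHORTCUTS = {"hs": 2.7e9, "mm": 1.87e9, "ce": 9e7, "dm": 1.2e8}

def resolve_gsize(value):
    """Return the genome size as a float (hypothetical helper).

    A shortcut such as "hs" maps to a fixed constant; anything else
    is treated as an explicit number, e.g. one inferred from the
    provided genome file.
    """
    if value in MACS2_GSIZE_SHORTCUTS:
        return MACS2_GSIZE_SHORTCUTS[value]
    return float(value)

print(resolve_gsize("hs"))          # fixed, independent of the genome file
print(resolve_gsize("3100000000"))  # explicit size, tracks the genome file
```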