BIMSBbioinfo / pigx_chipseq

Pipeline for Analysis of ChIP-Seq data
http://bioinformatics.mdc-berlin.de/pigx/
GNU General Public License v3.0

Elaborate on the details of inferring the genome size from the input genome. #97

Open alexg9010 opened 6 years ago

alexg9010 commented 6 years ago

From version v0.0.18 we infer the genome size for peak calling from the provided genome file instead of using the defaults available in MACS, which is how we set it before. This changes the background normalization, and hence the peaks will differ.

This was introduced to reduce the number of arguments the user has to fill out, but it could be a source of variation if users provide different genomes for different runs on the same fastq files.

The user should expect different results when they use a different genome, right? There will already be variation between using the canonical form and the complete genome.

The same would apply to re-running the analysis with a different user-supplied genome size: the results would change.

Yes, but if the user sets hs or mm as the value, then it will always be the same … Anyway, in general, using the size of the provided genome is a better way of controlling it. I just need to note somewhere that the results could differ if the user decides to change or reduce the genome.
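To make the point about varying genomes concrete, here is a minimal sketch of how a genome size could be inferred from a FASTA file by summing sequence lengths. This is not the pipeline's actual implementation; the function name `genome_size_from_fasta` and the toy sequences are hypothetical and only serve as illustration.

```python
import io


def genome_size_from_fasta(handle):
    """Sum the lengths of all sequence lines in a FASTA stream.

    Hypothetical helper for illustration; header lines (starting
    with '>') and blank lines are skipped.
    """
    total = 0
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
    return total


# Toy data: a "complete" genome with an extra alternative contig
# versus a "canonical" genome with the main chromosome only.
complete = ">chr1\nACGTACGT\nACGT\n>chr1_alt\nACGTAC\n"
canonical = ">chr1\nACGTACGT\nACGT\n"

complete_size = genome_size_from_fasta(io.StringIO(complete))
canonical_size = genome_size_from_fasta(io.StringIO(canonical))

print(complete_size, canonical_size)
```

With real files one would open the FASTA with `open(path)` instead of `io.StringIO`. The two sizes differ as soon as contigs are added or removed, which is why the same fastq files can yield different peaks: the inferred number is what ends up as the genome size given to the peak caller (for MACS2, the `-g`/`--gsize` option) in place of a fixed shortcut such as `hs` or `mm`.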

alexg9010 commented 6 years ago

Looks like the "effective genome size" is a research topic on its own.

There is a database of genome sizes: http://www.genomesize.com/results.php?page=1

The effective genome size is the mappable proportion of the genome, which depends on the read lengths and the mapping strategy. There doesn't seem to be a good consensus either. This table claims the value for human should be about 2.45 Gb, while MACS2 uses 2.7 Gb: https://www.nature.com/articles/nbt.1518/tables/1

There is also this paper, "Fast Computation and Applications of Genome Mappability": http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377

This is how deeptools deals with the problem: https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html That again applies only to a number of organisms and is not really generic.
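One generic approximation that works for any input genome is to count only the non-N bases; as far as I can tell, the deeptools page linked above lists this as one of its options. The sketch below assumes a plain FASTA input, and the function name `count_bases` and the toy sequences are hypothetical. It gives only an upper bound on the mappable size, because it ignores read length and repeats.

```python
import io


def count_bases(handle):
    """Return (total bases, non-N bases) for a FASTA stream.

    Hypothetical helper for illustration; soft-masked (lowercase)
    bases are counted like uppercase ones.
    """
    total = 0
    non_n = 0
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        non_n += len(line) - line.upper().count("N")
    return total, non_n


# Toy genome with gaps encoded as N/n.
toy = ">chr1\nACGTNNNNACGT\n>chr2\nnnACGT\n"
total, non_n = count_bases(io.StringIO(toy))
fraction = non_n / total

print(total, non_n, round(fraction, 3))
```

A mappability-based estimate, as in the paper above, would additionally need the read length and the aligner settings, so it cannot be derived from the genome file alone.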