BIMSBbioinfo / pigx_chipseq

Pipeline for Analysis of ChIP-Seq data
http://bioinformatics.mdc-berlin.de/pigx/
GNU General Public License v3.0
11 stars 10 forks source link

Overlap might fail if seqlevels do not match #37

Closed alexg9010 closed 5 years ago

alexg9010 commented 6 years ago

https://github.com/BIMSBbioinfo/pigx_chipseq/blob/407ac9b8ef4e69e6c7483eea57f2bfb6103c7a4e/scripts/Functions_Helper.R#L123

alexg9010 commented 6 years ago
> h3k9me3_overlaps <- queryGff(queryRegions = h3k9me3_peaks, gffData = annotations$gtf)
Warnmeldung:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

> seqinfo(annotations$gtf)
Seqinfo object with 7 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  I                NA         NA   <NA>
  II               NA         NA   <NA>
  III              NA         NA   <NA>
  IV               NA         NA   <NA>
  V                NA         NA   <NA>
  X                NA         NA   <NA>
  MtDNA            NA         NA   <NA>
> seqinfo(h3k27me3_peaks)
Seqinfo object with 7 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chrI             NA         NA   <NA>
  chrII            NA         NA   <NA>
  chrIII           NA         NA   <NA>
  chrIV            NA         NA   <NA>
  chrV             NA         NA   <NA>
  chrX             NA         NA   <NA>
  chrM             NA         NA   <NA>

worked after setting same seqlevels

> seqlevels(annotations$gtf) <- seqlevels(h3k27me3_peaks)
> h3k27me3_overlaps <- queryGff(queryRegions = h3k27me3_peaks, gffData = annotations$gtf)
> h3k27me3_overlaps
GRanges object with 56566 ranges and 13 metadata columns:
frenkiboy commented 6 years ago

Check #14 - would be great if you could do it!!

Blosberg commented 6 years ago

From the BS pipeline, I have the following rule:

rule tabulate_seqlengths:
    input:
        rules.bismark_genome_preparation.output
    output:
        seqlengths = DIR_mapped+"Refgen_"+ASSEMBLY+"_chromlengths.csv",
    params:
        chromlines = " | grep Sequence ",
        chromcols  = " | cut -f2,3     ",
        seqnames   = " | sed \"s/_CT_converted//g\" "
    message: fmt("Tabulating chromosome lengths in genome: {ASSEMBLY} for later reference.")
    shell:
        nice('bowtie2-inspect', ['-s ' + GENOMEPATH + "Bisulfite_Genome/CT_conversion/BS_CT", '{params.chromlines}', '{params.chromcols}', '{params.seqnames}', ' > {output}'])

It creates a 2-column text file with seqlevels and chromosome lengths. Once complication here is that this uses the CT converted version of the genome (but that's only a trick to make sure that the conversion is performed, there's no reason we can't use the original genome instead)

e.g. for the mm10 genome the output looks like this:

$ more Refgen_mm10_canon_chromlengths.csv
chr1    195471971
chr2    182113224
chr3    160039680
chr4    156508116
...
chrX    171031299
chrY    91744698

and then I pass this file as chrom_seqlengths into the Rmd rule like this:

seqdat_temp    = read.csv(paste0(out_dir,"/",chrom_seqlengths),
                          sep="\t", header=FALSE, stringsAsFactors=FALSE)
chr.len        = seqdat_temp[,2]
names(chr.len) = seqdat_temp[,1]

myseqinfo = Seqinfo(names(chr.len), seqlengths=chr.len, genome=assembly)
myseqinfo.st = keepStandardChromosomes(myseqinfo)

I'm struggling a bit to figure out exactly where in the code this should go, but the basic idea should work If we just add the above file as an input to whatever is using the seqlengths. Does that work?

alexg9010 commented 6 years ago

pigx-chipseq bug fixes needed before F1000

alexg9010 commented 6 years ago

pigx-chipseq: check for matching sequence names in genome and annotation

alexg9010 commented 6 years ago

Overlap might fail if seqlevels do not match

alexg9010 commented 5 years ago

should be solved with c5225023258e99776ef34b60665df8c633e447a0