BIMSBbioinfo / genomation

R package for genomic feature analysis and visualization
http://bioinformatics.mdc-berlin.de/genomation/
73 stars 22 forks source link

AnnotationByFeature - get non-overlapping "pure" features // Granges #205

Closed bri2020 closed 2 years ago

bri2020 commented 2 years ago

Dear Altuna Akalin, Dear others,

I would like to know if you have an idea or strategy, how to proceed with the information from a AnnotationByFeature -object generated by annotateWithGeneParts or annotateWithFeatureFlank method. E.g. the aim is, that I would like to extract the methylation status for the specific features.

How could I extract the information for promotor, exon, intron into coordinates or Granges? Is there a way to generate something? I saw in another issue that you mentioned https://github.com/BIMSBbioinfo/genomation/issues/186 So I basically I could do XYZ_exons=subsetByOverlaps(XYZ_gr,gene.parts$exon

However I like the idea of priority of promoter > exons > introns and - this membership data frame could be interesting to extract only "pure" features. How can I keep the priority for promoter > exons > introns when subsetting the data with subsetByOverlaps and assigning unique features.

> getTargetAnnotationStats(XYZ_annot, percentage =FALSE,precedence = TRUE)
  promoter       exon     intron intergenic 
      3072       4253       7613      17987 

.... I would like to get the same numbers for

YZ__gr_exons =subsetByOverlaps(unite_meth_grl,gene.parts$exons)
YZ__gr_promoters =subsetByOverlaps(unite_meth_grl,gene.parts$promoters)
YZ_r_introns =subsetByOverlaps(unite_meth_grl,gene.parts$introns)
YZ__gr_tss =subsetByOverlaps(unite_meth_grl,gene.parts$TSSes)

> length(YZ__gr_exons) #4514
[1] 4514
> length(YZ__gr_promoters) #3072
[1] 3072
> length(YZ_gr_introns)#7923
[1] 7923
> length(YZ__gr_tss)#0
[1] 0

... but thats because there are several hits in different features of course.

Questions

  1. Are you aware of any package to exclude overlapping granges? E.g. Setdiff?

a. from YZ__gr_exons, I would remove all promotor positions b. from YZ_gr_introns, I would remove all exon and promotor position

Correct?

  1. How to you calculate intergenic in your package? Is there a way to extract this as Granges?

I guess you had only count the ones which, havent gotten any assignment in your code, correct? Here: intergenic = 100*sum(rowSums(memb)==0)/nrow(memb) )

That means I could also generate a granges of what is not YZ__gr_exons AND YZ__gr_promoters AND YZ_gr_introns.

OR another idea:

  1. what is the order in an object (annotateWithFeatureFlank and/or annotateWithGeneParts) - does it has the order as the target?

a) It would be already helpful to know the order - so than I can extract the methylation matrix and just add/merge the membership matrix to it write a little loop to assign only one feature. like this df = cbind(getAssociationWithTSS(methyl_filtered_annot), as.data.frame(getMembers(methyl_filtered_annot))) and add that to a Methylkit-object.

Can you help me reading you code about the order? Am I at the right position? Is it in both parts the order of the target?

  1. geneparts: https://github.com/BIMSBbioinfo/genomation/blob/c049457c374ab2a37502232cafc8ad35ebf10cbf/R/readAnnotate.R#L379

    memb = data.frame(matrix(rep(0,length(gr)*3),ncol=3) )
    colnames(memb)=c("prom","exon","intron")
    memb[countOverlaps(gr,prom) > 0,1] = 1
    memb[countOverlaps(gr,exon) > 0,2] = 1
    memb[countOverlaps(gr,intron) > 0,3] = 1
  2. annotateWithFeatureFlank: https://github.com/BIMSBbioinfo/genomation/blob/c049457c374ab2a37502232cafc8ad35ebf10cbf/R/readAnnotate.R#L641

 memb[countOverlaps(target,feature)>0,1]=1

b) can I extract the position (chromosome and START) for this membership matrix?

I am sorry for my loud thinking ;D, but maybe you can enlighten me a bit more. Kind regards, Thanks a lot, Britta Meyer

bri2020 commented 2 years ago

I think I answered myself most of the questions meanwhile: I checked and it seems that in both cases "annotateWithFeatureFlank and/or annotateWithGeneParts" the overlap matrix is calculated with "countOverlaps" and thus the order of the query object is kept. I also easily manage to extract the positions knowing the order of counts/overlaps. However I would be happy for any input/doubts coming up at a later stage. Thanks Britta