cole-trapnell-lab / cicero-release

https://cole-trapnell-lab.github.io/cicero-release/
MIT License
56 stars, 14 forks

Handle the case when a peak overlaps with the promoter of two or more genes #73

Open yushengak47 opened 3 years ago

yushengak47 commented 3 years ago

Hi,

I found that when a peak overlaps the promoters of two or more genes, annotate_cds_by_site with its default settings records only one of them in the 'gene' column of fData(input_cds). As a result, some genes are missing from the gene activity matrix. I tried setting all = T when running annotate_cds_by_site, and this does list multiple gene names in the 'gene' column. However, build_gene_activity_matrix doesn't seem to handle that properly: the generated matrix can be redundant and problematic. For example, it has rows named "HES2,HES2,HES2,HES2", "ESPN,ESPN,HES2", etc.

Any idea for solving the problem?

Thanks

hpliner commented 3 years ago

Hmm, this is a case that would require some modifications to fix. However, I will say that the gene activity scores for two genes with the same promoter peak will be identical, so if you have a list of the sets of genes that share a promoter, you can add in the appropriate rows yourself.

I will leave this open and hopefully find time to work on a solution in the future.
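
For illustration, the workaround described above (copying the row of a gene that is present to the genes that share its promoter peak) could be sketched as follows. This is a minimal, library-agnostic sketch in plain Python, not Cicero code: the function name add_shared_promoter_rows, the dict-of-lists layout, the extra gene GAPDH, and all score values are hypothetical, and it relies on the assumption stated above that genes sharing a promoter peak have identical activity scores.

```python
def add_shared_promoter_rows(activity, shared_sets):
    """Return a copy of `activity` with a row for every gene in `shared_sets`.

    activity:    dict mapping gene name -> list of per-cell activity scores
                 (hypothetical stand-in for a gene activity matrix)
    shared_sets: iterable of gene-name lists; the genes in one list are
                 assumed to share a single promoter peak
    """
    # Copy the existing rows so the input is left untouched.
    expanded = {gene: list(scores) for gene, scores in activity.items()}
    for genes in shared_sets:
        # Find a gene from this set that already has a row to copy from.
        present = [g for g in genes if g in activity]
        if not present:
            continue  # none of these genes has a row, nothing to copy
        source = present[0]
        for gene in genes:
            if gene not in expanded:
                # Assumption: scores are identical for genes sharing the peak.
                expanded[gene] = list(activity[source])
    return expanded


# Toy data with made-up values: ESPN is missing because only HES2 was
# recorded for the shared promoter peak.
activity = {"HES2": [0.2, 0.0, 0.7], "GAPDH": [1.1, 0.9, 1.0]}
shared_sets = [["HES2", "ESPN"]]
expanded = add_shared_promoter_rows(activity, shared_sets)
```

In practice the same idea would be applied to the rows of the matrix returned by build_gene_activity_matrix, using whatever list of shared-promoter gene sets is available.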