broadinstitute / ABC-Enhancer-Gene-Prediction

Cell type specific enhancer-gene predictions using ABC model (Fulco, Nasser et al, Nature Genetics 2019)
MIT License

Difference between EnhancerPredictionsFull_threshold0.02_self_promoter.tsv and EnhancerPredictions_threshold0.02_self_promoter.tsv #223

Open NicoleYY77 opened 4 months ago

NicoleYY77 commented 4 months ago

Hi, I'm having trouble interpreting the prediction output files: EnhancerPredictionsAllPutative.tsv, EnhancerPredictionsFull_threshold0.02_self_promoter.tsv, and EnhancerPredictions_threshold0.02_self_promoter.tsv.

I initially thought EnhancerPredictionsFull_threshold0.02_self_promoter.tsv is generated by selecting the subset of enhancer-gene pairs with ABC.score > 0.02 from EnhancerPredictionsAllPutative.tsv, but the row counts don't match.

I'm also confused about the relationship between EnhancerPredictionsFull_threshold0.02_self_promoter.tsv and EnhancerPredictions_threshold0.02_self_promoter.tsv. They have the same number of rows, so I assumed EnhancerPredictions_threshold0.02_self_promoter.tsv is generated by selecting some core columns from EnhancerPredictionsFull_threshold0.02_self_promoter.tsv, but I found that many ABC.score values in EnhancerPredictionsFull_threshold0.02_self_promoter.tsv are less than 0.02.

I'm not sure whether I'm misinterpreting these files. I'd really appreciate help interpreting them!

atancoder commented 4 months ago

> EnhancerPredictionsFull_threshold0.02_self_promoter.tsv is generated by selecting the subset of enhancer-gene pairs with ABC.score > 0.02 from EnhancerPredictionsAllPutative.tsv

This should be true: https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/blob/main/workflow/scripts/filter_predictions.py#L50. Can you show your results with ABC scores < .02?
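One way to produce the rows asked for here is to read the thresholded file by column name and list every pair whose score falls below the cutoff. The following is a minimal sketch, not part of the ABC pipeline: the function name `rows_below_threshold` is hypothetical, the toy rows are made up for illustration, and only the `ABC.Score` header is taken from the output shown in this thread.

```python
import csv
import io


def rows_below_threshold(handle, score_col="ABC.Score", threshold=0.02):
    """Return the data rows whose score is strictly below `threshold`.

    `handle` is any open text file (or file-like object) holding a
    tab-separated table with a header line. Assumes every row has a
    numeric value in `score_col`.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[score_col]) < threshold]


# Toy table (made-up values, for illustration only).
TOY_TSV = (
    "name\tTargetGene\tABC.Score\n"
    "e1\tGENE_A\t0.050\n"
    "e2\tGENE_B\t0.015\n"
    "e3\tGENE_C\t0.020\n"
)

low = rows_below_threshold(io.StringIO(TOY_TSV))
for row in low:
    print(row["name"], row["TargetGene"], row["ABC.Score"])
```

On a real file, the same function could be called as `rows_below_threshold(open(path, newline=""))`, where `path` points at the thresholded TSV.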

NicoleYY77 commented 4 months ago

Thank you for the reply! Based on this script, it seems that we select the subset with ABC.score > 0.02 that also does not belong to the promoter class, unless it is a self-promoter? My EnhancerPredictions_threshold0.02_self_promoter.tsv has 105,037 rows and looks like this:

chr   start   end     name                             TargetGene  TargetGeneTSS  CellType        ABC.Score
chr1  713881  714381  intergenic|chr1:713881-714381    ATAD3A      1447522        K562_hg19_0501  0.050366
chr1  713881  714381  intergenic|chr1:713881-714381    RNF223      1009687        K562_hg19_0501  0.021288
chr1  713881  714381  intergenic|chr1:713881-714381    PERM1       917497         K562_hg19_0501  0.032612
chr1  713881  714381  intergenic|chr1:713881-714381    PLEKHN1     901876         K562_hg19_0501  0.038985
chr1  713881  714381  intergenic|chr1:713881-714381    KLHL17      895966         K562_hg19_0501  0.038383
chr1  713881  714381  intergenic|chr1:713881-714381    SAMD11      861120         K562_hg19_0501  0.064575
chr1  713881  714381  intergenic|chr1:713881-714381    FAM87B      752750         K562_hg19_0501  0.117851
chr1  752446  753000  promoter|chr1:752446-753000      FAM87B      752750         K562_hg19_0501  1.000000
chr1  762648  763363  promoter|chr1:762648-763363      LINC01128   762970         K562_hg19_0501  1.000000
chr1  762648  763363  promoter|chr1:762648-763363      LINC00115   762902         K562_hg19_0501  1.000000

But if I run

awk '$25 > 0.02 {print$0}' EnhancerPredictionsAllPutative.tsv|wc -l

I get 116,067, and if I run

awk '($5!="promoter"||$17=="True")&&$25> 0.02 {print$0}' EnhancerPredictionsAllPutative.tsv|wc -l

I get 69,537.
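The second awk condition can also be written against column names instead of field positions, which avoids two pitfalls of the one-liners: awk splits on runs of whitespace by default, so an empty field shifts every later column, and the header line is tested like any other row. The sketch below only mirrors the condition described in this comment; it is not the logic of filter_predictions.py. The function name `count_kept` is hypothetical, the column names `class` and `isSelfPromoter` are assumptions about the all-putative file's header, and the toy rows are made up.

```python
import csv
import io


def count_kept(handle, threshold=0.02, score_col="ABC.Score",
               class_col="class", self_col="isSelfPromoter"):
    """Count rows with score > threshold that are not promoters,
    unless the promoter is the target gene's own (self) promoter.

    Column names are assumptions; adjust them to the file's header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    kept = 0
    for row in reader:
        above = float(row[score_col]) > threshold
        is_promoter = row[class_col] == "promoter"
        is_self = row[self_col] == "True"
        if above and (not is_promoter or is_self):
            kept += 1
    return kept


# Toy table (made-up values, for illustration only).
TOY_TSV = (
    "name\tclass\tisSelfPromoter\tABC.Score\n"
    "e1\tintergenic\tFalse\t0.05\n"   # kept: above threshold, not a promoter
    "e2\tintergenic\tFalse\t0.01\n"   # dropped: below threshold
    "e3\tpromoter\tTrue\t1.0\n"       # kept: self promoter
    "e4\tpromoter\tFalse\t0.3\n"      # dropped: promoter of another gene
    "e5\tgenic\tFalse\t0.02\n"        # dropped: not strictly above threshold
)

print(count_kept(io.StringIO(TOY_TSV)))
```

Because `csv.DictReader` consumes the header as field names, the count covers data rows only, so it may differ slightly from a `wc -l` total that includes the header.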

atancoder commented 4 months ago

Yes, the filtered file also removes self promoters. Take a look at that script as it outlines exactly how it goes from the all putative file to the thresholded file.