Closed adamklie closed 4 days ago
For the evaluations I have been using 3. The idea is that as long as we are capturing more biological signal with increasing K the total number of unique terms would keep increasing. When this value plateaus, we are probably adding redundant and/or noise components.
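A rough sketch of that plateau check, in case it helps. The function name `first_plateau_k`, the `min_gain` threshold, and the counts are all made up for illustration; nothing here comes from the existing evaluation code.

```python
def first_plateau_k(unique_terms_by_k, min_gain=0.05):
    """Return the K after which the relative gain in unique terms falls below min_gain.

    unique_terms_by_k maps K (number of components) to the total number of
    unique enriched terms found at that K. Returns None if no plateau is seen.
    """
    ks = sorted(unique_terms_by_k)
    for prev_k, next_k in zip(ks, ks[1:]):
        prev_n = unique_terms_by_k[prev_k]
        next_n = unique_terms_by_k[next_k]
        # Relative gain from prev_k to next_k; going up from zero terms counts as unbounded gain.
        gain = (next_n - prev_n) / prev_n if prev_n else float("inf")
        if gain < min_gain:
            return prev_k
    return None


# Made-up counts, only to show the shape of the input.
example_counts = {5: 40, 10: 75, 15: 98, 20: 101, 25: 102}
plateau_k = first_plateau_k(example_counts)
```

The 5% threshold is arbitrary; in practice it is probably easier to eyeball the curve on the dashboard than to hard-code a cutoff.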
Makes sense. This is how I've been doing it for the dashboard so far, so I think we are good here.
Been using these functions that @aron0093 originally implemented:
Data
Can think of three ways to count terms:
1. All enriched terms
Count everything, including terms enriched multiple times in the same program (shouldn't happen right?) and terms enriched in multiple programs.
2. Unique within a program, but can be repeated across programs
i.e. if `program1` and `program2` are both enriched for `genesetA`, we count it for both programs
3. Unique across all programs
i.e. if `program1` and `program2` are both enriched for `genesetA`, but `program1` has a much lower adjusted p-value, we only count `genesetA` for `program1`
**Note** that I didn't use `count_unique(..., unique=True)` here because I think it arbitrarily selects which program to bin a term in when it is duplicated across programs, rather than selecting the one it is most enriched for.
Which to use?
I think it depends. Most of the time I think 2 is the right option, since we could easily have redundancy between programs and we want that captured. But maybe we make this something a dashboard user can select?
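For reference, the three counting options can be sketched in plain Python as below. This is only an illustration under assumptions: the `(program, term, adjusted p-value)` row layout, the function names, and the example values are hypothetical and are not the functions @aron0093 implemented.

```python
# Hypothetical enrichment results: (program, term, adjusted p-value).
example_rows = [
    ("program1", "genesetA", 1e-8),
    ("program2", "genesetA", 1e-3),
    ("program1", "genesetB", 1e-4),
    ("program2", "genesetC", 1e-5),
    ("program2", "genesetC", 1e-2),  # same term reported twice in one program
]


def count_all(rows):
    """Option 1: every enriched row counts, duplicates included."""
    return len(rows)


def count_unique_within_program(rows):
    """Option 2: each (program, term) pair counts once."""
    return len({(program, term) for program, term, _ in rows})


def assign_to_best_program(rows):
    """Option 3 helper: map each term to the program with the lowest adjusted p-value."""
    best = {}
    for program, term, padj in rows:
        if term not in best or padj < best[term][1]:
            best[term] = (program, padj)
    return {term: program for term, (program, _) in best.items()}


def count_unique_across_programs(rows):
    """Option 3: each term counts once, no matter how many programs it is enriched in."""
    return len(assign_to_best_program(rows))
```

On the example rows, option 1 gives 5, option 2 gives 4, and option 3 gives 3. The total for option 3 is the same however a duplicated term is assigned; the assignment rule only changes the per-program counts, which is where picking the most enriched program matters.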