Closed rdvelazquez closed 7 years ago
This looks good to me.
I'm not sure if parameterizing n_components by the number of positives as opposed to the % of positives is better. This only really matters if we obtain more data, which would probably lead to other design changes as well anyway.
Thanks for reviewing this @patrick-miller!
I'm not sure if parameterizing n_components by the number of positives as opposed to the % of positives is better.
I think it's only better (or different at all) for queries that don't use all the samples, i.e. queries that are subset by disease. For example:
I think Query B should use more components than Query A because Query B will likely need more components to capture a similar amount of the variance and Query B will be less prone to over-fitting than Query A. Let me know if that made sense.
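To make the distinction concrete, here is a rough sketch with made-up numbers (the sample counts and the helper name `percent_positive` are assumptions for illustration, not values from the actual queries). Two queries can share the same % of positives while differing by an order of magnitude in the number of positives, which is the case where the two parameterizations diverge:

```python
# Hypothetical queries: the counts below are invented for illustration only.
queries = {
    "Query A": {"n_samples": 400, "n_positives": 40},     # subset to one disease
    "Query B": {"n_samples": 4000, "n_positives": 400},   # uses all samples
}


def percent_positive(query):
    """Return the share of positive samples as a percentage."""
    return 100 * query["n_positives"] / query["n_samples"]


for name, query in queries.items():
    print(name, query["n_positives"], "positives,",
          percent_positive(query), "% positive")
```

A rule keyed on % of positives would treat both queries identically, whereas a rule keyed on the number of positives can give Query B more components.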
I'm not positive, but I think you are right.
I'm not positive, but I think you are right
Positive... I love a good pun 😃 (I'm terrible I know)
I'll give @dhimmel a chance to look at this if he wants before we merge it.
@rdvelazquez or @patrick-miller someone squash merge this!
Builds on #113 and revises the parameter grid in n.mutation-classifier as follows:
l1_ratio: changed from 0.15 to 0
alpha: changed from [10 ** x for x in range(-3, 1)] to [10 ** x for x in range(-10, 10)]
n_components: changed from [50, 100] to a function that selects the number of components based on the number of positive samples in the query (or the number of negatives if it is a rare instance with more positives than negatives). The function is shown below:
This PR also added stratify=y to the train_test_split call and revised the markdown note about the gene (below cell 3) to be more general as opposed to just referencing TP53.
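For readers without the notebook at hand, the n_components selection described in this PR could look roughly like the sketch below. The function name `select_n_components` and the cut-off values are assumptions made for illustration; they are not the code or the thresholds from this PR.

```python
from collections import Counter

# Hypothetical cut-offs: (minimum size of the smaller class, n_components).
# These numbers are illustrative and are NOT taken from the PR.
CUTOFFS = [(500, 100), (250, 50), (100, 30), (0, 10)]


def select_n_components(y, cutoffs=CUTOFFS):
    """Pick a component count from the size of the smaller class in y.

    y is an iterable of 0/1 labels (1 = positive, 0 = negative).
    """
    counts = Counter(y)
    n_positives = counts.get(1, 0)
    n_negatives = counts.get(0, 0)
    # Positives are usually the minority class; in the rare case where
    # positives outnumber negatives, the negative count is used instead.
    n_minority = min(n_positives, n_negatives)
    for minimum, n_components in cutoffs:
        if n_minority >= minimum:
            return n_components
    # Fallback if no cut-off matched (only reachable with custom cutoffs).
    return cutoffs[-1][1]
```

With these assumed cut-offs, a query with 600 positives and 1000 negatives would get 100 components, while one with 60 positives would get 10.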