Revise parameter grid - Githubissues

cognoma / machine-learning

Machine learning for Project Cognoma

Revise parameter grid #114

Closed rdvelazquez closed 7 years ago

rdvelazquez commented 7 years ago

Builds on #113 and revises the parameter grid in n.mutation-classifier as follows:

l1_ratio: changed from 0.15 to 0
alpha: changed from [10 ** x for x in range(-3, 1)] to [10 ** x for x in range(-10, 10)]
n_components: changed from [50, 100] to a function that selects the number of components based on the number of positive samples in the query (or the number of negatives in the rare case where positives outnumber negatives). The function is shown below:
```
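# Size of the smaller class: normally the positives, but the negatives
# if the query has more positives than negatives.
n_positives = min(y.sum(), len(y) - y.sum())
if n_positives > 500:
    n_components_list = [100]
elif n_positives > 250:
    n_components_list = [50]
else:
    n_components_list = [30]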
```
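
As a rough sketch of how the revised grid could be assembled: the step names pca and classify in the keys are assumptions about the pipeline, select_n_components is a hypothetical wrapper around the snippet above, and y is taken to be a plain sequence of 0/1 labels.

```python
def select_n_components(y):
    """Pick the n_components list from the minority-class count (hypothetical helper)."""
    n_positives = sum(y)
    n_minority = min(n_positives, len(y) - n_positives)
    if n_minority > 500:
        return [100]
    elif n_minority > 250:
        return [50]
    return [30]


# Toy labels for illustration only: 300 positives, 1,000 negatives.
y = [1] * 300 + [0] * 1000

# Key names assume pipeline steps called "pca" and "classify".
param_grid = {
    "pca__n_components": select_n_components(y),
    "classify__l1_ratio": [0],
    "classify__alpha": [10 ** x for x in range(-10, 10)],
}

print(param_grid["pca__n_components"])     # [50]
print(len(param_grid["classify__alpha"]))  # 20
```

With these toy labels the grid holds a single n_components value and 20 alpha values, from 10 ** -10 up to 10 ** 9.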

This PR also adds stratify=y to train_test_split and revises the markdown note about the gene (below cell 3) to be general rather than referencing only TP53.
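
To illustrate what stratifying on y does, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation, and stratified_split, test_size and seed are names made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(y, test_size=0.1, seed=0):
    """Return (train_idx, test_idx), splitting each class separately
    so both parts keep roughly the same class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10% positives (20 of 200 samples).
y = [1] * 20 + [0] * 180
train_idx, test_idx = stratified_split(y, test_size=0.1)

test_positive_rate = sum(y[i] for i in test_idx) / len(test_idx)
print(len(test_idx), test_positive_rate)  # 20 0.1
```

Without stratification, a random 20-sample test set drawn from these labels could easily contain zero or one positive, which makes the test metrics noisy for rare mutations.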

patrick-miller commented 7 years ago

This looks good to me.

I'm not sure if parameterizing n_components by the number of positives as opposed to the % of positives is better. This only really matters if we obtain more data, which would probably lead to other design changes anyway.

rdvelazquez commented 7 years ago

Thanks for reviewing this @patrick-miller!

> I'm not sure if parameterizing n_components by the number of positives as opposed to the % of positives is better.

I think it's only better (or different at all) for queries that don't use all the samples, i.e. queries that are subset by disease. For example:

Query A: 50% positives (100 samples, 50 positives)
Query B: 10% positives (5,000 samples, 500 positives)

I think Query B should use more components than Query A because Query B will likely need more components to capture a similar amount of the variance and Query B will be less prone to over-fitting than Query A. Let me know if that made sense.
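
Applying the thresholds from this PR to the two queries gives the following. This is a quick sketch; n_components_for is a hypothetical helper that takes counts rather than y.

```python
def n_components_for(n_samples, n_positives):
    """Thresholds from this PR, applied to the minority-class count."""
    n_minority = min(n_positives, n_samples - n_positives)
    if n_minority > 500:
        return 100
    elif n_minority > 250:
        return 50
    return 30


# (n_samples, n_positives) for the two example queries.
queries = {"A": (100, 50), "B": (5000, 500)}

results = {}
for name, (n_samples, n_pos) in queries.items():
    results[name] = n_components_for(n_samples, n_pos)
    print(f"Query {name}: {n_pos / n_samples:.0%} positives -> "
          f"{results[name]} components")
```

Query A ends up with 30 components and Query B with 50, even though A has the higher percentage of positives. Query B has exactly 500 positives, so it falls just short of the > 500 tier.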

patrick-miller commented 7 years ago

I'm not positive, but I think you are right.

rdvelazquez commented 7 years ago

> I'm not positive, but I think you are right

Positive... I love a good pun 😃 (I'm terrible I know)

I'll give @dhimmel a chance to look at this if he wants before we merge it.

dhimmel commented 7 years ago

@rdvelazquez or @patrick-miller someone squash merge this!