claassenlab / pyPsupertime

Scalable reimplementation of psupertime in python
MIT License

Maybe add more checks when splitting the dataset. #1

Open awa121 opened 1 year ago

awa121 commented 1 year ago

I'm using your code, and I think you could add a check that y and y_test contain the same number of classes when splitting the dataset. My dataset has many classes, and in some folds y and y_test end up with different class counts after the split, which causes a bug. Here is the corresponding line for you:

https://github.com/claassenlab/pyPsupertime/blob/237c74f5058d4b59b02f4bf72f29b174b6e397f4/src/pypsupertime/model.py#L359C13-L359C26
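The suggested check could look something like this minimal sketch. It uses plain scikit-learn's `train_test_split` with toy long-tailed labels rather than the pyPsupertime code itself, so names and data here are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy long-tailed labels: class 2 has only 2 samples
y = np.array([0] * 10 + [1] * 8 + [2] * 2)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified split keeps class proportions in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# The suggested check: both splits should cover the same set of classes
assert set(np.unique(y_train)) == set(np.unique(y_test)), \
    "train and test splits contain different classes"
```

Without `stratify=y`, a rare class can easily fall entirely into one side of the split, which is the situation described above.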

awa121 commented 1 year ago


In detail, my data is a long-tailed classification problem; some classes have only 2 samples.

JulianFlesch commented 1 year ago

Thanks for the feedback! Please note that this package is still in a beta phase and subject to frequent changes. (I should probably make a disclaimer about that ...)

In the line you mention, the splitting tries to stratify by y. If you have too few samples to split, this might also interfere with the CV parameter search (pypsupertime/parameter_search.py). There is not much I can do beyond adding an error or warning if the dataset cannot be stratified.
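Such a warning could be sketched as follows. `check_stratifiable` is a hypothetical helper, not part of pyPsupertime; it only illustrates the kind of check described above:

```python
import warnings
import numpy as np

def check_stratifiable(y, n_splits=5):
    # Hypothetical helper: warn if the smallest class has fewer samples
    # than n_splits, since stratified CV then cannot place that class
    # in every fold.
    _, counts = np.unique(y, return_counts=True)
    if counts.min() < n_splits:
        warnings.warn(
            f"Smallest class has {counts.min()} samples, fewer than "
            f"n_splits={n_splits}; stratified splitting will fail."
        )
        return False
    return True
```

A call like `check_stratifiable(y, n_splits=5)` before building the CV splitter would surface the problem early instead of failing inside scikit-learn.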

In the meantime, you could try duplicating the samples from the class with only two samples and use sample weights (sample_weight) to balance the training. You could also pass estimator_params={"early_stopping": False} to avoid splitting off an early-stopping set. If you do so, consider setting max_iter to a lower number to avoid long runtimes; SGD should converge rather quickly.
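The duplicate-and-reweight workaround might look roughly like this. It uses a plain scikit-learn SGDClassifier on toy data rather than the pypsupertime estimator, so treat the data and parameter choices as illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: class 2 is rare (only 2 samples)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.array([0] * 10 + [1] * 8 + [2] * 2)

# Duplicate the rare class's samples ...
rare = y == 2
X_aug = np.vstack([X, X[rare]])
y_aug = np.concatenate([y, y[rare]])

# ... and halve the weight of every rare-class copy, so the class's
# total weight matches its original sample count
w = np.ones(len(y_aug))
w[len(y):] = 0.5            # the appended duplicates
w[np.where(rare)[0]] = 0.5  # the original rare samples

# Disable early stopping so no extra validation split is carved off
clf = SGDClassifier(early_stopping=False, max_iter=500, random_state=0)
clf.fit(X_aug, y_aug, sample_weight=w)
```

With the duplicates present, a later stratified split of (X_aug, y_aug) has enough rare-class samples to place the class on both sides.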

awa121 commented 1 year ago


Thanks for your reply. Could you upload an example .h5ad file? I cannot find "/path/to/data_sce.h5ad" in this project. Thank you so much.

awa121 commented 1 year ago

Thanks for your reply. There seems to be a bug: y_test and y_train are not defined. https://github.com/claassenlab/pyPsupertime/blob/main/src/pypsupertime/plots.py#L104C48-L104C48